Try it yourself:
import pandas as pd
s=pd.Series(xrange(5000000))
%timeit s.loc[[0]] # You need pandas 0.15.1 or newer for it to be that slow
1 loops, best of 3: 445 ms per loop
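For anyone without IPython, or on Python 3 where xrange no longer exists, the same comparison can be sketched with the standard timeit module. This is only a rough equivalent of the snippet above: the helper name time_lookups is made up for illustration, and on a pandas release where the bug is fixed the two timings should be much closer than the numbers quoted here.

```python
import timeit

import pandas as pd


def time_lookups(n=5000000, repeats=3):
    """Return best-of-`repeats` seconds for s.loc[0] and s.loc[[0]].

    `time_lookups` is a hypothetical helper, not part of pandas.
    """
    s = pd.Series(range(n))
    # Scalar label: returns a single value.
    scalar = min(timeit.repeat(lambda: s.loc[0], number=1, repeat=repeats))
    # One-element list: returns a one-element Series (the slow path in 0.15.1).
    as_list = min(timeit.repeat(lambda: s.loc[[0]], number=1, repeat=repeats))
    return scalar, as_list


if __name__ == "__main__":
    scalar, as_list = time_lookups()
    print("s.loc[0]:   %.6f s" % scalar)
    print("s.loc[[0]]: %.6f s" % as_list)
```

The absolute numbers depend on the machine and the pandas version, so only the ratio between the two lines is worth comparing.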
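Update: this is a legitimate bug in pandas, probably introduced in version 0.15.1 around August 2014. Workarounds: wait for a new release while using an older version of pandas; get the latest development version from GitHub; patch one line in your own pandas installation by hand; or temporarily use .ix instead of .loc.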
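I have a DataFrame with 4.8 million rows, and selecting a single row with .loc[[id]] (a single-element list) takes 489 ms, almost half a second. That is 1,800 times slower than the equivalent .ix[[id]] and 3,500 times slower than .loc[id] (passing id as a value rather than as a list). To be fair, .loc[list] takes roughly the same time regardless of the length of the list, but I don't want to spend 489 ms on it, especially when .ix is a thousand times faster and produces the same result. My understanding was that .ix is supposed to be slower, isn't it? I am using pandas 0.15.1.

The excellent tutorial on Indexing and Selecting Data suggests that .ix is somewhat more general than .loc and .iloc, and presumably slower. Specifically, it says: "However, when an axis is integer based, only label based access and not positional access is supported. Thus, in such cases, it's usually better to be explicit and use .iloc or .loc."

Here is an IPython session with the benchmarks: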
print 'The dataframe has %d entries, indexed by integers that are less than %d' % (len(df), max(df.index)+1)
print 'df.index begins with ', df.index[:20]
print 'The index is sorted:', df.index.tolist()==sorted(df.index.tolist())
# First extract one element directly. Expected result, no issues here.
id=5965356
print 'Extract one element with id %d' % id
%timeit df.loc[id]
%timeit df.ix[id]
print hash(str(df.loc[id])) == hash(str(df.ix[id])) # check we get the same result
# Now extract this one element as a list.
%timeit df.loc[[id]] # SO SLOW. 489 ms vs 270 microseconds for .ix, or 139 microseconds for .loc[id]
%timeit df.ix[[id]]
print hash(str(df.loc[[id]])) == hash(str(df.ix[[id]])) # this one should be True
# Let's double-check that in this case .ix is the same as .loc, not .iloc,
# as this would explain the difference.
try:
print hash(str(df.iloc[[id]])) == hash(str(df.ix[[id]]))
except:
print 'Indeed, %d is not even a valid iloc[] value, as there are only %d rows' % (id, len(df))
# Finally, for the sake of completeness, let's take a look at iloc
%timeit df.iloc[3456789] # this is still 100+ times faster than the next version
%timeit df.iloc[[3456789]]
Output:
The dataframe has 4826616 entries, indexed by integers that are less than 6177817
df.index begins with Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype='int64')
The index is sorted: True
Extract one element with id 5965356
10000 loops, best of 3: 139 µs per loop
10000 loops, best of 3: 141 µs per loop
True
1 loops, best of 3: 489 ms per loop
1000 loops, best of 3: 270 µs per loop
True
Indeed, 5965356 is not even a valid iloc[] value, as there are only 4826616 rows
10000 loops, best of 3: 98.9 µs per loop
100 loops, best of 3: 12 ms per loop
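If a one-row DataFrame is what you need and the index is unique, one more way to avoid passing a list to the indexer is to look up the row's position and take a one-row positional slice. This is not one of the workarounds listed in the update, and I have not timed it against 0.15.1, so treat it as a sketch only. The helper name one_row_frame is made up for illustration.

```python
import numbers

import pandas as pd


def one_row_frame(df, label):
    """Return the row labelled `label` as a one-row DataFrame.

    Hypothetical helper; assumes df.index is unique, so that
    Index.get_loc returns a single integer position.
    """
    pos = df.index.get_loc(label)
    if not isinstance(pos, numbers.Integral):
        # Non-unique indexes make get_loc return a slice or a boolean mask.
        raise ValueError("index is not unique for label %r" % (label,))
    # A positional slice keeps the result a DataFrame, as a one-element list would.
    return df.iloc[pos:pos + 1]


df = pd.DataFrame({"value": [10.0, 20.0, 30.0]}, index=[1, 5, 9])
row = one_row_frame(df, 5)
print(row)
```

On this toy frame the result matches df.loc[[5]]; whether it is actually faster on an affected pandas version would need to be measured.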
[[id]] and [id] are not equivalent. [id] will return a Series, but [[id]] will return a one-row DataFrame. - BrenBarn
That explains the difference for .ix: 141 µs vs. 270 µs. But why is .loc[[id]] so slow? - Sergey Orshanskiy
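The difference in return type mentioned in the first comment is easy to check on a toy frame. A minimal sketch; the column names and index values are made up for illustration:

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]}, index=[10, 20, 30])

as_scalar = df.loc[20]    # label passed as a value
as_list = df.loc[[20]]    # label passed as a one-element list

print(type(as_scalar).__name__)  # Series
print(type(as_list).__name__)    # DataFrame
print(as_list.shape)             # (1, 2)
```

So the two calls do different amounts of work, which is consistent with a modest gap like the one seen for .ix, but on its own it does not account for a slowdown of three orders of magnitude.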