Why is DataFrame.loc[[1]] 1,800x slower than df.ix[[1]] and 3,500x slower than df.loc[1]?

Question

12

Try it yourself:

    import pandas as pd
    s = pd.Series(xrange(5000000))
    %timeit s.loc[[0]]  # You need pandas 0.15.1 or newer for it to be that slow
    1 loops, best of 3: 445 ms per loop

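Update: this is a legitimate bug in pandas, probably introduced in version 0.15.1 (November 2014). Workarounds: wait for a new release while using an older version of pandas; get the latest development version from GitHub; manually change one line in your installed copy of pandas; or temporarily use .ix instead of .loc.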

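I have a DataFrame with 4.8 million rows, and selecting a single row with .loc[[id]] (a one-element list) takes 489 ms, almost half a second. That is 1,800 times slower than the equivalent .ix[[id]] and 3,500 times slower than .loc[id] (passing id as a value rather than as a list). To be fair, .loc[list] takes about the same time regardless of the length of the list, but I don't want to spend 489 ms on it, especially when .ix is a thousand times faster and produces the same result. My understanding was that .ix is supposed to be the slower one, isn't it?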
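For anyone repeating the comparison on a current install, the measurement can be sketched roughly as below. This is a sketch under assumptions, not the original benchmark: it uses Python 3 and the standard timeit module instead of %timeit, a smaller Series than the 4.8-million-row DataFrame, and it leaves out .ix, which was removed in pandas 1.0. The helper name time_lookup is made up for illustration, and the timings will vary with the pandas version and the machine.

```python
import timeit

import pandas as pd

# Hypothetical stand-in for the real data: a Series with a default integer index.
s = pd.Series(range(1000000))


def time_lookup(stmt, number=20):
    """Return the best per-call time in seconds for evaluating `stmt`."""
    runs = timeit.repeat(stmt, globals={"s": s}, repeat=3, number=number)
    return min(runs) / number


scalar_time = time_lookup("s.loc[0]")    # label passed as a value
list_time = time_lookup("s.loc[[0]]")    # same label in a one-element list

print("s.loc[0]:   %.1f microseconds" % (scalar_time * 1e6))
print("s.loc[[0]]: %.1f microseconds" % (list_time * 1e6))
```

Taking the minimum over several repeats mirrors the "best of 3" figure that %timeit reports in the session shown in this question.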

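I'm using pandas 0.15.1. The excellent tutorial on Indexing and Selecting Data suggests that .ix is somewhat more general than .loc and .iloc, and presumably slower. Specifically, it says: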

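"However, when an axis is integer based, only label-based access and not positional access is supported. Thus, in such cases, it is usually better to be explicit and use .iloc or .loc."

Here is an IPython session with the benchmarks: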

    print 'The dataframe has %d entries, indexed by integers that are less than %d' % (len(df), max(df.index)+1)
    print 'df.index begins with ', df.index[:20]
    print 'The index is sorted:', df.index.tolist()==sorted(df.index.tolist())

    # First extract one element directly. Expected result, no issues here.
    id=5965356
    print 'Extract one element with id %d' % id
    %timeit df.loc[id]
    %timeit df.ix[id]
    print hash(str(df.loc[id])) == hash(str(df.ix[id])) # check we get the same result

    # Now extract this one element as a list.
    %timeit df.loc[[id]] # SO SLOW. 489 ms vs 270 microseconds for .ix, or 139 microseconds for .loc[id]
    %timeit df.ix[[id]] 
    print hash(str(df.loc[[id]])) == hash(str(df.ix[[id]]))  # this one should be True
    # Let's double-check that in this case .ix is the same as .loc, not .iloc, 
    # as this would explain the difference.
    try:
        print hash(str(df.iloc[[id]])) == hash(str(df.ix[[id]]))
    except:
        print 'Indeed, %d is not even a valid iloc[] value, as there are only %d rows' % (id, len(df))

    # Finally, for the sake of completeness, let's take a look at iloc
    %timeit df.iloc[3456789]    # this is still 100+ times faster than the next version
    %timeit df.iloc[[3456789]]

Output:

The dataframe has 4826616 entries, indexed by integers that are less than 6177817
df.index begins with  Int64Index([1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20], dtype='int64')
The index is sorted: True
Extract one element with id 5965356
10000 loops, best of 3: 139 µs per loop
10000 loops, best of 3: 141 µs per loop
True
1 loops, best of 3: 489 ms per loop
1000 loops, best of 3: 270 µs per loop
True
Indeed, 5965356 is not even a valid iloc[] value, as there are only 4826616 rows
10000 loops, best of 3: 98.9 µs per loop
100 loops, best of 3: 12 ms per loop

- Sergey Orshanskiy
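The try/except check in the session above relies on .loc looking rows up by label while .iloc looks them up by position. A minimal sketch of that difference follows, using a hypothetical three-row frame rather than the asker's data and assuming a recent pandas, where an out-of-range position in a list passed to .iloc raises IndexError.

```python
import pandas as pd

# Hypothetical toy frame: integer labels 1..3 stored at positions 0..2.
df = pd.DataFrame({"value": [10, 20, 30]}, index=[1, 2, 3])

by_label = df.loc[[3]]      # label 3 exists, so this returns the last row

try:
    df.iloc[[3]]            # position 3 does not exist in a 3-row frame
    iloc_failed = False
except IndexError:
    iloc_failed = True

print(by_label)
print("iloc[[3]] raised IndexError:", iloc_failed)
```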

Note that using [[id]] and [id] is not equivalent. [id] returns a Series, but [[id]] returns a one-row DataFrame. - BrenBarn

@BrenBarn, yes, that explains the difference for .ix: 141 µs vs. 270 µs. But why is .loc[[id]] so slow? - Sergey Orshanskiy
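The return-type difference raised in these comments can be checked directly. The frame below is a hypothetical toy example, not the data from the question.

```python
import pandas as pd

# Hypothetical toy frame with integer labels that differ from positions.
df = pd.DataFrame({"a": [10, 20, 30], "b": [1.5, 2.5, 3.5]}, index=[1, 2, 3])

row_as_series = df.loc[2]      # scalar label: a Series indexed by column names
row_as_frame = df.loc[[2]]     # one-element list: a DataFrame with one row

print(type(row_as_series).__name__, row_as_series.shape)
print(type(row_as_frame).__name__, row_as_frame.shape)
```

Building a one-row DataFrame does more work than building a Series, which is consistent with the roughly 2x gap seen for .ix, but it does not account for a slowdown of three orders of magnitude.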

2 Answers

9

Pandas indexing is very slow, so I switched to NumPy indexing:

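    import numpy as np
    import pandas as pd

    df = pd.DataFrame(some_content)  # some_content is a placeholder for your data
    # takes forever!!
    for iPer in np.arange(-df.shape[0], 0, 1):
        x = df.iloc[iPer, :].values
        y = df.iloc[-1, :].values
    # fast!
    vals = np.matrix(df.values)  # convert once, then index the matrix directly
    for iPer in np.arange(-vals.shape[0], 0, 1):
        x = vals[iPer, :]
        y = vals[-1, :]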

- citynorman
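A self-contained variant of the same idea, as a sketch only: it assumes pandas 0.24 or newer for DataFrame.to_numpy() and uses a small made-up frame in place of some_content. It computes per-row sums both ways to show that the two access paths give the same values.

```python
import numpy as np
import pandas as pd

# Hypothetical small frame standing in for `some_content`.
df = pd.DataFrame(np.arange(12, dtype=float).reshape(4, 3), columns=["a", "b", "c"])

# Convert once, outside the loop; each row access is then plain NumPy indexing.
vals = df.to_numpy()
row_sums_numpy = [float(vals[i, :].sum()) for i in range(vals.shape[0])]

# The same computation through .iloc, which builds a Series on every access.
row_sums_iloc = [float(df.iloc[i, :].to_numpy().sum()) for i in range(df.shape[0])]

print(row_sums_numpy)
print(row_sums_numpy == row_sums_iloc)
```

The conversion only pays off when the frame has a single dtype; with mixed column types the resulting array falls back to dtype object and much of the speed advantage is lost.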

- Sergey Orshanskiy · Accepted Answer

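The issue does not appear to be present in pandas 0.14. I profiled it with line_profiler, and I think I know what is going on. Since pandas 0.15.1, a KeyError is raised if the given index is not present. It looks like, when you use the .loc[list] syntax, pandas does an exhaustive search for the index along the entire axis, even after it has been found. That is, first, there is no early termination once an element is found, and second, the search in this case is brute force.

File: .../anaconda/lib/python2.7/site-packages/pandas/core/indexing.py,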

  1278                                                       # require at least 1 element in the index
  1279         1          241    241.0      0.1              idx = _ensure_index(key)
  1280         1       391040 391040.0     99.9              if len(idx) and not idx.isin(ax).any():
  1281                                           
  1282                                                           raise KeyError("None of [%s] are in the [%s]" %