Pandas: Speeding up df.loc with repeated index values

Question

6

I have a pandas DataFrame:

import pandas as pd
import numpy as np

df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'y': [1, 2, 2],
    'z': ['f', 's', 's']
}).set_index('x')

I want to select rows based on the values of the index (x) listed in a selection array.

selection = ['a', 'c', 'b', 'b', 'c', 'a']

The correct output can be obtained with df.loc as follows.

out = df.loc[selection]

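The problem is that df.loc runs slowly on large DataFrames (2-7 million rows). Is there a way to speed this operation up? I have tried eval(), but it does not seem to apply to hard-coded lists of index values like this one. I have also considered pd.DataFrame.isin, but that misses the repeated values (it returns only one row per unique element of the selection).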

- philE

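You could remove the duplicates from the selection. - postelrich

@riotburn, the duplicates are necessary for this application. - philE

Are the duplicates necessary if you are just selecting rows whose index is in the selection list? Couldn't you use df.loc[list(set(selection))]? - postelrich

No, the repetition is intentional. 'out' is the desired output. - philE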

2 Answers

3

You could try using merge:

df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'y': [1, 2, 2],
    'z': ['f', 's', 's']
})

df1 = pd.DataFrame({'x':selection})

In [21]: pd.merge(df1,df,on='x', how='left')
Out[21]: 
   x  y  z
0  a  1  f
1  c  2  s
2  b  2  s
3  b  2  s
4  c  2  s
5  a  1  f
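The merge above works on a frame where x is an ordinary column. As a minimal sketch, assuming the DataFrame is still indexed by x as in the question, the same idea can be applied by resetting the index before merging and restoring it afterwards. The names lookup and merged are illustrative only and do not appear in the original answer.

```python
import pandas as pd

# DataFrame from the question, indexed by 'x'
df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'y': [1, 2, 2],
    'z': ['f', 's', 's']
}).set_index('x')

selection = ['a', 'c', 'b', 'b', 'c', 'a']

# Hypothetical helper frame holding the requested labels in order
lookup = pd.DataFrame({'x': selection})

# A left merge keeps the row order of the left frame (the selection);
# reset_index() turns the index 'x' back into a column to merge on.
merged = pd.merge(lookup, df.reset_index(), on='x', how='left').set_index('x')
```

Whether this beats df.loc on a large frame is not something the thread measures, so it is worth timing on real data.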

- Colonel Beauvel

- Alex Riley · Accepted Answer

Using reindex instead of loc gives a decent speedup:

df.reindex(selection)
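A small sketch of how the two calls compare on the example data, assuming the index is unique (reindex refuses to run on an index with duplicate labels). It also shows one behavioural difference to keep in mind: reindex fills labels that are not in the index with NaN rows rather than raising an error. The label 'zz' is made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'y': [1, 2, 2],
    'z': ['f', 's', 's']
}).set_index('x')

selection = ['a', 'c', 'b', 'b', 'c', 'a']

out_loc = df.loc[selection]
out_reindex = df.reindex(selection)

# When every label exists, both calls return the same rows in the same order.
same = out_loc.equals(out_reindex)

# A label missing from the index ('zz' is hypothetical) yields a row of NaN.
with_missing = df.reindex(['a', 'zz'])
missing_mask = with_missing['y'].isna().tolist()
```

If silent NaN rows would be a problem, the selection can be checked against df.index before calling reindex.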

Timings (version 0.17.0):

>>> selection2 = selection * 100 # a larger list of labels
>>> %timeit df.loc[selection2]
100 loops, best of 3: 2.54 ms per loop

>>> %timeit df.reindex(selection2)
1000 loops, best of 3: 833 µs per loop

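The two methods take different code paths, which accounts for the difference in speed.

loc calls get_indexer_non_unique to build the new DataFrame, a method that is necessarily more complex than get_indexer, the simpler method that handles unique values only.

On the other hand, the heavy lifting in reindex appears to be done by the take_* functions in generated.pyx. These functions appear to be better suited to constructing the new DataFrame, and so are faster.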
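To make the get_indexer and take steps concrete, here is a rough sketch that performs them by hand through the public API. It is not a copy of what reindex does internally, and it assumes a unique index; the variable name positions is illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    'x': ['a', 'b', 'c'],
    'y': [1, 2, 2],
    'z': ['f', 's', 's']
}).set_index('x')

selection = ['a', 'c', 'b', 'b', 'c', 'a']

# Index.get_indexer maps each label to its integer position in the index
# (-1 for labels that are absent). It requires a unique index.
positions = df.index.get_indexer(selection)

# Unlike reindex, fail loudly here instead of producing NaN rows.
if (positions == -1).any():
    raise KeyError("some labels in selection are not in the index")

# DataFrame.take gathers rows by integer position, repeats included.
out = df.take(positions)
```

Splitting the lookup from the gather also allows positions to be computed once and reused if the same selection is applied to several frames that share an index.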