How can I speed up the rank function on a Pandas Series?

Question

I want to compute a rolling rank over a series of data.

Suppose I have a Pandas Series:

In [18]: s = pd.Series(np.random.rand(10))

In [19]: s
Out[19]: 
0    0.340396
1    0.664459
2    0.647212
3    0.529363
4    0.535349
5    0.781628
6    0.313549
7    0.933539
8    0.618337
9    0.013442
dtype: float64

I can use the built-in pandas rank function like this:

In [20]: s.rolling(4).apply(lambda x: pd.Series(x).rank().iloc[-1])
<ipython-input-20-41df4deb36f8>:1: FutureWarning: Currently, 'apply' passes the values as ndarrays to the applied function. In the future, this will change to passing it as Series objects. You need to specify 'raw=True' to keep the current behaviour, and you can pass 'raw=False' to silence this warning
  s.rolling(4).apply(lambda x: pd.Series(x).rank().iloc[-1])
Out[20]: 
0    NaN
1    NaN
2    NaN
3    2.0
4    2.0
5    4.0
6    1.0
7    4.0
8    2.0
9    1.0
dtype: float64

This works, but it is slow. Here is a timing test:

In [24]: %timeit pd.Series(np.random.rand(100000)).rolling(100).apply(lambda x: pd.Series(x).rank().iloc[-1])
<magic-timeit>:1: FutureWarning: Currently, 'apply' passes the values as ndarrays to the applied function. In the future, this will change to passing it as Series objects. You need to specify 'raw=True' to keep the current behaviour, and you can pass 'raw=False' to silence this warning
22.5 s ± 292 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

Is there a good way to speed this up? I suspect the rolling loop is where there is room for improvement. Thanks.

- xyhuang

1 Answer

- David M. · Accepted Answer

Using scipy/numpy is faster (this requires a recent version of NumPy, because sliding_window_view was added in NumPy 1.20):

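import pandas as pd
import numpy as np
from time import time
from scipy.stats import rankdata
from numpy.lib.stride_tricks import sliding_window_view

# No argument: reseeds from OS entropy. Pass an integer for reproducible data.
np.random.seed()
array = np.random.rand(100000)

# Baseline: pandas rolling apply. raw=False hands each window to the
# lambda as a Series, so .rank() and .iloc are available.
t0 = time()
ranks = pd.Series(array).rolling(100).apply(lambda x: x.rank().iloc[-1], raw=False)
t1 = time()
print(f'With pandas: {t1-t0} sec.')

# scipy/numpy: rank each window and keep the rank of its last element.
# The result is a plain list with one entry per full window, so it has
# no leading NaN values, unlike the pandas result.
t0 = time()
ranks = [rankdata(x)[-1] for x in sliding_window_view(array, window_shape=100)]
t1 = time()
print(f'With numpy: {t1-t0} sec.')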

Output:

With pandas: 11.682222127914429 sec.
With numpy: 3.9317219257354736 sec.
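
The list comprehension above still calls rankdata once per window. Since only the rank of the last element in each window is needed, that rank can be computed directly as a count: the number of smaller values in the window, plus half of (number of equal values + 1), which reproduces the default "average" tie handling. The following is a minimal sketch of that idea, not part of the original answer and not benchmarked here. The helper name rolling_last_rank is hypothetical, and it assumes the data contains no NaN values.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view


def rolling_last_rank(values, window):
    """Average rank of each window's last element within that window.

    Hypothetical helper (not from the answer above). It mirrors
    rankdata(x)[-1] with 'average' tie handling and assumes the
    input contains no NaN values.
    """
    values = np.asarray(values, dtype=float)
    # Pad with NaN so the output lines up with the pandas rolling result.
    out = np.full(values.shape[0], np.nan)
    if values.shape[0] < window:
        return out
    # View of shape (n - window + 1, window); the input data is not copied.
    windows = sliding_window_view(values, window_shape=window)
    last = windows[:, -1:]  # newest element of each window, kept 2-D for broadcasting
    smaller = (windows < last).sum(axis=1)
    equal = (windows == last).sum(axis=1)  # counts the last element itself
    # Tied values share the mean of the ranks smaller+1 ... smaller+equal.
    out[window - 1:] = smaller + (equal + 1) / 2
    return out


# The ten values from the question, with the same window of 4.
s = np.array([0.340396, 0.664459, 0.647212, 0.529363, 0.535349,
              0.781628, 0.313549, 0.933539, 0.618337, 0.013442])
print(rolling_last_rank(s, 4))
# ranks from index 3 onward: 2, 2, 4, 1, 4, 2, 1 (same as Out[20] in the question)
```

The two comparisons each build a temporary boolean array with one entry per window element (about 10 million entries for 100000 points and a window of 100), so memory use grows with the window size. pandas 1.4 and later also provide a built-in Rolling.rank method, so s.rolling(100).rank() is worth timing as well where that version is available.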