I would like to compute a rolling rank over a series of data.
Suppose I have a pandas Series:
In [18]: s = pd.Series(np.random.rand(10))
In [19]: s
Out[19]:
0 0.340396
1 0.664459
2 0.647212
3 0.529363
4 0.535349
5 0.781628
6 0.313549
7 0.933539
8 0.618337
9 0.013442
dtype: float64
I can use the built-in pandas rank function like this:
In [20]: s.rolling(4).apply(lambda x: pd.Series(x).rank().iloc[-1])
<ipython-input-20-41df4deb36f8>:1: FutureWarning: Currently, 'apply' passes the values as ndarrays to the applied function. In the future, this will change to passing it as Series objects. You need to specify 'raw=True' to keep the current behaviour, and you can pass 'raw=False' to silence this warning
s.rolling(4).apply(lambda x: pd.Series(x).rank().iloc[-1])
Out[20]:
0 NaN
1 NaN
2 NaN
3 2.0
4 2.0
5 4.0
6 1.0
7 4.0
8 2.0
9 1.0
dtype: float64
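The FutureWarning above suggests passing raw=True. A minimal sketch of that variant follows, assuming only the rank of the last value in each window is needed and that ties should get the average rank (the default of rank()). The helper name last_rank is made up for illustration, and the Series is rebuilt from the values printed above.

```python
import numpy as np
import pandas as pd


def last_rank(window: np.ndarray) -> float:
    """Average 1-based rank of the last value within its window (hypothetical helper)."""
    last = window[-1]
    less = np.sum(window < last)     # values strictly below the last one
    equal = np.sum(window == last)   # ties, including the last value itself
    return less + (equal + 1) / 2


# Same values as the Series shown above
s = pd.Series([0.340396, 0.664459, 0.647212, 0.529363, 0.535349,
               0.781628, 0.313549, 0.933539, 0.618337, 0.013442])

# raw=True hands each window to the function as an ndarray, not a Series
result = s.rolling(4).apply(last_rank, raw=True)
print(result)
```

This avoids building a new Series for every window, which is probably a good part of the cost, though I have not benchmarked it.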
This works, but it is slow. Here is a benchmark:
In [24]: %timeit pd.Series(np.random.rand(100000)).rolling(100).apply(lambda x: pd.Series(x).rank().iloc[-1])
<magic-timeit>:1: FutureWarning: Currently, 'apply' passes the values as ndarrays to the applied function. In the future, this will change to passing it as Series objects. You need to specify 'raw=True' to keep the current behaviour, and you can pass 'raw=False' to silence this warning
22.5 s ± 292 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
Is there a good way to speed this up? I suspect the rolling loop itself has room for improvement. Thanks.
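One direction that might help, as a sketch rather than a tested answer: replace the per-window Python call with a vectorized comparison over all windows at once. This assumes NumPy 1.20 or later (for sliding_window_view), data without NaN, and average ranking for ties. The function name rolling_last_rank is hypothetical.

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view  # requires NumPy >= 1.20


def rolling_last_rank(values, window):
    """Rank of the last element of each rolling window (hypothetical helper).

    Assumes no NaN in `values`; the first window - 1 results are NaN.
    """
    values = np.asarray(values, dtype=float)
    out = np.full(values.shape[0], np.nan)
    if values.shape[0] < window:
        return out
    # View of shape (n - window + 1, window); the data is not copied here
    windows = sliding_window_view(values, window)
    last = windows[:, -1:]                       # last value of each window
    less = (windows < last).sum(axis=1)          # strictly smaller values
    equal = (windows == last).sum(axis=1)        # ties, including itself
    out[window - 1:] = less + (equal + 1) / 2    # average rank for ties
    return out


data = [0.340396, 0.664459, 0.647212, 0.529363, 0.535349,
        0.781628, 0.313549, 0.933539, 0.618337, 0.013442]
ranks = rolling_last_rank(data, 4)
print(ranks)
```

The comparisons materialize a boolean array of shape (n - window + 1, window), so memory grows with the window size; for 100000 values and a window of 100 that is about ten million booleans.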
pip install numpy==1.20.0rc2? - David M.