Pandas rolling max with groupby

Question

8

I'm having trouble with the Pandas rolling function: for each row, I want to compute the maximum value seen so far within its group. Here is an example:

import pandas as pd
df = pd.DataFrame([[1,3], [1,6], [1,3], [2,2], [2,1]], columns=['id', 'value'])

which looks like this:

   id  value
0   1      3
1   1      6
2   1      3
3   2      2
4   2      1

Now I would like to obtain the following DataFrame:

   id  value
0   1      3
1   1      6
2   1      6
3   2      2
4   2      2

The problem is that when I do:

df.groupby('id')['value'].rolling(1).max()

I get the same DataFrame back. And when I do:

df.groupby('id')['value'].rolling(3).max()

I get a DataFrame with NaN values. Can someone explain how to properly use rolling, or some other Pandas function, to get the DataFrame I want?

- splinter

2

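If you want something analogous to "rolling", you can use expanding like this: df.groupby('id').expanding().max(). However, some quick tests showed it to be slower than the other two answers. Just FYI, since "expanding" does give you more options than "cummax" if you need them (window size, etc.). - JohnE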
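A minimal sketch of the expanding approach from the comment above, assuming the df from the question. Selecting ['value'] and dropping the group level with reset_index are choices made here for illustration, so that the result lines up with df's index; min_periods is shown as one option that expanding() accepts and cummax() does not.

```python
import pandas as pd

df = pd.DataFrame([[1, 3], [1, 6], [1, 3], [2, 2], [2, 1]], columns=['id', 'value'])

# expanding() grows the window from the first row of each group up to the
# current row. The result is indexed by (id, original row label), so the
# 'id' level is dropped before assigning back to df. Values come back as floats.
expanding_max = (
    df.groupby('id')['value']
      .expanding()
      .max()
      .reset_index(level=0, drop=True)
)
df['expanding_max'] = expanding_max

# With min_periods=2, a row only gets a value once its group has supplied
# at least two observations; earlier rows are NaN.
expanding_min2 = (
    df.groupby('id')['value']
      .expanding(min_periods=2)
      .max()
      .reset_index(level=0, drop=True)
)

print(df)
print(expanding_min2)
```

For the plain "maximum so far" case this gives the same values as cummax(), only as floats and with an extra index level to remove.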

2 Answers

2

Using apply is slightly faster:

# Using apply (on pandas >= 2.0, pass group_keys=False to groupby() so the result keeps df's index)
df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
%timeit df['output'] = df.groupby('id').value.apply(lambda x: x.cummax())
1000 loops, best of 3: 1.57 ms per loop

The other method:

df['output'] = df.groupby('id').value.cummax()
%timeit df['output'] = df.groupby('id').value.cummax()
1000 loops, best of 3: 1.66 ms per loop

- Andrew L

1

It's time to upgrade to Pandas 0.20.1 ;) - MaxU - stand with Ukraine

- MaxU - stand with Ukraine · Accepted Answer

It looks like you need cummax() instead of .rolling(N).max()

In [29]: df['new'] = df.groupby('id').value.cummax()

In [30]: df
Out[30]:
   id  value  new
0   1      3    3
1   1      6    6
2   1      3    6
3   2      2    2
4   2      1    2
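As for why the two rolling attempts in the question behave as they do, here is a small sketch, assuming the same df. A window of 1 only ever sees the current row, and for an integer window min_periods defaults to the window size, so rolling(3) yields NaN until a group has supplied three rows. The variable names are illustrative only.

```python
import pandas as pd

df = pd.DataFrame([[1, 3], [1, 6], [1, 3], [2, 2], [2, 1]], columns=['id', 'value'])
grouped = df.groupby('id')['value']

def flat(result):
    # groupby().rolling() returns an (id, row label) MultiIndex; drop the id level.
    return result.reset_index(level=0, drop=True)

# Window of 1: each row is its own maximum, so the values come back unchanged.
roll1 = flat(grouped.rolling(1).max())

# Window of 3 with the default min_periods=3: only the third row of group 1
# has a full window, so every other row is NaN.
roll3 = flat(grouped.rolling(3).max())

# min_periods=1 removes the NaNs, but the window still forgets rows that are
# more than 3 back, so it matches cummax() only while groups have <= 3 rows.
roll3_min1 = flat(grouped.rolling(3, min_periods=1).max())

# cummax() has no window at all: it is the running maximum within each group.
running_max = grouped.cummax()

print(pd.DataFrame({'roll1': roll1, 'roll3': roll3,
                    'roll3_min1': roll3_min1, 'cummax': running_max}))
```

On this data roll3_min1 and cummax agree, but on a longer group a larger early value would eventually fall out of the 3-row window, which is why cummax() is the better fit for "maximum so far".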

Timing (using the brand new Pandas version 0.20.1):

In [3]: df = pd.concat([df] * 10**4, ignore_index=True)

In [4]: df.shape
Out[4]: (50000, 2)

In [5]: %timeit df.groupby('id').value.apply(lambda x: x.cummax())
100 loops, best of 3: 15.8 ms per loop

In [6]: %timeit df.groupby('id').value.cummax()
100 loops, best of 3: 4.09 ms per loop
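Outside IPython, the same comparison can be reproduced with the standard timeit module. This is a sketch rather than a benchmark: the helper names with_apply and with_cummax and the repeat count are made up here, group_keys=False is added so that apply keeps the original row labels on newer pandas, and the absolute numbers will differ by machine and pandas version.

```python
import timeit

import pandas as pd

small = pd.DataFrame([[1, 3], [1, 6], [1, 3], [2, 2], [2, 1]], columns=['id', 'value'])
big = pd.concat([small] * 10**4, ignore_index=True)  # 50,000 rows, as in the answer

def with_apply():
    # Python-level call per group; group_keys=False keeps the original row labels.
    return big.groupby('id', group_keys=False)['value'].apply(lambda s: s.cummax())

def with_cummax():
    # Built-in grouped cumulative maximum.
    return big.groupby('id')['value'].cummax()

# Check that both approaches produce the same values before timing them.
same = bool((with_apply().sort_index().to_numpy() == with_cummax().to_numpy()).all())

n = 20  # arbitrary repeat count, chosen for illustration
apply_ms = timeit.timeit(with_apply, number=n) / n * 1000
cummax_ms = timeit.timeit(with_cummax, number=n) / n * 1000

print(f"results match: {same}")
print(f"apply : {apply_ms:.2f} ms per call")
print(f"cummax: {cummax_ms:.2f} ms per call")
```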

NOTE: from the Pandas 0.20.0 "what's new" notes:

Improved performance of groupby().cummin() and groupby().cummax() (GH15048, GH15109, GH15561, GH15635)