Pandas: rolling mean by group on a MultiIndex DataFrame

Question

I want to compute a rolling mean on a DataFrame, grouped by the second index level (Key2 in the code sample below).

import pandas as pd
d = {'Key1':[1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6], 'Key2':[2,7,8,5,3,2,7,5,8,7,2,9,8,3,9,2,7,9],'Value':[1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3]}
df = pd.DataFrame(d)
df = df.set_index(['Key1', 'Key2'])
df['MA'] = (df.groupby('Key2')['Value']
                .rolling(window=3)
                .mean()
                .reset_index(level=0, drop=True))

print(df)

Expected output:

           Value        MA
Key1 Key2                 
1    2         1       NaN
     7         2       NaN
     8         3       NaN
2    5         1       NaN
     3         2       NaN
     2         3       NaN
3    7         1       NaN
     5         2       NaN
     8         3       NaN
4    7         1  1.333333
     2         2  2.000000
     9         3       NaN
5    8         1  2.333333
     3         2       NaN
     9         3       NaN
6    2         1  2.000000
     7         2  1.333333
     9         3  3.000000

But the actual output is all NaN. Something seems to go wrong in the assignment.

           Value        MA
Key1 Key2                 
1    2         1       NaN
     7         2       NaN
     8         3       NaN
2    5         1       NaN
     3         2       NaN
     2         3       NaN
3    7         1       NaN
     5         2       NaN
     8         3       NaN
4    7         1      NaN
     2         2       NaN
     9         3       NaN
5    8         1      NaN
     3         2       NaN
     9         3       NaN
6    2         1      NaN
     7         2       NaN
     9         3       NaN

Python 3.8 + Pandas 1.2.1. (I also tried Python 3.7.9 + Pandas 1.1.5.)

- Yoh

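Yes, this is expected behavior. With rolling(3), you only get non-NaN values for Key2 groups that have at least 3 rows. You can pass min_periods=1, i.e. rolling(3, min_periods=1). Does that match what you expect? - Quang Hoang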
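
To illustrate the min_periods point in the comment above, here is a minimal sketch. The Series `s` is a made-up stand-in for a Key2 group that has only two rows (such as Key2 == 5 or Key2 == 3 in the question); it is not taken from the original post.

```python
import pandas as pd

# Hypothetical two-row group, standing in for a small Key2 group.
s = pd.Series([1.0, 2.0])

# With an integer window, min_periods defaults to the window size,
# so no position ever sees 3 rows and every result is NaN.
strict = s.rolling(window=3).mean()

# With min_periods=1, the mean is taken over however many rows
# are available so far in the window.
relaxed = s.rolling(window=3, min_periods=1).mean()

print(strict.tolist())
print(relaxed.tolist())
```

Note that min_periods only changes how short windows are handled. It would not explain a column that is NaN even for groups with three or more rows, which is what the question reports.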

@Quang Hoang, I am not getting the expected output. Please see the updated actual output. - Yoh

The code returns the expected output on my system. You may have a different Pandas version. Try printing the groupby().rolling().mean() Series to see whether reset_index is needed. - Quang Hoang
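
The check suggested in the comment above might look like the sketch below. It rebuilds the DataFrame from the question and prints the index levels of the intermediate result next to those of `df`. The name `rolled` is introduced here for illustration only. The printed levels can differ between Pandas versions, which is the point of the check, so no expected output is shown.

```python
import pandas as pd

d = {'Key1': [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6],
     'Key2': [2,7,8,5,3,2,7,5,8,7,2,9,8,3,9,2,7,9],
     'Value': [1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3]}
df = pd.DataFrame(d).set_index(['Key1', 'Key2'])

# Intermediate result, before any reset_index call.
rolled = df.groupby('Key2')['Value'].rolling(window=3).mean()

# Column assignment aligns on index labels, so the levels of `rolled`
# must match those of `df` for the values to land in the right rows.
print(rolled.index.nlevels, list(rolled.index.names))
print(df.index.nlevels, list(df.index.names))
```

If `rolled` has more levels than `df`, the extra group-key level has to be dropped before assigning. If the levels do not line up at all, the assignment finds no matching labels and fills the column with NaN.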

Which versions of Python and pandas are installed on your system? - Yoh

Python 3.7 and Pandas 1.1.4. - Quang Hoang

I tried Python 3.7.9 and Pandas 1.1.5 and still get all-NaN output. Without .reset_index, an exception is raised: "cannot handle a non-unique multi-index!" - Yoh

1 Answer

- jezrael · Accepted Answer

Use a lambda function so that the MultiIndex is not lost and the assignment works correctly:

df['MA'] = df.groupby('Key2')['Value'].apply(lambda x: x.rolling(window=3).mean())
print(df)
           Value        MA
Key1 Key2                 
1    2         1       NaN
     7         2       NaN
     8         3       NaN
2    5         1       NaN
     3         2       NaN
     2         3       NaN
3    7         1       NaN
     5         2       NaN
     8         3       NaN
4    7         1  1.333333
     2         2  2.000000
     9         3       NaN
5    8         1  2.333333
     3         2       NaN
     9         3       NaN
6    2         1  2.000000
     7         2  1.333333
     9         3  3.000000
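
A possible variant, not part of the original answer: `transform` is documented to return a result indexed like its input, so the assignment does not depend on how `apply` treats the group keys in a given Pandas version. The sketch below assumes the same data as in the question.

```python
import pandas as pd

d = {'Key1': [1,1,1,2,2,2,3,3,3,4,4,4,5,5,5,6,6,6],
     'Key2': [2,7,8,5,3,2,7,5,8,7,2,9,8,3,9,2,7,9],
     'Value': [1,2,3,1,2,3,1,2,3,1,2,3,1,2,3,1,2,3]}
df = pd.DataFrame(d).set_index(['Key1', 'Key2'])

# transform keeps the original (Key1, Key2) index, so the result
# aligns row for row with df when it is assigned as a new column.
df['MA'] = (df.groupby('Key2')['Value']
              .transform(lambda x: x.rolling(window=3).mean()))

print(df)
```

The rolling window still runs inside each Key2 group in the original row order, so groups with fewer than three rows stay NaN, as in the expected output.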