How do I compute a grouped, weighted aggregation of a column in pandas?

Question


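I'm a developer who works mainly in JavaScript, and I'm trying to learn pandas and do some data analysis. Part of that analysis involves converting a team's match results (win/loss) into a numeric rating based on win percentage.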

In short: I want to get from DF 1 to DF 3.

DF 1

|   season  | opponent  |   outcome |
-------------------------------------
|   2020    |   A       |   w       |
|   2020    |   A       |   l       |
|   2020    |   B       |   w       |
|   2020    |   B       |   w       |
|   2020    |   C       |   l       |
|   2020    |   C       |   l       |
|   2021    |   A       |   w       |
|   2021    |   A       |   w       |
|   2021    |   B       |   w       |
|   2021    |   B       |   l       |
|   2021    |   C       |   w       |
|   2021    |   C       |   w       |

I need to group by season and opponent and compute the win percentage.

DF 2

|   season  | opponent  |  win %    |
-------------------------------------
|   2020    |   A       |   50      |
|   2020    |   B       |   100     |
|   2020    |   C       |   0       |
|   2021    |   A       |   100     |
|   2021    |   B       |   50      |
|   2021    |   C       |   100     |

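Next, we need to compute a rating per season. This is done by averaging the win % across the teams within the same season, with the caveat that the win % against Team A counts twice as much as the win % against the other teams. This is just an arbitrary formula; the actual calculation is more complex (different opponents have different weights, and I need a way to pass those in as part of a custom lambda function), but I'm trying to simplify the question.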

DF 3

|   season  |   rating  |
-------------------------
|   2020    |   50.0    |
|   2021    |   87.5    |

Example rating calculation: rating for the 2020 season = (Team A win % * 2 + Team B win % + Team C win %) / (total number of teams + 1) = (50 * 2 + 100 + 0) / (3 + 1) = 50.0
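The formula above can be sketched as a per-season function applied to DF 2. This is only an illustration under assumptions: the `weights` dict, `default_weight`, and the helper name `season_rating` are hypothetical and do not come from the question, which only says that Team A counts double.

```python
import pandas as pd

# DF 2 from the question: win % per season and opponent.
df2 = pd.DataFrame({
    "season": [2020, 2020, 2020, 2021, 2021, 2021],
    "opponent": ["A", "B", "C", "A", "B", "C"],
    "win %": [50, 100, 0, 100, 50, 100],
})

# Hypothetical weight table: Team A counts double, any other team counts once.
weights = {"A": 2}
default_weight = 1

def season_rating(group: pd.DataFrame) -> float:
    """Weighted sum of win % divided by (number of teams + 1), as in the example."""
    w = group["opponent"].map(weights).fillna(default_weight)
    return (group["win %"] * w).sum() / (len(group) + 1)

df3 = (
    df2.groupby("season")[["opponent", "win %"]]
       .apply(season_rating)
       .reset_index(name="rating")
)
print(df3)
```

In this particular example the denominator (3 + 1) happens to equal the sum of the weights (2 + 1 + 1). With arbitrary weights, dividing by `w.sum()` instead would give a conventional weighted mean; whether that matches the real formula is an assumption to check.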

How can we get from the first DataFrame to the last one using pandas? I can get a version of DF 2 with the following:

df2 = df1.groupby(["season", "opponent"])["outcome"].value_counts(normalize = True).to_frame()

This frame includes the loss percentages, which I don't need, but that wouldn't matter if I could filter or drop them during the "conversion" to DF 3.

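I've been trying things like df2 = df2[df2["outcome"] != "w"] or df2 = df2.query('outcome != "w"') to drop the extra rows for losses, based on an answer to another question, but neither worked. I suspect this is because outcome is a nested column. I also noticed this question, but I think what I need is a wildcard to access the nested outcome column regardless of opponent.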
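One likely reason the filters above misbehave is that, after `groupby(...).value_counts()`, `outcome` is a level of the row index rather than a regular column. The sketch below shows two ways to filter on it; the column name `fraction` is an arbitrary choice, not something from the question. Keeping only the win rows corresponds to `outcome == "w"`.

```python
import pandas as pd

df1 = pd.DataFrame({
    "season": [2020] * 6 + [2021] * 6,
    "opponent": ["A", "A", "B", "B", "C", "C"] * 2,
    "outcome": ["w", "l", "w", "w", "l", "l", "w", "w", "w", "l", "w", "w"],
})

counts = df1.groupby(["season", "opponent"])["outcome"].value_counts(normalize=True)

# "outcome" is the innermost index level here, not a regular column.
print(counts.index.names)

# Option 1: move the index levels into columns, then filter as usual.
flat = counts.reset_index(name="fraction")
wins = flat[flat["outcome"] == "w"]

# Option 2: filter directly on the index level, without flattening.
wins_by_level = counts[counts.index.get_level_values("outcome") == "w"]

print(wins)
```

Note that a group with no wins at all (opponent C in 2020) has no "w" row to keep, so it disappears from the filtered result instead of showing up as 0.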

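Note: if there is a more efficient way to go directly from DF 1 to DF 3 (this seems close, but is not quite the same), I'd be happy to explore that too.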
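As a rough sketch of a direct route from DF 1 to DF 3, one option is to average a boolean "won" flag per group, pivot opponents into columns, and multiply by a weight per column. The `weights` Series is a hypothetical stand-in for the real, more complex weighting, and the intermediate names (`win_pct`, `n_teams`) are illustrative only.

```python
import pandas as pd

df1 = pd.DataFrame({
    "season": [2020] * 6 + [2021] * 6,
    "opponent": ["A", "A", "B", "B", "C", "C"] * 2,
    "outcome": ["w", "l", "w", "w", "l", "l", "w", "w", "w", "l", "w", "w"],
})

# Hypothetical per-opponent weights; the question only states that A counts double.
weights = pd.Series({"A": 2, "B": 1, "C": 1})

# The mean of a boolean "won" flag per group is the win fraction; scale to percent.
win_pct = (
    df1["outcome"].eq("w")
       .groupby([df1["season"], df1["opponent"]])
       .mean()
       .mul(100)
       .unstack("opponent")  # rows: season, columns: opponent
)

# Weighted sum across opponents, divided by (number of teams + 1) as in the question.
n_teams = win_pct.notna().sum(axis=1)
df3 = (win_pct.mul(weights, axis=1).sum(axis=1) / (n_teams + 1)).reset_index(name="rating")
print(df3)
```

Because the win fraction comes from a mean rather than from `value_counts`, a group with zero wins yields 0 directly and does not need to be filled in afterwards.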

- Alvin Teh

2 Answers

import pandas as pd

df_test = pd.DataFrame(data={'season':[2020]*6 + [2021]*6, 'opponent': ['A', 'A', 'B', 'B', 'C', 'C']*2,
                        'outcome': ['w', 'l', 'w', 'w', 'l', 'l', 'w', 'w', 'w', 'l', 'w', 'w']})

df_weightage = pd.DataFrame(data={'season':[2020]*3 + [2021]*3, 'opponent': ['A', 'B', 'C']*2,
                        'weightage': [0.2, 0.3, 0.5, 0.1, 0.2, 0.7]})

print(df_test)
print('='*30)
print(df_weightage)
print('='*35)

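# Fraction of games won within one (season, opponent) group
def get_pct(data):
    return len(data[data == 'w'])/len(data)

# Weighted sum of the win percentages in one season, divided by the number of opponents
def get_rating(data):
    return sum(data['win_percentage']*data['weightage'])/len(data)

# Step 1: win fraction per season and opponent
df_test = df_test.groupby(["season", "opponent"])["outcome"].apply(get_pct).rename('win_percentage').reset_index()
print(df_test)
print('='*45)

# Step 2: attach each opponent's weight for that season
df_test = df_test.merge(df_weightage, how= 'left', on=['season', 'opponent'])
print(df_test)
print('='*45)

# Step 3: collapse each season into a single rating
df_ratings = df_test.groupby(['season'])[['win_percentage', 'weightage']].apply(get_rating).rename('ratings').reset_index()
print(df_ratings)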

- Muhammad Hassan

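Thanks for the answer! I've verified that it works, but it would be even better if you could add some comments to make it easier to follow. That would also help explain how you are using the weightage column. - Alvin Teh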

- SeaBean · Accepted Answer

You can get df2 as follows:

df2 = (df1.groupby(["season", "opponent"])["outcome"]
          .value_counts(normalize=True)
          .unstack(fill_value=0).stack(dropna=False)
          .mul(100)
          .reset_index(name='win %')
          .query('outcome == "w"')
      ).reset_index(drop=True)

Result

print(df2)

   season opponent outcome  win %
0    2020        A       w   50.0
1    2020        B       w  100.0
2    2020        C       w    0.0
3    2021        A       w  100.0
4    2021        B       w   50.0
5    2021        C       w  100.0
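To see why the unstack/stack round trip matters, consider opponent C in 2020, which has only losses. The sketch below (variable names `fractions` and `wide` are illustrative) checks that the raw `value_counts` result has no "w" entry for that group, and that `unstack(fill_value=0)` creates it with a value of 0.

```python
import pandas as pd

df1 = pd.DataFrame({
    "season": [2020] * 6 + [2021] * 6,
    "opponent": ["A", "A", "B", "B", "C", "C"] * 2,
    "outcome": ["w", "l", "w", "w", "l", "l", "w", "w", "w", "l", "w", "w"],
})

fractions = df1.groupby(["season", "opponent"])["outcome"].value_counts(normalize=True)

# Opponent C in 2020 has only losses, so there is no (2020, "C", "w") entry yet.
print((2020, "C", "w") in fractions.index)

# Pivoting outcomes into columns with fill_value=0 creates the missing cell.
wide = fractions.unstack(fill_value=0)
print(wide.loc[(2020, "C"), "w"])

# The "w" column alone already holds the win fraction for every group.
print(wide["w"].mul(100))
```

Selecting `wide["w"]` directly is an alternative to stacking again and then querying for `outcome == "w"`; which reads better is a matter of taste.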

Next, to get df3 using the formula, you can use the following code:

df2a = df2.set_index('season')

# Get: (team A % * 2 + team B win % + team C win %)
df3_x = (df2a.loc[df2a['opponent'] =='A', 'win %'] * 2 
             + df2a.loc[df2a['opponent'] =='B', 'win %'] 
             + df2a.loc[df2a['opponent'] =='C', 'win %']
        )

# Get (total no of teams + 1) for a particular season
df3_y = df2.groupby('season')['opponent'].count() + 1

df3 = (df3_x / df3_y).reset_index(name='rating')

Result

print(df3)

   season  rating
0    2020    50.0
1    2021    87.5

For your reference, here are the intermediate results from deriving df3:

# team A % * 2 + team B win % + team C win % 
print(df3_x)

season
2020    200.0
2021    350.0
Name: win %, dtype: float64

# (total no of teams + 1) for a particular season
print(df3_y)

season
2020    4
2021    4
Name: opponent, dtype: int64