How to group by in Pandas and compute a value from other columns

3

I have a summarized data frame with col1, col2, col3 and Count, and I need to apply different weights to Count.

The dataset looks like this:


# Current result 
    col1 col2  col3   Count   Weightage_count
--------------------------------------------- 
 1:  A    S1   X110     2          2
 2:  A    S1   X150     2          0.5
 3:  A    S2   X212     2          1
 4:  A    S2   X200     1          0.5
 5:  A    S2   X211     1          0.25
 6:  B    S3   X311     4          4
 7:  C    S4   X222     3          1.5


import pandas as pd

data = {'Col1': ['A','A','A','A','A','B','C'],
        'Col2': ['S1','S1','S2','S2','S2','S3','S4'],
        'Col3': ['X110','X150','X212','X200','X211','X311','X222'],
        'Count': [2,2,2,1,1,4,3],
        'Weightage_count': [2, 0.5, 1, 0.5, 0.25, 4, 1.5]}

df = pd.DataFrame(data)

I want to compute a result grouped by Col1 and Col2.

  • Result = (sum of Weightage_count per Col1 and Col2) / (sum of Count per Col1 and Col2)
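
To make the formula concrete, here is a minimal sketch that evaluates it by hand for a single group (Col1 = A, Col2 = S1) using the sample data above. The names `mask` and `result_a_s1` are illustrative only and do not appear in the question.

```python
import pandas as pd

data = {'Col1': ['A','A','A','A','A','B','C'],
        'Col2': ['S1','S1','S2','S2','S2','S3','S4'],
        'Col3': ['X110','X150','X212','X200','X211','X311','X222'],
        'Count': [2,2,2,1,1,4,3],
        'Weightage_count': [2, 0.5, 1, 0.5, 0.25, 4, 1.5]}
df = pd.DataFrame(data)

# Select the rows of one group (illustrative names, not from the question)
mask = (df['Col1'] == 'A') & (df['Col2'] == 'S1')

# Sum of weighted counts divided by sum of counts: (2 + 0.5) / (2 + 2)
result_a_s1 = df.loc[mask, 'Weightage_count'].sum() / df.loc[mask, 'Count'].sum()
print(result_a_s1)
```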

Expected result:

    Col1  Col2  Result
-------------------
1   A     S1     0.625
2   A     S2     0.4375
3   B     S3     1 
4   C     S4     0.5

3 Answers

3

First aggregate with sum, then divide the two columns in DataFrame.eval:

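# Select only the numeric columns so the string column Col3 is not summed
df = (df.groupby(['Col1','Col2'])[['Count','Weightage_count']]
        .sum()
        .eval('Weightage_count / Count')
        .reset_index(name='Result'))
print(df)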
  Col1 Col2  Result
0    A   S1  0.6250
1    A   S2  0.4375
2    B   S3  1.0000
3    C   S4  0.5000

Or divide with Series.div, using DataFrame.pop to remove the columns once they have been used:

df = df.groupby(['Col1','Col2'], as_index=False)[['Count','Weightage_count']].sum()
df['new'] = df.pop('Weightage_count').div(df.pop('Count'))
print (df)
  Col1 Col2     new
0    A   S1  0.6250
1    A   S2  0.4375
2    B   S3  1.0000
3    C   S4  0.5000

If you also need to keep the summed columns:

df = df.groupby(['Col1','Col2'])[['Count','Weightage_count']].sum()
df['new'] = df['Weightage_count'].div(df['Count'])
print (df)
           Count  Weightage_count     new
Col1 Col2                                
A    S1        4             2.50  0.6250
     S2        4             1.75  0.4375
B    S3        4             4.00  1.0000
C    S4        3             1.50  0.5000

2
I'm a fan of eval. Also, you can assign a new column inside it: df.groupby(['Col1', 'Col2']).sum().eval('Result = Weightage_count / Count').Result.reset_index() - piRSquared
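
For reference, a self-contained sketch of that one-liner on the sample data from the question. It selects the two numeric columns before summing, which the comment itself does not do, and the variable name `result` is an assumption added for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': ['A','A','A','A','A','B','C'],
    'Col2': ['S1','S1','S2','S2','S2','S3','S4'],
    'Col3': ['X110','X150','X212','X200','X211','X311','X222'],
    'Count': [2,2,2,1,1,4,3],
    'Weightage_count': [2, 0.5, 1, 0.5, 0.25, 4, 1.5],
})

result = (df.groupby(['Col1', 'Col2'])[['Count', 'Weightage_count']]
            .sum()
            # An assignment inside eval returns a copy with the new column added
            .eval('Result = Weightage_count / Count')
            .Result
            .reset_index())
print(result)
```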

1
Use `GroupBy.agg`:
In [438]: x = df.groupby(['Col1', 'Col2']).agg({'Weightage_count': 'sum', 'Count': 'sum'})

In [439]: x['Result'] = x.Weightage_count/x.Count

In [440]: x
Out[440]: 
           Weightage_count  Count  Result
Col1 Col2                                
A    S1               2.50      4  0.6250
     S2               1.75      4  0.4375
B    S3               4.00      4  1.0000
C    S4               1.50      3  0.5000
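
A possible variation on the same idea, sketched with named aggregation and `DataFrame.assign` so the result comes back as a flat frame in one chain. The column names `weighted` and `total` are made up for this example, and the sketch assumes a pandas version that supports named aggregation (0.25 or later).

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': ['A','A','A','A','A','B','C'],
    'Col2': ['S1','S1','S2','S2','S2','S3','S4'],
    'Col3': ['X110','X150','X212','X200','X211','X311','X222'],
    'Count': [2,2,2,1,1,4,3],
    'Weightage_count': [2, 0.5, 1, 0.5, 0.25, 4, 1.5],
})

x = (df.groupby(['Col1', 'Col2'], as_index=False)
       # Named aggregation: output_name=(source_column, aggregation)
       .agg(weighted=('Weightage_count', 'sum'), total=('Count', 'sum'))
       # Divide the two per-group sums to get the ratio
       .assign(Result=lambda d: d['weighted'] / d['total']))
print(x)
```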

1
You can use GroupBy.pipe:

In [4]: group = df.groupby(['Col1', 'Col2'])

In [5]: group.pipe(lambda df: df.Weightage_count.sum()/df.Count.sum())
Out[5]: 
Col1  Col2
A     S1      0.6250
      S2      0.4375
B     S3      1.0000
C     S4      0.5000
dtype: float64

If you want the result column to have a name, you can use the rename method:
In [13]: group.pipe(lambda df: df.Weightage_count.sum()/df.Count.sum()).rename('Result').reset_index()
Out[13]: 
  Col1 Col2  Result
0    A   S1  0.6250
1    A   S2  0.4375
2    B   S3  1.0000
3    C   S4  0.5000
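
Note that the callable given to pipe receives the GroupBy object itself, not each group's DataFrame, so the lambda argument named df above is really the grouped object. A minimal sketch with a named function that makes this explicit; `weighted_ratio` and `grouped` are hypothetical names chosen for this example.

```python
import pandas as pd

df = pd.DataFrame({
    'Col1': ['A','A','A','A','A','B','C'],
    'Col2': ['S1','S1','S2','S2','S2','S3','S4'],
    'Col3': ['X110','X150','X212','X200','X211','X311','X222'],
    'Count': [2,2,2,1,1,4,3],
    'Weightage_count': [2, 0.5, 1, 0.5, 0.25, 4, 1.5],
})

def weighted_ratio(grouped):
    """Take a GroupBy object and return per-group sum(weighted) / sum(count)."""
    return grouped['Weightage_count'].sum() / grouped['Count'].sum()

result = (df.groupby(['Col1', 'Col2'])
            .pipe(weighted_ratio)   # called once with the whole GroupBy object
            .rename('Result')
            .reset_index())
print(result)
```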
