Pandas: grouping and aggregating by a complex condition, with aggregation over multiple columns

Question

3

I have the following dataset:

import pandas as pd
from itertools import combinations

d = {'Order_ID': ['001', '001', '002', '003', '003', '003', '004', '004'], 
 'Products': ['Apple', 'Pear', 'Banana', 'Apple', 'Pear', 'Banana', 'Apple', 'Pear'],
 'Revenue': [15, 10, 5, 25, 15, 10, 5, 30]}
df = pd.DataFrame(data=d)
df

Output:

    Order_ID    Products    Revenue
  0   001        Apple        15
  1   001        Pear         10
  2   002        Banana       5
  3   003        Apple        25
  4   003        Pear         15
  5   003        Banana       10
  6   004        Apple        5
  7   004        Pear         30

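What I want to produce is a dataset containing every pair of products that can occur together within an order, along with each pair's frequency and its cumulative total revenue. It should look like this: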

d = {'Groups': ['(Apple, Pear)', '(Banana, Apple)', '(Banana, Pear)'], 
 'Frequency': [3, 1, 1],
 'Revenue': [100, 35, 40]}
df2 = pd.DataFrame(data=d)
df2

This returns:

   Groups         Frequency    Revenue
0  (Apple, Pear)      3          100
1  (Banana, Apple)    1           35
2  (Banana, Pear)     1           40

I can get the pairs and their frequencies, but I can't figure out how to get the revenue part within the groupby statement I'm using.

def find_pairs(x):
  return pd.Series(list(combinations(set(x), 2)))

df_group = df.groupby('Order_ID')['Products'].apply(find_pairs).value_counts()
df_group

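After applying the function to "Products", I need a further step that sums "Revenue" over the "new" groups created by find_pairs. The revenue must be the total for each pair: every time the pair occurs again, the revenue of its products is added to that pair's running total.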

- Iñaki Baglivo

3 Answers

0

One possible solution:

import pandas as pd
import numpy as np
from itertools import combinations

# create pairs per order id
def pairs_per_id(df):
     pairs = (pd.concat(
     list(map(
          lambda x: pd.DataFrame({'Groups': [x], 
          'Frequency': sum(df['Products'].isin(x)) // 2,
          'Revenue': df.loc[df['Products'].isin(x), 'Revenue'].sum()}),
          combinations(np.unique(df['Products']), 2)))).reset_index(drop=True))
     return pairs

(df[df.groupby('Order_ID')['Order_ID']  
    .transform('count') != 1] # remove singleton groups
 .groupby('Order_ID', group_keys=False)
 .apply(pairs_per_id).groupby('Groups').sum().reset_index())

Output:

            Groups  Frequency  Revenue
0  (Apple, Banana)          1       35
1    (Apple, Pear)          3      100
2   (Banana, Pear)          1       25

- PaulS

0

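This is a bit tricky to do in a single line, but it becomes straightforward if you are willing to use an intermediate DataFrame in which revenue is indexed by pair.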

def pair_revenues_by_order(x):
    return {
        tuple(sorted([p0.Products, p1.Products])): p0.Revenue + p1.Revenue
        for [_, p0], [_, p1] in combinations(x.iterrows(), 2)
    }

pair_indexed_revenue = df.groupby("Order_ID").apply(pair_revenues_by_order).apply(pd.Series)

#            Apple         Pear
#          Banana  Pear Banana
# Order_ID                    
# 001         NaN  25.0    NaN
# 002         NaN   NaN    NaN
# 003        35.0  40.0   25.0
# 004         NaN  35.0    NaN

pair_totals = pd.DataFrame(
    {"total_revenue": pair_indexed_revenue.sum(axis=0), "frequency": pair_indexed_revenue.count(axis=0)}
)

# to get simple tuple indices instead of MultiIndex
pair_totals.set_index(pair_totals.index.to_flat_index())

#                  total_revenue  frequency
# (Apple, Banana)           35.0          1
# (Apple, Pear)            100.0          3
# (Banana, Pear)            25.0          1

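Edit: added tuple(sorted(...)). The pairs need to be hashable and unique; otherwise, if you have an order in which Banana appears before Apple, you will end up with both a (Banana, Apple) and an (Apple, Banana).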
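
To illustrate the point of that edit, here is a minimal sketch in plain Python. The two orders (`order_a`, `order_b`) are hypothetical and are not part of the question's data; they only show why an unsorted tuple makes a poor pair key.

```python
# Two hypothetical orders that contain the same products in a different order.
order_a = ["Apple", "Banana"]
order_b = ["Banana", "Apple"]

# Without sorting, the same pair yields two different keys.
unsorted_keys = {tuple(order_a), tuple(order_b)}

# Sorting first gives every pair a single canonical key.
sorted_keys = {tuple(sorted(order_a)), tuple(sorted(order_b))}

print(len(unsorted_keys))  # 2
print(sorted_keys)         # {('Apple', 'Banana')}
```

Because tuples are hashable, the sorted form can be used directly as a dictionary key or as a group label.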

- mmdanziger

- SomeDude · Accepted Answer

You can do:

import itertools  # the question only imports `combinations`, so import the module here

f = lambda x: list(itertools.combinations(x,2))
t = df.groupby('Order_ID').agg(f).explode(['Products', 'Revenue']).dropna()
out = t.groupby('Products').agg(
    Frequency=('Products','count'),
    Revenue=('Revenue', lambda x : sum([sum(y) for y in x]))
)

Printed output:

                 Frequency  Revenue
Products                           
(Apple, Banana)          1       35
(Apple, Pear)            3      100
(Pear, Banana)           1       25

Note that the revenue for (Pear, Banana) from the 'Order_ID'='003' group is 15+10=25, not 40.
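
As an independent check of the 25-versus-40 point, here is a minimal sketch that uses a different technique (a self-merge on Order_ID), not the approach taken in this answer. It assumes each product appears at most once per order; with repeated products, Frequency would count row pairs rather than orders. The names `pairs` and `totals` are chosen only for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "Order_ID": ["001", "001", "002", "003", "003", "003", "004", "004"],
    "Products": ["Apple", "Pear", "Banana", "Apple", "Pear", "Banana", "Apple", "Pear"],
    "Revenue": [15, 10, 5, 25, 15, 10, 5, 30],
})

# Pair every row with every row of the same order (includes self-pairs and both orderings).
pairs = df.merge(df, on="Order_ID", suffixes=("_a", "_b"))

# Keep each unordered pair once by requiring alphabetical order of the product names.
pairs = pairs[pairs["Products_a"] < pairs["Products_b"]].copy()

# Label each pair with a sorted tuple and add up the two products' revenue.
pairs["Groups"] = list(zip(pairs["Products_a"], pairs["Products_b"]))
pairs["Revenue"] = pairs["Revenue_a"] + pairs["Revenue_b"]

out = (
    pairs.groupby("Groups")
    .agg(Frequency=("Order_ID", "count"), Revenue=("Revenue", "sum"))
    .reset_index()
)

# Collect the result as {pair: (frequency, revenue)} for easy inspection.
totals = dict(zip(out["Groups"], zip(out["Frequency"], out["Revenue"])))
print(totals[("Banana", "Pear")])  # frequency 1, revenue 25
```

Order 002 drops out on its own because a single row cannot form a pair with a different product.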