Pandas基于当前数据框中的某一列创建一个新的数据框。

3
需要基于几个公式创建一个新的数据框,我可以使用分组和合并来创建多个数据帧。但是有没有更有效的方法来实现?
df_1 如下:
  df_1 = pd.DataFrame([['A', '1/1/2021','SKU_1','Customer Backhaul','34,848','$-51,100'],
              ['A', '1/1/2021','SKU_1','FOB','75,357','$12,407,112'],
              ['A', '1/1/2021','SKU_1','Price','75,357','$12,407,112'],
              ['A', '1/1/2021','SKU_1','Vendor Freight - Delivered','40,511','$65,470'],
              ['B', '1/1/2021','SKU_1','Customer Backhaul','197,904','$-157,487'],
              ['B', '1/1/2021','SKU_1','FOB','931,866','$50,059,515'],
              ['B', '1/1/2021','SKU_1','Price','931,866','$62,333,500'],
              ['B', '1/1/2021','SKU_1','Vendor Freight - Delivered','740,355','$1,220,927']], 
              columns=['Group', 'Month','ID','Cost Type','Volume','Order Cost'])

'Value'栏的公式:

  • 货运费 = 客户回程绝对值 + 供应商送货
  • 离岸价 = 离岸价
  • 价格 = 价格 - 回程
  • 体积 = 离岸价的体积

Out[df_2]

   Group   Month     ID      Cost Type   Volume    Cost
0   A   1/1/2021    SKU_1   Freight     75,357  $116,570
1   A   1/1/2021    SKU_1   FOB         75,357  $12,407,112
2   A   1/1/2021    SKU_1   Price       75,357  $12,434,063
3   B   1/1/2021    SKU_1   Freight     931,866 $1,378,414
4   B   1/1/2021    SKU_1   FOB         931,866 $50,059,515
5   B   1/1/2021    SKU_1   Price       931,866 $62,490,988

1
为了完全确定 - 您不需要数据框中的“Formulae”列,对吗?这只是为了让我们理解计算的方法? - CutePoison
1
仅对“value”列的计算进行解释。抱歉造成困惑。 - spartanboy
2个回答

4
我们可以使用groupby.apply来应用这些公式:
  • 要么返回每个组的Volume/Cost DataFrame
  • 要么返回Cost元组的Series并explode它们

DataFrame选项

  1. First convert the number strings into actual numbers (or if you're loading data with read_csv, use the thousands param):

    df_1['Volume'] = df_1['Volume'].str.replace(',', '').astype(int)
    df_1['Order Cost'] = df_1['Order Cost'].str.replace(r'[$,]', '', regex=True).astype(int)
    
  2. Given a Group/Month/ID group, return its Volume and Cost as a dataframe:

    def formulae_df(g):
        # set index to Cost Type for simpler indexing
        g = g.set_index('Cost Type')
    
        # initialize empty result df
        df = pd.DataFrame(columns=['Volume', 'Cost'], index=['Freight', 'FOB', 'Price']).rename_axis('Cost Type')
    
        # fill result df with forumlae
        df['Volume'] = g.loc['FOB', 'Volume']
        df.loc['Freight', 'Cost'] = abs(g.loc['Customer Backhaul', 'Order Cost']) + g.loc['Vendor Freight - Delivered', 'Order Cost']
        df.loc['FOB', 'Cost'] = g.loc['FOB', 'Order Cost']
        df.loc['Price', 'Cost'] = g.loc['Price', 'Order Cost'] - g.loc['Customer Backhaul', 'Order Cost']
        return df
    
  3. Then apply formulae_df with groupby.apply:

    df_2 = df_1.groupby(['Group', 'Month', 'ID']).apply(formulae_df).reset_index()
    
    #   Group     Month     ID  Cost Type  Volume      Cost
    # 0     A  1/1/2021  SKU_1    Freight   75357    116570
    # 1     A  1/1/2021  SKU_1        FOB   75357  12407112
    # 2     A  1/1/2021  SKU_1      Price   75357  12458212
    # 3     B  1/1/2021  SKU_1    Freight  931866   1378414
    # 4     B  1/1/2021  SKU_1        FOB  931866  50059515
    # 5     B  1/1/2021  SKU_1      Price  931866  62490987
    

使用explode函数处理数据系列选项

由于每个组中都有一个Volume与多个Cost,我们可以将Cost生成为列表/元组,然后使用explode函数来展开它们:

  1. The first step is still to convert the number strings into actual numbers:

    df_1['Volume'] = df_1['Volume'].str.replace(',', '').astype(int)
    df_1['Order Cost'] = df_1['Order Cost'].str.replace(r'[$,]', '', regex=True).astype(int)
    
  2. Given a Group/Month/ID group, compute its Volume (value) and Cost (tuple):

    def formulae_series(g):
        # set index for easy loc access
        g = g.set_index('Cost Type')
    
        # compute formulae
        volume = g.loc['FOB', 'Volume']
        costs = {
            'Freight': abs(g.loc['Customer Backhaul', 'Order Cost']) + g.loc['Vendor Freight - Delivered', 'Order Cost'],
            'FOB': g.loc['FOB', 'Order Cost'],
            'Price': g.loc['Price', 'Order Cost'] - g.loc['Customer Backhaul', 'Order Cost'],
        }
    
        # return volume as a value and costs as tuples
        return pd.Series({'Cost Type': costs.keys(), 'Volume': volume, 'Cost': costs.values()})
    
  3. When we apply formulae_series with groupby.apply, note how the Cost* columns contain tuples:

    df_2 = df_1.groupby(['Group', 'Month', 'ID']).apply(formulae_series)
    
    #                                   Cost Type  Volume                           Cost
    # Group Month    ID
    # A     1/1/2021 SKU_1  (Freight, FOB, Price)   75357   (116570, 12407112, 12458212)
    # B     1/1/2021 SKU_1  (Freight, FOB, Price)  931866  (1378414, 50059515, 62490987)
    
  4. So now explode those tuples into rows:

    df_2 = df_2.explode(['Cost Type', 'Cost']).reset_index()
    
    #   Group     Month     ID  Cost Type  Volume      Cost
    # 0     A  1/1/2021  SKU_1    Freight   75357    116570
    # 1     A  1/1/2021  SKU_1        FOB   75357  12407112
    # 2     A  1/1/2021  SKU_1      Price   75357  12458212
    # 3     B  1/1/2021  SKU_1    Freight  931866   1378414
    # 4     B  1/1/2021  SKU_1        FOB  931866  50059515
    # 5     B  1/1/2021  SKU_1      Price  931866  62490987
    

完整代码

以下是重组步骤(包括可选的转换回逗号/美元符号):

## load df_1
df_1 = pd.DataFrame([['A','1/1/2021','SKU_1','Customer Backhaul','34,848','$-51,100'],['A','1/1/2021','SKU_1','FOB','75,357','$12,407,112'],['A','1/1/2021','SKU_1','Price','75,357','$12,407,112'],['A','1/1/2021','SKU_1','Vendor Freight - Delivered','40,511','$65,470'],['B','1/1/2021','SKU_1','Customer Backhaul','197,904','$-157,487'],['B','1/1/2021','SKU_1','FOB','931,866','$50,059,515'],['B','1/1/2021','SKU_1','Price','931,866','$62,333,500'],['B','1/1/2021','SKU_1','Vendor Freight - Delivered','740,355','$1,220,927']],columns=['Group','Month','ID','Cost Type','Volume','Order Cost'])

## convert to numerics
df_1['Volume'] = df_1['Volume'].str.replace(',', '').astype(int)
df_1['Order Cost'] = df_1['Order Cost'].str.replace(r'[$,]', '', regex=True).astype(int)

## dataframe option
df_2 = df_1.groupby(['Group', 'Month', 'ID']).apply(formulae_df).reset_index()

## or apply formulae and explode costs
# df_2 = (df_1.groupby(['Group', 'Month', 'ID'])
#             .apply(formulae_series)
#             .explode(['Cost Type', 'Cost'])
#             .reset_index())

## optional: revert to comma/dollar strings
df_2['Volume'] = df_2['Volume'].map('{:,}'.format)
df_2['Cost'] = df_2['Cost'].map('${:,}'.format)

输出:

  Group     Month     ID  Cost Type   Volume         Cost
0     A  1/1/2021  SKU_1    Freight   75,357     $116,570
1     A  1/1/2021  SKU_1        FOB   75,357  $12,407,112
2     A  1/1/2021  SKU_1      Price   75,357  $12,458,212
3     B  1/1/2021  SKU_1    Freight  931,866   $1,378,414
4     B  1/1/2021  SKU_1        FOB  931,866  $50,059,515
5     B  1/1/2021  SKU_1      Price  931,866  $62,490,987

2
我发现有时候很难区分是要进行计算,还是只是想在文本中加入计算结果以及数字来源等信息。
以下是我对该问题的理解,并提供了一个示例(如果我理解有误,请随意评论):
您可以循环遍历每一行,然后计算值(并使用“replace”函数替换公式)。
replace_dict = {"Freight":"ABS(Backhaul)+Vendor Delivered", "Price":"Price - Backhaul",...,}
df["Definition or Formulae"] = df["Cost Type"].replace(replace_dict) #Replace "Freight with "ABS(Backhaul)+Vendor" etc in a new column called "Definition and Formulae


#Do some calculations with the values in each row
def get_value(row):
   cost_type = row["Cost Type"]
   if cost_type == "Freight":
        return row["Backhaul"].abs()+ row["Vendor Delivered"]
   if cost_type== "Price":
        return row["Price"] - row["Backhaul"]
   .
   .
   return row["Value"]  # If nothing special to do, just return the value
df["Value"] = df.apply(get_value, axis=1)

1
我不需要打印注释。但是,根据提供的公式,我需要在“值”列中执行一些计算。 - spartanboy
1
已更新问题并明确期望。希望这样可以澄清。 - spartanboy

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接