pandas: applying multiple functions of multiple columns to a groupby object

Question

14

I want to apply multiple functions of multiple columns to a groupby object, producing a new pandas.DataFrame.

I know how to do it in separate steps:

by_user = lasts.groupby('user')
elapsed_days = by_user.apply(lambda x: (x.elapsed_time * x.num_cores).sum() / 86400)
running_days = by_user.apply(lambda x: (x.running_time * x.num_cores).sum() / 86400)
user_df = elapsed_days.to_frame('elapsed_days').join(running_days.to_frame('running_days'))

This results in user_df being:

However, I suspect there is a better way, something like:

by_user.agg({'elapsed_days': lambda x: (x.elapsed_time * x.num_cores).sum() / 86400, 
             'running_days': lambda x: (x.running_time * x.num_cores).sum() / 86400})

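However, this doesn't work, because as far as I know agg() operates on pandas.Series.

I did find this question and answer, but the solutions look rather ugly to me, and considering that the answer is almost four years old, there may be a better way by now.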
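
To illustrate the point about agg(), here is a minimal sketch that records what kind of object a dictionary-style agg() passes to the function. The sample data and the helper name record_and_sum are assumptions for illustration (the data is borrowed from the answers below). Because the function only ever sees one column of one group, it cannot reach num_cores directly.

```python
import pandas as pd

# Sample data is an assumption, borrowed from the answers further down.
lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'num_cores': [7, 8, 9, 4]})

seen_types = set()

def record_and_sum(x):
    # Record the type of object agg() hands to the function, then aggregate.
    seen_types.add(type(x).__name__)
    return x.sum()

# With a dict, each function is applied to the named column, group by group.
totals = lasts.groupby('user').agg({'elapsed_time': record_and_sum})
print(seen_types)
print(totals)
```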

- johnbaltis

6 Answers

5

I think you can avoid agg or apply altogether: first multiply with mul, then divide with div, and finally group by the index and aggregate with sum:

lasts = pd.DataFrame({'user':['a','s','d','d'],
                   'elapsed_time':[40000,50000,60000,90000],
                   'running_time':[30000,20000,30000,15000],
                   'num_cores':[7,8,9,4]})

print (lasts)
   elapsed_time  num_cores  running_time user
0         40000          7         30000    a
1         50000          8         20000    s
2         60000          9         30000    d
3         90000          4         15000    d

by_user = lasts.groupby('user')
elapsed_days = by_user.apply(lambda x: (x.elapsed_time * x.num_cores).sum() / 86400)
print (elapsed_days)
running_days = by_user.apply(lambda x: (x.running_time * x.num_cores).sum() / 86400)
user_df = elapsed_days.to_frame('elapsed_days').join(running_days.to_frame('running_days'))
print (user_df)
      elapsed_days  running_days
user                            
a         3.240741      2.430556
d        10.416667      3.819444
s         4.629630      1.851852

lasts = lasts.set_index('user')
print (lasts[['elapsed_time','running_time']].mul(lasts['num_cores'], axis=0)
                                             .div(86400)
                                             .groupby(level=0)
                                             .sum())
      elapsed_time  running_time
user                            
a         3.240741      2.430556
d        10.416667      3.819444
s         4.629630      1.851852

- jezrael

2

If you want to use agg on a groupby object while also using data from other columns of the same DataFrame, you can proceed as follows:

Define your functions (lambda functions or not) that take as an input a Series, and get the data from other column(s) using the df.loc[series.index, col] syntax. With this example:
```
ed = lambda x: (x * lasts.loc[x.index, "num_cores"]).sum() / 86400. 
rd = lambda x: (x * lasts.loc[x.index, "num_cores"]).sum() / 86400.
```
where lasts is the main DataFrame, and we access the data in the column num_cores thanks to the .loc method.
Create a dictionary with these functions and the names for the newly created columns. The keys are the names of the columns on which to apply each function, and each value is another dictionary whose key is the name of the new column and whose value is the function.
```
my_func = {"elapsed_time" : {"elapsed_day" : ed},
           "running_time" : {"running_days" : rd}}
```

Groupby and aggregate:

user_df = lasts.groupby("user").agg(my_func)
user_df
     elapsed_time running_time
      elapsed_day running_days
user                          
a        3.240741     2.430556
d       10.416667     3.819444
s        4.629630     1.851852

If you want to remove the old column names:

 user_df.columns = user_df.columns.droplevel(0)
 user_df
      elapsed_day  running_days
user                           
a        3.240741      2.430556
d       10.416667      3.819444
s        4.629630      1.851852
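
As far as I know, pandas 1.0 and later reject the nested-dictionary form used above with a SpecificationError, so the following is a minimal sketch of the same idea using named aggregation instead. The helper name core_days is an assumption for illustration; the data is the sample set used in the other answers.

```python
import pandas as pd

lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'running_time': [30000, 20000, 30000, 15000],
                      'num_cores': [7, 8, 9, 4]})

def core_days(x):
    # x holds one group's values from a single column; the matching
    # num_cores rows are looked up in the outer DataFrame via the shared index.
    return (x * lasts.loc[x.index, 'num_cores']).sum() / 86400.

# Named aggregation: new_column=(source_column, function)
user_df = lasts.groupby('user').agg(
    elapsed_days=('elapsed_time', core_days),
    running_days=('running_time', core_days),
)
print(user_df)
```

This yields flat column names directly, so the droplevel step is not needed.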

HTH

- jrjc

1

In response to the bounty, we can make this more general by using partial application with functools.partial from the standard library.

import functools
import pandas as pd

#same data as other answer:
lasts = pd.DataFrame({'user':['a','s','d','d'],
                   'elapsed_time':[40000,50000,60000,90000],
                   'running_time':[30000,20000,30000,15000],
                   'num_cores':[7,8,9,4]})

#define the desired lambda as a function:
def myfunc(column, df, cores):
    # .ix was removed in pandas 1.0; .loc does the same label-based lookup here
    return (column * df.loc[column.index, cores]).sum() / 86400

#use the partial to define the function with a given column and df:
mynewfunc = functools.partial(myfunc, df = lasts, cores = 'num_cores')

#agg by the partial function
lasts.groupby('user').agg({'elapsed_time':mynewfunc, 'running_time':mynewfunc})

This gives us:

    running_time    elapsed_time
user        
a   2.430556    3.240741
d   3.819444    10.416667
s   1.851852    4.629630

This isn't especially useful for this particular example, but it may be more useful as a general pattern.

- jeremycg

0

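Here is a solution that stays very close to the original idea expressed under "I suspect there is a better way".

I'll use the same test data as the other answers: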

lasts = pd.DataFrame({'user':['a','s','d','d'],
                      'elapsed_time':[40000,50000,60000,90000],
                      'running_time':[30000,20000,30000,15000],
                      'num_cores':[7,8,9,4]})

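groupby.apply accepts a function that returns a DataFrame and automatically stitches the returned DataFrames together. There are two small catches in the code below. The first is that the values passed to DataFrame are single-element lists, not bare numbers.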

def aggfunc(group):
    """ This function mirrors the OP's idea. Note the values below are lists """
    return pd.DataFrame({'elapsed_days': [(group.elapsed_time * group.num_cores).sum() / 86400], 
                         'running_days': [(group.running_time * group.num_cores).sum() / 86400]})

user_df = lasts.groupby('user').apply(aggfunc)

Result:

        elapsed_days  running_days
user                              
a    0      3.240741      2.430556
d    0     10.416667      3.819444
s    0      4.629630      1.851852

The second catch is that the returned DataFrame has a hierarchical index (the column of zeros), which can be flattened as follows:

user_df.index = user_df.index.levels[0]

Result:

      elapsed_days  running_days
user                            
a         3.240741      2.430556
d        10.416667      3.819444
s         4.629630      1.851852
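
Assigning index.levels[0] relies on each group contributing exactly one row, since levels holds the sorted unique labels rather than one label per row. A minimal sketch of an alternative is to drop the inner level with droplevel. Selecting the columns explicitly before apply is my own addition, made on the assumption that newer pandas versions prefer the grouping column to be kept out of each group.

```python
import pandas as pd

lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'running_time': [30000, 20000, 30000, 15000],
                      'num_cores': [7, 8, 9, 4]})

def aggfunc(group):
    # One-row DataFrame per group; the values are single-element lists.
    return pd.DataFrame({
        'elapsed_days': [(group.elapsed_time * group.num_cores).sum() / 86400],
        'running_days': [(group.running_time * group.num_cores).sum() / 86400],
    })

cols = ['elapsed_time', 'running_time', 'num_cores']
user_df = lasts.groupby('user')[cols].apply(aggfunc)

# Remove the inner index level (the column of zeros), keeping 'user'.
flat = user_df.droplevel(1)
print(flat)
```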

- chthonicdaemon

0

This aggregation may be what you are looking for.

I added a sample dataset and applied the operations to a copy of lasts named lasts_.

import pandas as pd

lasts = pd.DataFrame({'user'        :['james','james','james','john','john'],
                      'elapsed_time':[ 200000, 400000, 300000,800000,900000],
                      'running_time':[ 100000, 100000, 200000,600000,700000],
                      'num_cores'   :[      4,      4,      4,     8,     8] })

# create a temporary df to add columns to, without modifying the original dataframe
# (double brackets select 'user' as a one-column DataFrame, so more columns can be added below)
lasts_ = lasts[['user']].copy()
lasts_['elapsed_days'] = lasts.loc[:,'elapsed_time'] * lasts.loc[:,'num_cores'] / 86400
lasts_['running_days'] = lasts.loc[:,'running_time'] * lasts.loc[:,'num_cores'] / 86400

# aggregate
by_user = lasts_.groupby('user').agg({'elapsed_days': 'sum', 
                                      'running_days': 'sum' })

# by_user:
# user  elapsed_days        running_days
# james 41.66666666666667   18.51851851851852
# john  157.4074074074074   120.37037037037037

If you want to keep "user" as a regular column rather than as the index, use:

by_user = lasts_.groupby('user', as_index=False).agg({'elapsed_days': 'sum', 
                                                      'running_days': 'sum'})
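
The same steps can also be sketched as a single chain, using assign to compute the derived columns without building the temporary frame by hand. This is only one possible arrangement, shown with the sample data from this answer.

```python
import pandas as pd

lasts = pd.DataFrame({'user':         ['james', 'james', 'james', 'john', 'john'],
                      'elapsed_time': [200000, 400000, 300000, 800000, 900000],
                      'running_time': [100000, 100000, 200000, 600000, 700000],
                      'num_cores':    [4, 4, 4, 8, 8]})

by_user = (
    lasts
    # assign returns a new DataFrame, so the original is left unmodified
    .assign(elapsed_days=lasts['elapsed_time'] * lasts['num_cores'] / 86400,
            running_days=lasts['running_time'] * lasts['num_cores'] / 86400)
    .groupby('user')[['elapsed_days', 'running_days']]
    .sum()
)
print(by_user)
```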

- jberrio


- zthomas.nc · Accepted Answer

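Another solid variation is to follow the approach @MaxU took in this solution to a similar question and wrap the individual functions in a pandas Series, so that only a reset_index() is needed to get back a DataFrame.

First, define the transformation functions: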

def ed(group):
    return (group.elapsed_time * group.num_cores).sum() / 86400

def rd(group):
    return (group.running_time * group.num_cores).sum() / 86400

Wrap them in a Series with get_stats:

def get_stats(group):
    return pd.Series({'elapsed_days': ed(group),
                      'running_days':rd(group)})

Finally:

lasts.groupby('user').apply(get_stats).reset_index()
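
Put together with the sample data from the other answers, the approach can be sketched as one self-contained script. Selecting the columns explicitly before apply is my own addition, made on the assumption that newer pandas versions prefer the grouping column to be kept out of each group.

```python
import pandas as pd

lasts = pd.DataFrame({'user': ['a', 's', 'd', 'd'],
                      'elapsed_time': [40000, 50000, 60000, 90000],
                      'running_time': [30000, 20000, 30000, 15000],
                      'num_cores': [7, 8, 9, 4]})

def get_stats(group):
    # Returning a Series makes apply build one row per group,
    # with the Series labels as column names.
    return pd.Series({
        'elapsed_days': (group.elapsed_time * group.num_cores).sum() / 86400,
        'running_days': (group.running_time * group.num_cores).sum() / 86400,
    })

cols = ['elapsed_time', 'running_time', 'num_cores']
user_df = lasts.groupby('user')[cols].apply(get_stats).reset_index()
print(user_df)
```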