Apply vs. transform on a group object: which one should be used to subtract two columns and take the mean?

Question

250

Consider the following dataframe:

import pandas as pd

columns = ['A', 'B', 'C', 'D']
records = [
    ['foo', 'one', 0.162003, 0.087469],
    ['bar', 'one', -1.156319, -1.5262719999999999],
    ['foo', 'two', 0.833892, -1.666304],     
    ['bar', 'three', -2.026673, -0.32205700000000004],
    ['foo', 'two', 0.41145200000000004, -0.9543709999999999],
    ['bar', 'two', 0.765878, -0.095968],
    ['foo', 'one', -0.65489, 0.678091],
    ['foo', 'three', -1.789842, -1.130922]
]
df = pd.DataFrame.from_records(records, columns=columns)

"""
     A      B         C         D
0  foo    one  0.162003  0.087469
1  bar    one -1.156319 -1.526272
2  foo    two  0.833892 -1.666304
3  bar  three -2.026673 -0.322057
4  foo    two  0.411452 -0.954371
5  bar    two  0.765878 -0.095968
6  foo    one -0.654890  0.678091
7  foo  three -1.789842 -1.130922
"""

The following commands work:

df.groupby('A').apply(lambda x: (x['C'] - x['D']))
df.groupby('A').apply(lambda x: (x['C'] - x['D']).mean())

but none of the following work:

df.groupby('A').transform(lambda x: (x['C'] - x['D']))
# KeyError or ValueError: could not broadcast input array from shape (5) into shape (5,3)

df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())
# KeyError or TypeError: cannot concatenate a non-NDFrame object

Why? The example in the documentation seems to suggest that calling transform on a group allows one to do row-wise processing:

# Note that the following suggests row-wise operation (x.mean is the column mean)
zscore = lambda x: (x - x.mean()) / x.std()
transformed = ts.groupby(key).transform(zscore)

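In other words, I thought that transform is essentially a specific type of apply (the one that does not aggregate). Where am I wrong?

For reference, below is the construction of the original dataframe above:

from numpy.random import randn

df = pd.DataFrame({'A' : ['foo', 'bar', 'foo', 'bar',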
                          'foo', 'bar', 'foo', 'foo'],
                   'B' : ['one', 'one', 'two', 'three',
                         'two', 'two', 'one', 'three'],
                   'C' : randn(8), 'D' : randn(8)})

- Amelio Vazquez-Reina

2

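The function passed to transform must return a number, a row, or something with the same shape as the argument. If it is a number, that number is set on all the elements in the group; if it is a row, it is broadcast to all the rows in the group. In your code, the lambda function returns a column, which cannot be broadcast to the group. - HYRY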

1

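Thanks @HYRY, but I am confused. If you look at the example in the documentation that I copied above (i.e. the one with zscore), transform receives a lambda function that assumes each x is an item within the group, and also returns a value per item in the group. What am I missing? - Amelio Vazquez-Reina

5 Answers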

202

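As I felt similarly confused by the .transform operation vs. .apply, I found a few answers shedding some light on the issue. This answer, for example, was very helpful.

My takeaway so far is that .transform works with each Series (column) in isolation from the others. This means that in your last two calls: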

df.groupby('A').transform(lambda x: (x['C'] - x['D']))
df.groupby('A').transform(lambda x: (x['C'] - x['D']).mean())

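You asked .transform to take values from two columns, but it does not actually "see" both of them at the same time (so to speak). transform looks at the dataframe columns one by one and returns a series (or group of series) "made" of scalars that are repeated len(input_column) times.

So the scalar that .transform uses to make the Series is the result of some reduction function applied on an input Series (and only on ONE series/column at a time).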
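Because transform only ever sees one column at a time, one way to get the per-group mean of C - D on every row is to compute the difference first and then group that single Series. The following is a minimal sketch on the question's dataframe; the names diff, mean_per_group and group_mean_diff are illustrative and do not come from the original post.

```python
import pandas as pd

columns = ['A', 'B', 'C', 'D']
records = [
    ['foo', 'one', 0.162003, 0.087469],
    ['bar', 'one', -1.156319, -1.5262719999999999],
    ['foo', 'two', 0.833892, -1.666304],
    ['bar', 'three', -2.026673, -0.32205700000000004],
    ['foo', 'two', 0.41145200000000004, -0.9543709999999999],
    ['bar', 'two', 0.765878, -0.095968],
    ['foo', 'one', -0.65489, 0.678091],
    ['foo', 'three', -1.789842, -1.130922],
]
df = pd.DataFrame.from_records(records, columns=columns)

# Row-wise difference: no grouping is needed for this step.
diff = df['C'] - df['D']

# Aggregation: one value per group, indexed by the group label.
mean_per_group = diff.groupby(df['A']).mean()

# Transform: the same per-group value, repeated on every row of the group.
group_mean_diff = diff.groupby(df['A']).transform('mean')

print(mean_per_group)
print(group_mean_diff)
```

The transformed Series keeps the original row index, so it can be assigned straight back as a new column.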

Consider this example (on your dataframe):

zscore = lambda x: (x - x.mean()) / x.std() # Note that it does not reference anything outside of 'x' and for transform 'x' is one column.
df.groupby('A').transform(zscore)

will yield:

       C      D
0  0.989  0.128
1 -0.478  0.489
2  0.889 -0.589
3 -0.671 -1.150
4  0.034 -0.285
5  1.149  0.662
6 -1.404 -0.907
7 -0.509  1.653

This is exactly the same as if you used it on only one column at a time:

df.groupby('A')['C'].transform(zscore)

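which yields the same values as the C column above.

Note that .apply in the last example (df.groupby('A')['C'].apply(zscore)) would work in exactly the same way, but it would fail if you tried using it on a dataframe: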

df.groupby('A').apply(zscore)

gives the error:

ValueError: operands could not be broadcast together with shapes (6,) (2,)

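So where else is .transform useful? The simplest case is trying to assign the results of a reduction function back to the original dataframe.

df['sum_C'] = df.groupby('A')['C'].transform('sum')
df.sort_values('A') # to clearly see the scalar ('sum') applies to the whole column of the group

yielding: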

     A      B      C      D  sum_C
1  bar    one  1.998  0.593  3.973
3  bar  three  1.287 -0.639  3.973
5  bar    two  0.687 -1.027  3.973
4  foo    two  0.205  1.274  4.373
2  foo    two  0.128  0.924  4.373
6  foo    one  2.113 -0.516  4.373
7  foo  three  0.657 -1.179  4.373
0  foo    one  1.270  0.201  4.373

Trying the same with .apply would give NaNs in sum_C, because .apply returns a reduced Series, which it does not know how to broadcast back:

df.groupby('A')['C'].apply(sum)

giving:

A
bar    3.973
foo    4.373
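To make the NaN behaviour concrete, here is a small sketch on an invented four-row dataframe (the data and the sum_C_* column names are assumptions for illustration). It uses the built-in sum aggregation in place of .apply(sum); both produce a Series indexed by group label, which is why assigning it directly does not line up with the row index.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar'],
                   'C': [1.0, 2.0, 3.0, 4.0]})

# Reduced result: one entry per group, indexed by the labels 'bar' and 'foo'.
reduced = df.groupby('A')['C'].sum()

# Column assignment aligns on index labels; 'bar'/'foo' never match 0..3,
# so every row ends up as NaN.
df['sum_C_aligned'] = reduced

# Mapping the group key through the reduced Series broadcasts the values.
df['sum_C_mapped'] = df['A'].map(reduced)

# transform performs the same broadcast in one step.
df['sum_C_transform'] = df.groupby('A')['C'].transform('sum')

print(df)
```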

There are also cases when .transform is used to filter the data:

df[df.groupby(['B'])['D'].transform(sum) < -1]

     A      B      C      D
3  bar  three  1.287 -0.639
7  foo  three  0.657 -1.179
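For comparison, the same kind of group-level filtering can also be written with GroupBy.filter, which keeps or drops whole groups based on one boolean per group. The sketch below uses made-up values (not the answer's random data) and checks that both spellings select the same rows.

```python
import pandas as pd

df = pd.DataFrame({'B': ['one', 'one', 'two', 'three', 'three'],
                   'D': [0.5, -0.2, 1.0, -0.7, -0.6]})

# Boolean mask built with transform: the group sum is repeated on every row,
# so the comparison yields one True/False per original row.
mask = df.groupby('B')['D'].transform('sum') < -1
via_transform = df[mask]

# GroupBy.filter evaluates the condition once per group and keeps all rows
# of the groups for which it returns True.
via_filter = df.groupby('B').filter(lambda g: g['D'].sum() < -1)

print(via_transform)
print(via_filter)
```

The transform version is handy when the mask needs to be combined with other row-level conditions; filter reads more directly when the condition is purely per group.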

I hope this adds a bit more clarity.

- Primer

20

I am going to use a very simple snippet to illustrate the difference:

test = pd.DataFrame({'id':[1,2,3,1,2,3,1,2,3], 'price':[1,2,3,2,3,1,3,1,2]})
grouping = test.groupby('id')['price']

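The DataFrame looks like this:

   id  price
0   1      1
1   2      2
2   3      3
3   1      2
4   2      3
5   3      1
6   1      3
7   2      1
8   3      2

There are 3 customer IDs in this table; each customer made three transactions and paid 1, 2 and 3 dollars.

Now, I want to find the minimum payment made by each customer. There are two ways of doing it:

Using apply (here through the built-in aggregation min):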

grouping.min()

The return looks like this:

id
1    1
2    1
3    1
Name: price, dtype: int64

pandas.core.series.Series # return type
Int64Index([1, 2, 3], dtype='int64', name='id') #The returned Series' index
# length is 3

Using transform:

grouping.transform(min)

The return looks like this:

0    1
1    1
2    1
3    1
4    1
5    1
6    1
7    1
8    1
Name: price, dtype: int64

pandas.core.series.Series # return type
RangeIndex(start=0, stop=9, step=1) # The returned Series' index
# length is 9

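Both methods return a Series object, but the length of the first one is 3 and the length of the second one is 9.

If you want to answer "What is the minimum price paid by each customer?", then the apply method is the more suitable one to choose.

If you want to answer "What is the difference between the amount paid for each transaction and the minimum payment?", then you want to use transform, because:

test['minimum'] = grouping.transform(min) # creates an extra column filled with the minimum payment
test.price - test.minimum # returns the difference for each row

Apply does not work here simply because it returns a Series of size 3, while the original df has length 9. You cannot integrate it back into the original df easily.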
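The length difference described above can be checked directly. This sketch reuses the answer's test dataframe; the names per_customer, per_row and above_min are illustrative additions.

```python
import pandas as pd

test = pd.DataFrame({'id': [1, 2, 3, 1, 2, 3, 1, 2, 3],
                     'price': [1, 2, 3, 2, 3, 1, 3, 1, 2]})
grouping = test.groupby('id')['price']

per_customer = grouping.min()        # one row per id (length 3)
per_row = grouping.transform('min')  # one row per transaction (length 9)

# The transformed Series shares the original row index, so it can be
# subtracted from the price column row by row without an extra column.
test['above_min'] = test['price'] - per_row

print(len(per_customer), len(per_row))
print(test)
```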

- Cheng

5

tmp = df.groupby(['A'])['C'].transform('mean')

is like

tmp1 = df.groupby(['A']).agg({'C':'mean'})
tmp = df['A'].map(tmp1['C'])

or

tmp1 = df.groupby(['A'])['C'].mean()
tmp = df['A'].map(tmp1)
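The equivalence stated above can be verified on a small example. The dataframe below is invented for the check, and the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame({'A': ['foo', 'bar', 'foo', 'bar', 'foo'],
                   'C': [1.0, 2.0, 3.0, 6.0, 11.0]})

# One step: group mean broadcast to every row.
via_transform = df.groupby('A')['C'].transform('mean')

# Two steps: aggregate to one mean per group, then look each row's key up.
group_means = df.groupby('A')['C'].mean()
via_map = df['A'].map(group_means)

# Both results are aligned on the original row index.
max_abs_difference = (via_transform - via_map).abs().max()
print(max_abs_difference)
```

The two results differ only in the Series name (C for the transform, A for the map).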

- shui

0

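You can use zscore to analyze the data in columns C and D for outliers, where zscore is (series - series.mean()) / series.std(). Use apply with a user-defined function to compute the difference between C and D and build a new resulting dataframe. Apply works on the group result set.

import pandas as pd
from scipy.stats import zscore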

columns = ['A', 'B', 'C', 'D']
records = [
['foo', 'one', 0.162003, 0.087469],
['bar', 'one', -1.156319, -1.5262719999999999],
['foo', 'two', 0.833892, -1.666304],     
['bar', 'three', -2.026673, -0.32205700000000004],
['foo', 'two', 0.41145200000000004, -0.9543709999999999],
['bar', 'two', 0.765878, -0.095968],
['foo', 'one', -0.65489, 0.678091],
['foo', 'three', -1.789842, -1.130922]
]
df = pd.DataFrame.from_records(records, columns=columns)
print(df)

standardize = df.groupby('A')[['C', 'D']].transform(zscore)
print(standardize)
outliersC= (standardize['C'] <-1.1) | (standardize['C']>1.1)
outliersD= (standardize['D'] <-1.1) | (standardize['D']>1.1)

results=df[outliersC | outliersD]
print(results)

   #Dataframe results
   A      B         C         D
   0  foo    one  0.162003  0.087469
   1  bar    one -1.156319 -1.526272
   2  foo    two  0.833892 -1.666304
   3  bar  three -2.026673 -0.322057
   4  foo    two  0.411452 -0.954371
   5  bar    two  0.765878 -0.095968
   6  foo    one -0.654890  0.678091
   7  foo  three -1.789842 -1.130922
 #C and D transformed Z score
           C         D
 0  0.398046  0.801292
 1 -0.300518 -1.398845
 2  1.121882 -1.251188
 3 -1.046514  0.519353
 4  0.666781 -0.417997
 5  1.347032  0.879491
 6 -0.482004  1.492511
 7 -1.704704 -0.624618

 #filtering using arbitrary ranges -1.1 and 1.1 for the z-score
      A      B         C         D
 1  bar    one -1.156319 -1.526272
 2  foo    two  0.833892 -1.666304
 5  bar    two  0.765878 -0.095968
 6  foo    one -0.654890  0.678091
 7  foo  three -1.789842 -1.130922
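One detail worth spelling out: scipy.stats.zscore uses ddof=0 by default (population standard deviation), while pandas' Series.std() defaults to ddof=1, so a hand-written (x - x.mean()) / x.std() gives slightly different numbers. The sketch below reproduces the grouped z-scores with pandas only; the helper names zscore_population and zscore_sample are illustrative.

```python
import pandas as pd

columns = ['A', 'B', 'C', 'D']
records = [
    ['foo', 'one', 0.162003, 0.087469],
    ['bar', 'one', -1.156319, -1.5262719999999999],
    ['foo', 'two', 0.833892, -1.666304],
    ['bar', 'three', -2.026673, -0.32205700000000004],
    ['foo', 'two', 0.41145200000000004, -0.9543709999999999],
    ['bar', 'two', 0.765878, -0.095968],
    ['foo', 'one', -0.65489, 0.678091],
    ['foo', 'three', -1.789842, -1.130922],
]
df = pd.DataFrame.from_records(records, columns=columns)

def zscore_population(x):
    # ddof=0 divides by n, matching scipy.stats.zscore's default.
    return (x - x.mean()) / x.std(ddof=0)

def zscore_sample(x):
    # pandas' default ddof=1 divides by n - 1.
    return (x - x.mean()) / x.std()

population = df.groupby('A')[['C', 'D']].transform(zscore_population)
sample = df.groupby('A')[['C', 'D']].transform(zscore_sample)

print(population)
print(sample)
```

With groups this small the choice of ddof changes the scores noticeably, which matters when a fixed cut-off such as 1.1 is used to flag outliers.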


 >>>>>>>>>>>>> Part 2

 splitting = df.groupby('A')

 #look at how the data is grouped
 for group_name, group in splitting:
     print(group_name)

 def column_difference(gr):
      return gr['C']-gr['D']

 grouped=splitting.apply(column_difference)
 print(grouped)

 A     
 bar  1    0.369953
      3   -1.704616
      5    0.861846
 foo  0    0.074534
      2    2.500196
      4    1.365823
      6   -1.332981
      7   -0.658920

- Golden Lion

- Ted Petrou · Accepted Answer

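The two major differences between `apply` and `transform`

There are two major differences between the transform and apply groupby methods.

Input:
- apply implicitly passes all the columns for each group as a DataFrame to the custom function,
- while transform passes each column for each group individually as a Series to the custom function.
Output:
- The custom function passed to apply can return a scalar, a Series or a DataFrame (or a numpy array or even a list).
- The custom function passed to transform must return a sequence (a one-dimensional Series, array or list) of the same length as the group.

So, transform works on just one Series at a time, while apply works on the entire DataFrame at once.

Inspecting the custom function

It can help quite a bit to inspect the input to the custom function passed to apply or transform.

Examples

Let's create some sample data and inspect the groups so that you can see what I am talking about: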

import pandas as pd
import numpy as np
df = pd.DataFrame({'State':['Texas', 'Texas', 'Florida', 'Florida'], 
                   'a':[4,5,1,3], 'b':[6,10,3,11]})

     State  a   b
0    Texas  4   6
1    Texas  5  10
2  Florida  1   3
3  Florida  3  11

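Let's create a simple custom function that prints out the type of the implicitly passed object and then raises an exception so that execution stops.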

def inspect(x):
    print(type(x))
    raise

Now let's pass this function to both the groupby apply and transform methods to see what object is passed to it:

df.groupby('State').apply(inspect)

<class 'pandas.core.frame.DataFrame'>
<class 'pandas.core.frame.DataFrame'>
RuntimeError

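As you can see, a DataFrame is passed into the inspect function. You might be wondering why the type, DataFrame, got printed out twice. Pandas runs the first group twice to determine whether there is a fast way to complete the computation. This is a minor detail that you should not worry about.

Now, let's do the same thing with transform.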

df.groupby('State').transform(inspect)
<class 'pandas.core.series.Series'>
<class 'pandas.core.series.Series'>
RuntimeError

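It is passed a Series, which is a totally different pandas object.

So, transform is only allowed to work with a single Series at a time. It cannot act on two columns at the same time. If we try to subtract column b from column a inside our custom function, we get an error with transform. See below: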

def subtract_two(x):
    return x['a'] - x['b']

df.groupby('State').transform(subtract_two)
KeyError: ('a', 'occurred at index a')

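We get a KeyError because pandas is attempting to find the Series index a, which does not exist. You can complete this operation with apply, as it has the entire DataFrame: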

df.groupby('State').apply(subtract_two)

State     
Florida  2   -2
         3   -8
Texas    0   -2
         1   -5
dtype: int64

The output is a Series and is a little confusing because the original index is kept, but we have access to all columns.

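Displaying the passed pandas object

It can help even more to display the entire pandas object within the custom function, so you can see exactly what you are operating on. You can use print statements, but I like to use the display function from the IPython.display module so that the DataFrames are nicely rendered as HTML in a jupyter notebook: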

from IPython.display import display
def subtract_two(x):
    display(x)
    return x['a'] - x['b']

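Transform must return a single dimensional sequence the same size as the group

The other difference is that transform must return a single dimensional sequence the same size as the group. In this particular instance, each group has two rows, so transform must return a sequence of two rows. If it does not, an error is raised: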

def return_three(x):
    return np.array([1, 2, 3])

df.groupby('State').transform(return_three)
ValueError: transform must return a scalar value for each group

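The error message is not really descriptive of the problem. You must return a sequence the same length as the group. So, a function like this would work: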

def rand_group_len(x):
    return np.random.rand(len(x))

df.groupby('State').transform(rand_group_len)

          a         b
0  0.962070  0.151440
1  0.440956  0.782176
2  0.642218  0.483257
3  0.056047  0.238208
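A deterministic variant of the same idea may be easier to check by eye: subtracting the group mean returns a sequence of the same length as the group, so it is a valid transform function. The helper name demean is an illustrative assumption.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

def demean(x):
    # x - x.mean() has the same length and index as x, which is the shape
    # transform expects back from the custom function.
    return x - x.mean()

centered = df.groupby('State')[['a', 'b']].transform(demean)
print(centered)
```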

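It also works to return a single scalar object for `transform`

If you return just a single scalar from your custom function, then transform will use it for each of the rows in the group: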

def group_sum(x):
    return x.sum()

df.groupby('State').transform(group_sum)

   a   b
0  9  16
1  9  16
2  4  14
3  4  14
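For common reductions, the custom function can be replaced by the reduction's name as a string, and the broadcast result can be joined back to the original rows. This is a short sketch; the _state_sum suffix and the with_sums name are illustrative choices.

```python
import pandas as pd

df = pd.DataFrame({'State': ['Texas', 'Texas', 'Florida', 'Florida'],
                   'a': [4, 5, 1, 3], 'b': [6, 10, 3, 11]})

# The string 'sum' selects the built-in group sum; the result has one row
# per original row, with the group total repeated within each group.
sums = df.groupby('State')[['a', 'b']].transform('sum')

# Rename the columns and join on the shared row index.
with_sums = df.join(sums.add_suffix('_state_sum'))
print(with_sums)
```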
