Applying a function that returns multiple values to the rows of a pandas DataFrame

Question

python pandas dataframe apply iterable-unpacking

97

I have a DataFrame with a time index and three columns containing the coordinates of a 3D vector:

                         x             y             z
ts
2014-05-15 10:38         0.120117      0.987305      0.116211
2014-05-15 10:39         0.117188      0.984375      0.122070
2014-05-15 10:40         0.119141      0.987305      0.119141
2014-05-15 10:41         0.116211      0.984375      0.120117
2014-05-15 10:42         0.119141      0.983398      0.118164

I would like to apply a transformation to each row that also returns a vector:

def myfunc(a, b, c):
    # do something
    return e, f, g

But if I do:

df.apply(myfunc, axis=1)

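I end up with a pandas Series whose elements are tuples. This is because apply takes the result of myfunc without unpacking it. How can I change myfunc so that I get a new df with three columns?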
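
The behaviour described above can be reproduced with a minimal sketch. The body of `myfunc` is an arbitrary placeholder (an assumption, not the real transformation), and the DataFrame reuses only the first two rows of the data shown earlier.

```python
import pandas as pd

# First two rows of the question's data.
df = pd.DataFrame(
    {"x": [0.120117, 0.117188], "y": [0.987305, 0.984375], "z": [0.116211, 0.122070]},
    index=pd.to_datetime(["2014-05-15 10:38", "2014-05-15 10:39"]),
)

def myfunc(row):
    # Placeholder transformation, assumed for illustration only.
    return row["x"] + 1, row["y"] + 1, row["z"] + 1

# apply passes each row as one Series and stores the returned tuple as-is.
result = df.apply(myfunc, axis=1)
print(type(result))
print(type(result.iloc[0]))
```

Each element of `result` is the whole tuple, which is why no new columns appear.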

Edit:

All the solutions below work. The Series solution allows for column names, while the list solution seems to execute faster.

def myfunc1(args):
    e=args[0] + 2*args[1]
    f=args[1]*args[2] +1
    g=args[2] + args[0] * args[1]
    return pd.Series([e,f,g], index=['a', 'b', 'c'])

def myfunc2(args):
    e=args[0] + 2*args[1]
    f=args[1]*args[2] +1
    g=args[2] + args[0] * args[1]
    return [e,f,g]

%timeit df.apply(myfunc1 ,axis=1)

100 loops, best of 3: 4.51 ms per loop

%timeit df.apply(myfunc2 ,axis=1)

100 loops, best of 3: 2.75 ms per loop

- Fra

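1

"It is useful to unpack the tuple (or list) returned by a function into multiple columns" would read better than "This is because apply will take the result of myfunc without unpacking it. How can I change myfunc so that I obtain a new df with 3 columns?" Tagged tuple-unpacking/iterable-unpacking. - smci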

7 Answers

44

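Based on the answer by @U2EF1, I wrote a convenience function that applies a tuple-returning function to a DataFrame field and expands the result into new columns of the DataFrame.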

def apply_and_concat(dataframe, field, func, column_names):
    return pd.concat((
        dataframe,
        dataframe[field].apply(
            lambda cell: pd.Series(func(cell), index=column_names))), axis=1)

Usage:

df = pd.DataFrame([1, 2, 3], index=['a', 'b', 'c'], columns=['A'])
print(df)
   A
a  1
b  2
c  3

def func(x):
    return x*x, x*x*x

print(apply_and_concat(df, 'A', func, ['x^2', 'x^3']))

   A  x^2  x^3
a  1    1    1
b  2    4    8
c  3    9   27

Hope this helps someone.

- Dennis Golomazov

1

This is great. Saved me a lot of time. Thanks! - stevehs17

26

Some of the other answers contain errors, so I have summarized the working solutions below.

Prepare the dataset. The pandas version used is 1.1.5.

import numpy as np
import pandas as pd
import timeit

# check pandas version
print(pd.__version__)
# 1.1.5

# prepare DataFrame
df = pd.DataFrame({
    'x': [0.120117, 0.117188, 0.119141, 0.116211, 0.119141],
    'y': [0.987305, 0.984375, 0.987305, 0.984375, 0.983398],
    'z': [0.116211, 0.122070, 0.119141, 0.120117, 0.118164]},
    index=[
        '2014-05-15 10:38',
        '2014-05-15 10:39',
        '2014-05-15 10:40',
        '2014-05-15 10:41',
        '2014-05-15 10:42'],
    columns=['x', 'y', 'z'])
df.index.name = 'ts'
#                          x         y         z
# ts                                            
# 2014-05-15 10:38  0.120117  0.987305  0.116211
# 2014-05-15 10:39  0.117188  0.984375  0.122070
# 2014-05-15 10:40  0.119141  0.987305  0.119141
# 2014-05-15 10:41  0.116211  0.984375  0.120117
# 2014-05-15 10:42  0.119141  0.983398  0.118164

Solution 01.

Return a pd.Series from the applied function.

def myfunc1(args):
    e = args[0] + 2*args[1]
    f = args[1]*args[2] + 1
    g = args[2] + args[0] * args[1]
    return pd.Series([e, f, g])

df[['e', 'f', 'g']] = df.apply(myfunc1, axis=1)
#                          x         y         z         e         f         g
# ts
# 2014-05-15 10:38  0.120117  0.987305  0.116211  2.094727  1.114736  0.234803
# 2014-05-15 10:39  0.117188  0.984375  0.122070  2.085938  1.120163  0.237427
# 2014-05-15 10:40  0.119141  0.987305  0.119141  2.093751  1.117629  0.236770
# 2014-05-15 10:41  0.116211  0.984375  0.120117  2.084961  1.118240  0.234512
# 2014-05-15 10:42  0.119141  0.983398  0.118164  2.085937  1.116202  0.235327

t1 = timeit.timeit(
    'df.apply(myfunc1, axis=1)',
    globals=dict(df=df, myfunc1=myfunc1), number=10000)
print(round(t1, 3), 'seconds')
# 14.571 seconds

Solution 02.

Pass result_type='expand' to apply.

def myfunc2(args):
    e = args[0] + 2*args[1]
    f = args[1]*args[2] + 1
    g = args[2] + args[0] * args[1]
    return [e, f, g]

df[['e', 'f', 'g']] = df.apply(myfunc2, axis=1, result_type='expand')
#                          x         y         z         e         f         g
# ts                                                                          
# 2014-05-15 10:38  0.120117  0.987305  0.116211  2.094727  1.114736  0.234803
# 2014-05-15 10:39  0.117188  0.984375  0.122070  2.085938  1.120163  0.237427
# 2014-05-15 10:40  0.119141  0.987305  0.119141  2.093751  1.117629  0.236770
# 2014-05-15 10:41  0.116211  0.984375  0.120117  2.084961  1.118240  0.234512
# 2014-05-15 10:42  0.119141  0.983398  0.118164  2.085937  1.116202  0.235327

t2 = timeit.timeit(
    "df.apply(myfunc2, axis=1, result_type='expand')",
    globals=dict(df=df, myfunc2=myfunc2), number=10000)
print(round(t2, 3), 'seconds')
# 9.907 seconds

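Solution 03.

If you want it to run faster, use np.vectorize. Note that with np.vectorize the function cannot take the whole row as a single args argument; each column has to be passed as a separate argument.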

def myfunc3(args0, args1, args2):
    e = args0 + 2*args1
    f = args1*args2 + 1
    g = args2 + args0 * args1
    return [e, f, g]

df[['e', 'f', 'g']] = pd.DataFrame(np.row_stack(np.vectorize(myfunc3, otypes=['O'])(df['x'], df['y'], df['z'])), index=df.index)
#                          x         y         z         e         f         g
# ts                                                                          
# 2014-05-15 10:38  0.120117  0.987305  0.116211  2.094727  1.114736  0.234803
# 2014-05-15 10:39  0.117188  0.984375  0.122070  2.085938  1.120163  0.237427
# 2014-05-15 10:40  0.119141  0.987305  0.119141  2.093751  1.117629  0.236770
# 2014-05-15 10:41  0.116211  0.984375  0.120117  2.084961  1.118240  0.234512
# 2014-05-15 10:42  0.119141  0.983398  0.118164  2.085937  1.116202  0.235327

t3 = timeit.timeit(
    "pd.DataFrame(np.row_stack(np.vectorize(myfunc3, otypes=['O'])(df['x'], df['y'], df['z'])), index=df.index)",
    globals=dict(pd=pd, np=np, df=df, myfunc3=myfunc3), number=10000)
print(round(t3, 3), 'seconds')
# 1.598 seconds
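
As far as I know, `np.row_stack` is deprecated in recent NumPy releases (2.0 and later) in favour of `np.vstack`. The sketch below is the same idea adapted under that assumption, on a two-row subset of the data. It is not a re-benchmark, so the timing above does not carry over to it.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    {"x": [0.120117, 0.117188], "y": [0.987305, 0.984375], "z": [0.116211, 0.122070]},
    index=["2014-05-15 10:38", "2014-05-15 10:39"],
)

def myfunc3(args0, args1, args2):
    e = args0 + 2 * args1
    f = args1 * args2 + 1
    g = args2 + args0 * args1
    return [e, f, g]

# otypes=['O'] makes np.vectorize return one object array whose elements are the lists.
rows = np.vectorize(myfunc3, otypes=["O"])(df["x"], df["y"], df["z"])

# np.vstack stacks the per-row lists into a 2-D array of shape (n_rows, 3).
stacked = np.vstack(rows)
df[["e", "f", "g"]] = pd.DataFrame(stacked, index=df.index)
print(df)
```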

- Keiku

20

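I tried returning a tuple (I was using functions like scipy.stats.pearsonr, which return that kind of structure), but I got a 1D Series instead of the DataFrame I expected. Creating a Series manually gave worse performance, so I fixed it with result_type, as explained in the official API documentation:

Returning a Series inside the function is similar to passing result_type='expand'. The resulting column names will be the Series index.

So you could edit your code this way: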

def myfunc(a, b, c):
    # do something
    return (e, f, g)

df.apply(myfunc, axis=1, result_type='expand')
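
Since `apply(..., axis=1)` hands each row to the function as a single Series, a function written with three separate parameters needs the row unpacked first. Here is a minimal sketch of that. The body of `myfunc` is a made-up placeholder and the lambda wrapper is my own addition, not part of the snippet above.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0], "z": [5.0, 6.0]})

def myfunc(a, b, c):
    # Placeholder computation; returns a plain tuple.
    return a + b, b * c, c - a

# The lambda unpacks the three values of the row into a, b and c.
expanded = df.apply(lambda row: myfunc(*row), axis=1, result_type="expand")

# The expanded columns are labelled 0, 1, 2 by default, so rename them.
expanded.columns = ["e", "f", "g"]
print(expanded)
```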

- Genarito

1

I like this one, and it seems the most pandaic, but it is only compatible with pandas >= 0.23 (per Genarito's link to the API docs). - spen.smith

9

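If you want to create two or three (or n) new columns in your DataFrame, you can use: df['e'], df['f'], df['g'] = df.apply(myfunc, axis=1, result_type='expand').T.values - spen.smith

Can we use .apply to return more rows than df already has, to create a sparse copy? Say df has 100 rows and the function returns 100 rows for each row; the resulting DataFrame should then have 100*100 rows. Is that possible? - vevek seetharaman

Honestly, I don't know. Probably the best thing you can do is ask another question on Stack Overflow to get an answer tailored to that case. - Genarito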

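I had to use df['e'], df['f'], df['g'] = df.apply(myfunc, axis=1, result_type='expand').T.values as suggested by @spen.smith. Without it, assigning the columns directly gave the values 0 and 1: with df["A"], df["B"] = df.apply(foo, axis=1, result_type="expand"), where foo returns ["A", "B"] or ("A", "B"), columns A and B were filled with 0 and 1 respectively. - chooks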
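
The 0 and 1 reported in this comment are consistent with how tuple unpacking treats a DataFrame: iterating over a DataFrame yields its column labels, and the expanded result carries the default labels 0, 1, and so on. A small sketch with a hypothetical `foo`:

```python
import pandas as pd

df = pd.DataFrame({"x": [1, 2, 3]})

def foo(row):
    # Hypothetical function returning two values per row.
    return [row["x"] * 10, row["x"] * 100]

expanded = df.apply(foo, axis=1, result_type="expand")

# Unpacking the DataFrame itself iterates over its column labels.
first, second = expanded
print(first, second)

# Transposing and taking .values yields one array per output column instead.
df["A"], df["B"] = expanded.T.values
print(df)
```

Assigning with `df[["A", "B"]] = expanded` avoids the unpacking step altogether.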

Thanks! Works perfectly! - zenalc

13

Just return a list instead of a tuple.

In [81]: df
Out[81]: 
                            x         y         z
ts                                               
2014-05-15 10:38:00  0.120117  0.987305  0.116211
2014-05-15 10:39:00  0.117188  0.984375  0.122070
2014-05-15 10:40:00  0.119141  0.987305  0.119141
2014-05-15 10:41:00  0.116211  0.984375  0.120117
2014-05-15 10:42:00  0.119141  0.983398  0.118164

[5 rows x 3 columns]

In [82]: def myfunc(args):
   ....:        e=args[0] + 2*args[1]
   ....:        f=args[1]*args[2] +1
   ....:        g=args[2] + args[0] * args[1]
   ....:        return [e,f,g]
   ....: 

In [83]: df.apply(myfunc ,axis=1)
Out[83]: 
                            x         y         z
ts                                               
2014-05-15 10:38:00  2.094727  1.114736  0.234803
2014-05-15 10:39:00  2.085938  1.120163  0.237427
2014-05-15 10:40:00  2.093751  1.117629  0.236770
2014-05-15 10:41:00  2.084961  1.118240  0.234512
2014-05-15 10:42:00  2.085937  1.116202  0.235327

- Happy001

31

This doesn't work. It returns a Series whose elements are lists. I'm on pandas 0.18.1. - Kaushik Ghose

1

See the U2EF1 response below: wrap the resulting list in a pd.Series(). - Bernard

11

I found a possible solution by changing myfunc to return an np.array, like this:

import numpy as np

def myfunc(a, b, c):
    # do something
    return np.array((e, f, g))

Is there a better solution?

- Fra

1

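Returning a NumPy array seems to be the best option in terms of performance. For 100K rows, returning a NumPy array to get the DataFrame columns takes 1.55 seconds, while returning a Series takes 39.7 seconds. That is a large difference. - Praveen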

5

DataFrame.apply in pandas 1.0.5 has a result_type parameter that can help here. From the docs:

These only act when axis=1 (columns):

‘expand’ : list-like results will be turned into columns.

‘reduce’ : returns a Series if possible rather than expanding list-like results. This is the opposite of ‘expand’.

‘broadcast’ : results will be broadcast to the original shape of the DataFrame, the original index and columns will be retained.
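
To make the three options concrete, here is a small sketch on a toy DataFrame. The function `triple` and its arithmetic are hypothetical, chosen only so that it returns as many items as there are columns, which 'broadcast' requires.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0], "z": [5.0, 6.0]})

def triple(row):
    # Returns a list with as many items as df has columns.
    return [row["x"] + row["y"], row["y"] * row["z"], row["z"] - row["x"]]

# 'expand': a DataFrame whose columns are labelled 0, 1, 2.
expanded = df.apply(triple, axis=1, result_type="expand")

# 'reduce': a Series in which every element is the returned list.
reduced = df.apply(triple, axis=1, result_type="reduce")

# 'broadcast': a DataFrame that keeps the original index and column labels.
broadcast = df.apply(triple, axis=1, result_type="broadcast")

print(expanded, reduced, broadcast, sep="\n\n")
```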

- Rachel Shalom

- U2EF1 · Accepted Answer

93

Return a Series, and the results will be put into a DataFrame.

def myfunc(a, b, c):
    # do something
    return pd.Series([e, f, g])

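A bonus of this approach is that you can give a label to each of the resulting columns. If you return a DataFrame instead, it just inserts multiple rows for the group.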
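
A minimal runnable sketch of the labelled variant, reusing the arithmetic from the question's edit on a toy DataFrame. The toy data and the use of `df.join` to attach the new columns are my own choices for illustration.

```python
import pandas as pd

df = pd.DataFrame({"x": [1.0, 2.0], "y": [3.0, 4.0], "z": [5.0, 6.0]})

def myfunc(row):
    # The row arrives as one Series, addressed here by column label.
    e = row["x"] + 2 * row["y"]
    f = row["y"] * row["z"] + 1
    g = row["z"] + row["x"] * row["y"]
    # The index of the returned Series becomes the new column names.
    return pd.Series([e, f, g], index=["e", "f", "g"])

new_cols = df.apply(myfunc, axis=1)
result = df.join(new_cols)
print(result)
```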

- U2EF1

See more examples at: flexible apply. - smile-on

8

The Series solution seems to be the canonical answer. However, on version 0.18.1 it takes about 4x longer than running apply() multiple times. - Kaushik Ghose

4

Isn't it very inefficient to create a whole pd.Series on every iteration? - Marses

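When trying this approach I got "AttributeError: 'float' object has no attribute 'index'", and I wasn't sure why it was trying to get an index from one of the values (a float). (Edit) The problem was that I had two return statements, one of which returned a bare NaN; that one also needs to be wrapped in pd.Series(). - nicway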
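
The fix described in this comment amounts to making every return path produce a Series with the same labels. A sketch under that reading, where the column names and the negative-value condition are invented for illustration:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"x": [1.0, -2.0], "y": [3.0, 4.0]})
labels = ["total", "ratio"]

def myfunc(row):
    if row["x"] < 0:
        # Early exit: still return a Series with the same labels, filled with NaN.
        return pd.Series([np.nan, np.nan], index=labels)
    return pd.Series([row["x"] + row["y"], row["x"] / row["y"]], index=labels)

result = df.apply(myfunc, axis=1)
print(result)
```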

4

To add to this nice answer, you can go one step further with new_vars = ['e', 'f', 'g'] and df[new_vars] = df.apply(my_func, axis=1). - Quetzalcoatl