Summing two columns in a pandas DataFrame

Question

66

When I use this syntax, it creates a Series rather than adding a column to my new DataFrame sum.

My code:

sum = data['variance'] = data.budget + data.actual

My DataFrame data currently has everything except the budget - actual column. How do I create a variance column?

    cluster  date                  budget  actual budget - actual
0   a        2014-01-01  00:00:00  11000   10000       1000
1   a        2014-02-01  00:00:00  1200    1000
2   a        2014-03-01  00:00:00  200     100
3   b        2014-04-01  00:00:00  200     300
4   b        2014-05-01  00:00:00  400     450
5   c        2014-06-01  00:00:00  700     1000
6   c        2014-07-01  00:00:00  1200    1000
7   c        2014-08-01  00:00:00  200     100
8   c        2014-09-01  00:00:00  200     300

- yoshiserry

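Either I can't reproduce your result or I don't understand what you mean. Are you saying that data does not contain a 'variance' column after you run this? Why are you assigning to two variables at once? What result are you trying to get? - BrenBarn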

What I mean is that if I print the sum variable, I get a Series rather than a new DataFrame that has everything my original DataFrame (df) has plus a variance (budget - actual) column. - yoshiserry

1

data.budget + data.actual returns a Series. You assign that to sum, so sum is a Series. If you want a DataFrame, you need to create a DataFrame and then assign data.budget + data.actual to a column of that DataFrame. - BrenBarn
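
To illustrate the comment above, here is a minimal sketch on a made-up three-row frame (the values are placeholders, not the asker's full data). It assumes that DataFrame.assign is an acceptable way to obtain a new DataFrame with the extra column while leaving data untouched; the names total and with_variance are hypothetical.

```python
import pandas as pd

# Hypothetical sample using the column names from the question.
data = pd.DataFrame({
    "cluster": ["a", "a", "b"],
    "budget": [11000, 1200, 200],
    "actual": [10000, 1000, 300],
})

# Adding two columns yields a Series, not a DataFrame.
total = data.budget + data.actual

# assign() returns a new DataFrame that includes the extra column;
# the original `data` is not modified.
with_variance = data.assign(variance=total)

print(type(total).__name__)
print(with_variance.columns.tolist())
```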

7 Answers

30

df['variance'] = df.loc[:,['budget','actual']].sum(axis=1)

- pylist

If you need to sum across multiple columns: df['variance'] = df.iloc[:, 1: ].sum(axis=1) - Deepak

2

df['variance'] = df[['budget','actual']].sum(axis=1) # looks prettier - Echo9k

3

This is the most elegant solution; it follows the DRY principle and works very well.

dataframe_name[['col1', 'col2', 'col3']].sum(axis=1, skipna=True)

Thanks.

- Sahaj Raj Malla
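
As a rough illustration of what the skipna argument in the answer above changes, the sketch below sums three made-up columns row by row with and without skipping missing values. The column names and numbers are placeholders chosen for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value in col2.
df = pd.DataFrame({
    "col1": [1.0, 2.0],
    "col2": [10.0, np.nan],
    "col3": [100.0, 200.0],
})
cols = ["col1", "col2", "col3"]

# Indexing with a list of names selects several columns at once.
skipped = df[cols].sum(axis=1, skipna=True)   # NaN is ignored in each row
strict = df[cols].sum(axis=1, skipna=False)   # a NaN makes that row's sum NaN

print(skipped)
print(strict)
```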

2

You can also use the .add() function:

 df.loc[:,'variance'] = df.loc[:,'budget'].add(df.loc[:,'actual'])

- Archie
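
One reason to prefer .add() over the + operator is its fill_value parameter, which substitutes a value where only one of the two operands is missing. The sketch below uses two made-up Series to show the difference; it is an illustration, not part of the answer above.

```python
import numpy as np
import pandas as pd

# Hypothetical columns, each with one missing entry.
budget = pd.Series([100.0, np.nan, 300.0])
actual = pd.Series([10.0, 20.0, np.nan])

# Plain addition propagates NaN from either side.
plain = budget + actual

# With fill_value=0, a value missing on one side only is treated as 0.
filled = budget.add(actual, fill_value=0)

print(plain)
print(filled)
```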

2

The same can be done with a lambda function. Here I am reading the data from an xlsx file.

import pandas as pd
df = pd.read_excel("data.xlsx", sheet_name = 4)
print(df)

Output:

  cluster Unnamed: 1      date  budget  actual
0       a 2014-01-01  00:00:00   11000   10000
1       a 2014-02-01  00:00:00    1200    1000
2       a 2014-03-01  00:00:00     200     100
3       b 2014-04-01  00:00:00     200     300
4       b 2014-05-01  00:00:00     400     450
5       c 2014-06-01  00:00:00     700    1000
6       c 2014-07-01  00:00:00    1200    1000
7       c 2014-08-01  00:00:00     200     100
8       c 2014-09-01  00:00:00     200     300

Add the two columns and produce a third column.

df['variance'] = df.apply(lambda x: x['budget'] + x['actual'], axis=1)
print(df)

Output:

  cluster Unnamed: 1      date  budget  actual  variance
0       a 2014-01-01  00:00:00   11000   10000     21000
1       a 2014-02-01  00:00:00    1200    1000      2200
2       a 2014-03-01  00:00:00     200     100       300
3       b 2014-04-01  00:00:00     200     300       500
4       b 2014-05-01  00:00:00     400     450       850
5       c 2014-06-01  00:00:00     700    1000      1700
6       c 2014-07-01  00:00:00    1200    1000      2200
7       c 2014-08-01  00:00:00     200     100       300
8       c 2014-09-01  00:00:00     200     300       500

- LOrD_ARaGOrN

Is there an alternative when there are many columns and you don't want to write x['col1']+...+x['coln']? - alancalvitti

I think this is the best solution :) - ambigus9
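
One possible answer to the many-columns question above, sketched under the assumption that every numeric column should be included: select the numeric columns by dtype and sum them row-wise, so no column name has to be typed out. The forecast column and the total name are hypothetical additions for the example.

```python
import pandas as pd

# Hypothetical frame with one text column and several numeric ones.
df = pd.DataFrame({
    "cluster": ["a", "b"],
    "budget": [11000, 200],
    "actual": [10000, 300],
    "forecast": [500, 50],
})

# Pick the numeric columns first, then sum across them for each row.
numeric_cols = df.select_dtypes(include="number").columns
df["total"] = df[numeric_cols].sum(axis=1)

print(df)
```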

1

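If 'budget' has any NaN values but you don't want the sum to be NaN, then try:

import math
import numpy as np

def fun(b, a):
    # A missing budget contributes nothing; otherwise add normally.
    if math.isnan(b):
        return a
    else:
        return b + a

f = np.vectorize(fun, otypes=[float])

df['variance'] = f(df['budget'], df['actual'])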

- R. Cox
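
A shorter way to get what appears to be the same behaviour, offered as a sketch rather than a drop-in replacement: fill the missing budget values with 0 before adding. As in the function above, a NaN in actual still produces NaN. The sample values are made up.

```python
import numpy as np
import pandas as pd

# Hypothetical data: one NaN in budget, one NaN in actual.
df = pd.DataFrame({
    "budget": [100.0, np.nan, 300.0],
    "actual": [10.0, 20.0, np.nan],
})

# Only budget is filled, so a missing actual still yields NaN.
df["variance"] = df["budget"].fillna(0) + df["actual"]

print(df)
```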

-1

eval lets you sum the columns and create the new column in one step:

In [12]: data.eval('variance = budget + actual', inplace=True)

In [13]: data
Out[13]: 
        cluster      date  budget  actual  variance
0 a  2014-01-01  00:00:00   11000   10000     21000
1 a  2014-02-01  00:00:00    1200    1000      2200
2 a  2014-03-01  00:00:00     200     100       300
3 b  2014-04-01  00:00:00     200     300       500
4 b  2014-05-01  00:00:00     400     450       850
5 c  2014-06-01  00:00:00     700    1000      1700
6 c  2014-07-01  00:00:00    1200    1000      2200
7 c  2014-08-01  00:00:00     200     100       300
8 c  2014-09-01  00:00:00     200     300       500

Because inplace=True, you don't need to assign the result back to data.

- rachwa
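
For comparison, a small sketch of the same eval call without inplace=True. In that case eval returns a new DataFrame and leaves the original as it was, which may be closer to what the asker wanted. The two-row frame and the name result are placeholders.

```python
import pandas as pd

# Hypothetical two-row sample with the question's column names.
data = pd.DataFrame({"budget": [11000, 1200], "actual": [10000, 1000]})

# Without inplace=True, eval returns a copy that has the new column.
result = data.eval("variance = budget + actual")

print(result)
print(data.columns.tolist())
```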

- Andy Hayden · Accepted Answer

I think you've misunderstood some Python syntax. The following performs two assignments:

In [11]: a = b = 1

In [12]: a
Out[12]: 1

In [13]: b
Out[13]: 1

In your code, it was as if you were doing:

sum = df['budget'] + df['actual']  # a Series
# and
df['variance'] = df['budget'] + df['actual']  # assigned to a column

The latter creates a new column for df:

In [21]: df
Out[21]:
  cluster                 date  budget  actual
0       a  2014-01-01 00:00:00   11000   10000
1       a  2014-02-01 00:00:00    1200    1000
2       a  2014-03-01 00:00:00     200     100
3       b  2014-04-01 00:00:00     200     300
4       b  2014-05-01 00:00:00     400     450
5       c  2014-06-01 00:00:00     700    1000
6       c  2014-07-01 00:00:00    1200    1000
7       c  2014-08-01 00:00:00     200     100
8       c  2014-09-01 00:00:00     200     300

In [22]: df['variance'] = df['budget'] + df['actual']

In [23]: df
Out[23]:
  cluster                 date  budget  actual  variance
0       a  2014-01-01 00:00:00   11000   10000     21000
1       a  2014-02-01 00:00:00    1200    1000      2200
2       a  2014-03-01 00:00:00     200     100       300
3       b  2014-04-01 00:00:00     200     300       500
4       b  2014-05-01 00:00:00     400     450       850
5       c  2014-06-01 00:00:00     700    1000      1700
6       c  2014-07-01 00:00:00    1200    1000      2200
7       c  2014-08-01 00:00:00     200     100       300
8       c  2014-09-01 00:00:00     200     300       500

As an aside, you shouldn't use sum as a variable name, because that shadows the built-in sum function.
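
Putting the two points of this answer together, here is a minimal sketch of the chained assignment on a made-up two-row frame, using the hypothetical name total instead of sum so the built-in stays available. It shows that the left-hand name receives the Series while df gains the column.

```python
import pandas as pd

# Hypothetical two-row sample with the question's column names.
df = pd.DataFrame({"budget": [11000, 1200], "actual": [10000, 1000]})

# Chained assignment: the right-hand side is evaluated once, then bound
# to `total` and stored as the 'variance' column of df.
total = df["variance"] = df["budget"] + df["actual"]

print(type(total).__name__)
print(df)

# The built-in sum still works because it was never shadowed.
print(sum([1, 2, 3]))
```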