Generating a new column in a pandas DataFrame using conditional statements

Question

python pandas conditional-statements calculated-columns

30

I have a pandas DataFrame that looks like this:

   portion  used
0        1   1.0
1        2   0.3
2        3   0.0
3        4   0.8

I would like to create a new column based on the used column, so that the df looks like this:

   portion  used    alert
0        1   1.0     Full
1        2   0.3  Partial
2        3   0.0    Empty
3        4   0.8  Partial

Create a new alert column based on the existing data.
If used is 1.0, alert should be Full.
If used is 0.0, alert should be Empty.
Otherwise, alert should be Partial.

What is the best way to do this?

- user3786999

Possible duplicate of [Pandas conditional creation of a series/dataframe column](https://dev59.com/os-90IgBFxS5KdRjteUd) - chrisb

6 Answers

45

Alternatively, you could do:

import pandas as pd
import numpy as np
df = pd.DataFrame(data={'portion':np.arange(10000), 'used':np.random.rand(10000)})

%%timeit
df.loc[df['used'] == 1.0, 'alert'] = 'Full'
df.loc[df['used'] == 0.0, 'alert'] = 'Empty'
df.loc[(df['used'] >0.0) & (df['used'] < 1.0), 'alert'] = 'Partial'

This gives the same output, but runs about 100 times faster on 10,000 rows:

100 loops, best of 3: 2.91 ms per loop

Then using apply (with the alert function from Ffisegydd's answer):

%timeit df['alert'] = df.apply(alert, axis=1)

1 loops, best of 3: 287 ms per loop

I guess the choice depends on how big your DataFrame is.
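For anyone who wants to try the df.loc approach on the question's four rows, here is a minimal sketch. It is a small variation on the code above, not the same code: it assumes it is acceptable to fill the column with the "else" label first and then overwrite the matching rows, which avoids writing a third mask.

```python
import pandas as pd

df = pd.DataFrame({'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})

# Start from the "else" label, then overwrite the rows that match a condition.
df['alert'] = 'Partial'
df.loc[df['used'] == 1.0, 'alert'] = 'Full'
df.loc[df['used'] == 0.0, 'alert'] = 'Empty'

print(df)
```

Note that with this variation any value outside the 0 to 1 range would also end up labelled Partial, so it only fits if used is known to stay in that range.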

- Primer

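Question on the %timeit function: if the first function did 100 loops at 2.91 ms each, is the total time then 291 ms, and thus slightly longer than the 287 ms the alert function took to complete its 1 loop? - Nate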

1

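In this case, 1 loop means running the 3 lines of code after %%timeit once. The number of loops (100 here) is chosen automatically by timeit so that it gets a more robust measurement within some reasonable "timeout" (i.e. if a single loop takes longer than this "timeout", there will be only 1 loop, as in the apply case). timeit results should be compared on a "per loop" basis. Hence the phrase "about 100 times faster": 1 loop taking 2.91 ms is roughly 100 times faster than 1 loop taking 287 ms. - Primer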
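To make the "per loop" point concrete, here is a rough sketch using the standard-library timeit module rather than the IPython magic. The function name label_rows, the 1,000-row frame and the loop count of 10 are arbitrary choices for illustration; the only point is that the total time is divided by the number of loops before comparing.

```python
import timeit

import numpy as np
import pandas as pd

df = pd.DataFrame({'portion': np.arange(1000), 'used': np.random.rand(1000)})

def label_rows():
    # Hypothetical workload to time; any of the approaches in this thread would do.
    return np.where(df['used'] == 1.0, 'Full',
                    np.where(df['used'] == 0.0, 'Empty', 'Partial'))

n_loops = 10
total = timeit.timeit(label_rows, number=n_loops)  # seconds for all loops together
per_loop = total / n_loops                         # the figure that should be compared

print(f"{n_loops} loops, {per_loop * 1e3:.3f} ms per loop ({total * 1e3:.3f} ms total)")
```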

21

Using np.where is usually fast:

In [845]: df['alert'] = np.where(df.used == 1, 'Full', 
                                 np.where(df.used == 0, 'Empty', 'Partial'))

In [846]: df
Out[846]:
   portion  used    alert
0        1   1.0     Full
1        2   0.3  Partial
2        3   0.0    Empty
3        4   0.8  Partial

Timings

In [848]: df.shape
Out[848]: (100000, 3)

In [849]: %timeit df['alert'] = np.where(df.used == 1, 'Full', np.where(df.used == 0, 'Empty', 'Partial'))
100 loops, best of 3: 6.17 ms per loop

In [850]: %%timeit
     ...: df.loc[df['used'] == 1.0, 'alert'] = 'Full'
     ...: df.loc[df['used'] == 0.0, 'alert'] = 'Empty'
     ...: df.loc[(df['used'] >0.0) & (df['used'] < 1.0), 'alert'] = 'Partial'
     ...:
10 loops, best of 3: 21.9 ms per loop

In [851]: %timeit df['alert'] = df.apply(alert, axis=1)
1 loop, best of 3: 2.79 s per loop

- Zero

1

This should be the accepted answer if your conditions are not too complex. - François Leblanc

6

Use `np.select()` for more than 2 conditions

With more than 2 conditions, as in the OP's example, np.select() is cleaner than nesting multiple levels of np.where() (and just as fast).

Either define the conditions/choices as two lists (paired element-wise) with an optional default value ("else" case):

conditions = [
    df.used.eq(0),
    df.used.eq(1),
]
choices = [
    'Empty',
    'Full',
]
df['alert'] = np.select(conditions, choices, default='Partial')

Or define the conditions/choices as a dictionary for maintainability (easier to keep them paired properly when making additions/revisions):

conditions = {
    'Empty': df.used.eq(0),
    'Full': df.used.eq(1),
}
df['alert'] = np.select(conditions.values(), conditions.keys(), default='Partial')
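As a quick sanity check, the sketch below runs the dictionary version on the question's data and compares it with the nested np.where() form. Wrapping the dict views in list() is my own precaution, not part of the snippet above.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})

conditions = {
    'Empty': df['used'].eq(0),
    'Full': df['used'].eq(1),
}

# list(...) turns the dict views into plain lists before handing them to NumPy.
selected = np.select(list(conditions.values()), list(conditions.keys()), default='Partial')

# The nested np.where() equivalent, for comparison.
nested = np.where(df['used'] == 1, 'Full', np.where(df['used'] == 0, 'Empty', 'Partial'))

df['alert'] = selected
print((selected == nested).all())
```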

`np.select()` is very fast

Timings with 5 conditions (full, high, medium, low, empty):

[Figure: perfplot timing chart. Test data: df = pd.DataFrame({'used': np.random.randint(10 + 1, size=10)}).div(10)]
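The exact cut-offs used for the five labels are not shown here, so the following is only a sketch with made-up thresholds (0.3 and 0.7 are my assumptions). It relies on np.select() taking the first condition that is true for each row, so the conditions are listed from lowest to highest.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'used': [0.0, 0.2, 0.5, 0.9, 1.0]})

# Thresholds below are illustrative assumptions, not taken from the answer.
conditions = [
    df['used'].eq(0),
    df['used'].lt(0.3),
    df['used'].lt(0.7),
    df['used'].lt(1),
]
choices = ['Empty', 'Low', 'Medium', 'High']

# Rows matching none of the conditions (here: used >= 1) get the default.
df['alert'] = np.select(conditions, choices, default='Full')
print(df)
```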

- tdy

Do you have the code or an example for producing the chart in this answer? I'd like to show it to a few people. - scarebear

1

It's a perfplot, @scarebear. - Henry Ecker

1

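Can't comment, so posting a new answer: improving on Ffisegydd's approach, you can use a dictionary and the dict.get() method to make the function passed to .apply() easier to manage: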

import pandas as pd

def alert(c):
    mapping = {1.0: 'Full', 0.0: 'Empty'}
    return mapping.get(c['used'], 'Partial')

df = pd.DataFrame(data={'portion':[1, 2, 3, 4], 'used':[1.0, 0.3, 0.0, 0.8]})

df['alert'] = df.apply(alert, axis=1)

Depending on the use case, you may prefer to define the dictionary outside of the function definition as well.
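A minimal sketch of that last suggestion, with the dictionary pulled out to module level. The name ALERT_LABELS is just a placeholder. The second assignment is a different technique, not part of the answer above: it uses Series.map() plus fillna() for the "else" case, which avoids the row-wise function altogether.

```python
import pandas as pd

# Defined once at module level so it can be reused or edited in one place.
ALERT_LABELS = {1.0: 'Full', 0.0: 'Empty'}

def alert(row):
    # Fall back to 'Partial' for any value that is not a key in the mapping.
    return ALERT_LABELS.get(row['used'], 'Partial')

df = pd.DataFrame({'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})
df['alert'] = df.apply(alert, axis=1)

# Alternative: map the column directly, then fill the unmapped rows.
df['alert_mapped'] = df['used'].map(ALERT_LABELS).fillna('Partial')
print(df)
```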

- Hansang

1

df['TaxStatus'] = np.where(df.Public == 1, True, np.where(df.Public == 2, False))

This looks like it should work, but it raises ValueError: either both or neither of x and y should be given.
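The error most likely comes from the inner np.where(), which is given a condition and only one value; np.where() wants either the condition alone or the condition plus both the "if true" and "if false" values. A sketch of one possible fix, assuming None is an acceptable result for rows where Public is neither 1 nor 2 (the sample data is made up):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'Public': [1, 2, 3]})  # hypothetical sample data

# Give the inner np.where() its third argument: the value used when
# the condition is False. None is an assumed placeholder here.
inner = np.where(df['Public'] == 2, False, None)
df['TaxStatus'] = np.where(df['Public'] == 1, True, inner)

print(df)
```

If Public can only ever be 1 or 2, the plain comparison df['Public'] == 1 already gives the True/False column without any np.where().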

- manager_matt


- Ffisegydd · Accepted Answer

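You can define a function that returns your different states, such as "Full", "Partial", "Empty", etc., and then use df.apply to apply that function to each row. Note that you have to pass the keyword argument axis=1 to ensure that the function is applied to rows.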

import pandas as pd

def alert(row):
  if row['used'] == 1.0:
    return 'Full'
  elif row['used'] == 0.0:
    return 'Empty'
  elif 0.0 < row['used'] < 1.0:
    return 'Partial'
  else:
    return 'Undefined'

df = pd.DataFrame(data={'portion':[1, 2, 3, 4], 'used':[1.0, 0.3, 0.0, 0.8]})

df['alert'] = df.apply(alert, axis=1)

#    portion  used    alert
# 0        1   1.0     Full
# 1        2   0.3  Partial
# 2        3   0.0    Empty
# 3        4   0.8  Partial
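To illustrate why axis=1 matters, here is a small sketch (the lambdas are just for demonstration). With axis=1 the function receives one row at a time and can look up values by column name; with the default axis=0 it receives one whole column at a time instead.

```python
import pandas as pd

df = pd.DataFrame({'portion': [1, 2, 3, 4], 'used': [1.0, 0.3, 0.0, 0.8]})

# axis=1: each call gets a row, so row['portion'] and row['used'] are scalars.
per_row = df.apply(lambda row: row['portion'] * row['used'], axis=1)

# axis=0 (the default): each call gets an entire column as a Series.
per_column = df.apply(lambda col: col.max())

print(per_row)
print(per_column)
```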
