The case_when function from R to Python

Question

python pandas dataframe data-analysis

20

How can I implement R's case_when function in Python code?

Here is R's case_when function:

https://www.rdocumentation.org/packages/dplyr/versions/0.7.8/topics/case_when

Suppose we have the following data frame (Python code below):

import pandas as pd
import numpy as np

data = {'name': ['Jason', 'Molly', 'Tina', 'Jake', 'Amy'], 
        'age': [42, 52, 36, 24, 73], 
        'preTestScore': [4, 24, 31, 2, 3],
        'postTestScore': [25, 94, 57, 62, 70]}
df = pd.DataFrame(data, columns = ['name', 'age', 'preTestScore', 'postTestScore'])
df

Suppose we want to create a new column called "elderly" that looks at the "age" column and does the following:

if age < 10 then baby
if age >= 10 and age < 20 then kid
if age >= 20 and age < 30 then young
if age >= 30 and age < 50 then mature
if age >= 50 then grandpa

Can someone help with this?

- msh855

5 Answers

12

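np.select is great because it is a general way to assign values from choicelist depending on which condition holds.

However, for the specific problem the OP is trying to solve, the pandas cut method achieves the same result more succinctly.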


bin_cond = [-np.inf, 10, 20, 30, 50, np.inf]            # think of them as bin edges
bin_lab = ["baby", "kid", "young", "mature", "grandpa"] # the length needs to be len(bin_cond) - 1
# right=False makes each bin [low, high), so age 10 is "kid" and age 50 is "grandpa"
df["elderly2"] = pd.cut(df["age"], bins=bin_cond, labels=bin_lab, right=False)

#     name  age  preTestScore  postTestScore  elderly elderly2
# 0  Jason   42             4             25   mature   mature
# 1  Molly   52            24             94  grandpa  grandpa
# 2   Tina   36            31             57   mature   mature
# 3   Jake   24             2             62    young    young
# 4    Amy   73             3             70  grandpa  grandpa
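
One detail worth checking with pd.cut is which side of each bin is closed. By default (right=True) the intervals are closed on the right, so an age of exactly 10 falls into the first bin, whereas right=False gives the half-open [low, high) ranges that the question describes. A minimal sketch, using a hypothetical `ages` Series of boundary values that is not part of the original data:

```python
import numpy as np
import pandas as pd

# Hypothetical boundary ages, chosen to sit exactly on the bin edges.
ages = pd.Series([9, 10, 20, 30, 50])

edges = [-np.inf, 10, 20, 30, 50, np.inf]
labels = ["baby", "kid", "young", "mature", "grandpa"]

# Default (right=True): intervals are (low, high], so 10 is still "baby".
right_closed = pd.cut(ages, bins=edges, labels=labels)

# right=False: intervals are [low, high), so 10 becomes "kid".
left_closed = pd.cut(ages, bins=edges, labels=labels, right=False)

print(right_closed.tolist())
print(left_closed.tolist())
```

The sample data in the question has no age on a bin edge, which is why both settings print the same table for it.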

- Alby

3

pyjanitor has a case_when implementation in dev that could help here. The idea behind it is inspired by if_else in pydatatable and fcase in R's data.table; under the hood, it uses pd.Series.mask:

# pip install git+https://github.com/pyjanitor-devs/pyjanitor.git
import pandas as pd
import janitor as jn

df.case_when(
    df.age.lt(10), 'baby',                      # 1st condition, result
    df.age.between(10, 20, 'left'), 'kid',      # 2nd condition, result
    df.age.between(20, 30, 'left'), 'young',    # 3rd condition, result
    df.age.between(30, 50, 'left'), 'mature',   # 4th condition, result
    'grandpa',                                  # default if none of the conditions match
    column_name='elderly')                      # column name to assign to
 
    name  age  preTestScore  postTestScore  elderly
0  Jason   42             4             25   mature
1  Molly   52            24             94  grandpa
2   Tina   36            31             57   mature
3   Jake   24             2             62    young
4    Amy   73             3             70  grandpa

Alby's solution is more efficient than an if/else construct for this use case.
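
To illustrate the pd.Series.mask idea mentioned above without installing the dev version, here is a rough pandas-only sketch. The helper name `case_when_sketch` and its signature are hypothetical and are not pyjanitor's API; it only mimics the "first matching condition wins" behaviour:

```python
import pandas as pd

df = pd.DataFrame({"age": [42, 52, 36, 24, 73]})

def case_when_sketch(cases, default):
    """Hypothetical helper: `cases` is a list of (condition, value) pairs."""
    # Start with the default everywhere, aligned to the first condition's index.
    result = pd.Series(default, index=cases[0][0].index, dtype="object")
    # Apply in reverse order so that earlier conditions overwrite later ones,
    # which gives the first matching condition priority.
    for condition, value in reversed(cases):
        result = result.mask(condition, value)
    return result

df["elderly"] = case_when_sketch(
    [
        (df["age"].lt(10), "baby"),
        (df["age"].lt(20), "kid"),     # only reached when age >= 10
        (df["age"].lt(30), "young"),   # only reached when age >= 20
        (df["age"].lt(50), "mature"),  # only reached when age >= 30
    ],
    default="grandpa",
)
print(df)
```

Because the first match wins, each condition only needs an upper bound, much like the ordered branches of R's case_when.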

- sammywemmy

1

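Just for future reference: nowadays you can use pandas cut or map with moderate to good speed. If you need something faster this may not suit your needs, but it is good enough for daily use and batch processing.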

import pandas as pd

If you want to go with map or apply, put your ranges in a function and return a label for each range:

def calc_grade(age):
    # Ranges are half-open [low, high), matching the rules in the question
    if age >= 50:
        return 'Grandpa'
    elif 30 <= age < 50:
        return 'Mature'
    elif 20 <= age < 30:
        return 'Young'
    elif 10 <= age < 20:
        return 'Kid'
    elif age < 10:
        return 'Baby'

%timeit df['elderly'] = df['age'].map(calc_grade)

    name  age  preTestScore  postTestScore  elderly
0  Jason   42             4             25   Mature
1  Molly   52            24             94  Grandpa
2   Tina   36            31             57   Mature
3   Jake   24             2             62    Young
4    Amy   73             3             70  Grandpa

393 µs ± 8.43 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

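If you want to go with cut, there are several options. One way is to include the left edge of each bin and exclude the right edge, with one label per bin.

bins = [0, 10, 20, 30, 50, 200] # 200-year-old vampires are people too, I guess... change this to an upper limit you believe plausible.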
labels = ['Baby','Kid','Young', 'Mature','Grandpa']

%timeit df['elderly'] = pd.cut(x=df.age, bins=bins, labels=labels , include_lowest=True, right=False, ordered=False)

    name  age  preTestScore  postTestScore  elderly
0  Jason   42             4             25   Mature
1  Molly   52            24             94  Grandpa
2   Tina   36            31             57   Mature
3   Jake   24             2             62    Young
4    Amy   73             3             70  Grandpa

- Hildermes José Medeiros Filho

0

Instead of numpy, you can write a function and use map or apply with a lambda:

def elderly_function(age):
    # Checks run top to bottom, so each branch only needs its upper bound
    if age < 10:
        return 'baby'
    if age < 20:
        return 'kid'
    if age < 30:
        return 'young'
    if age < 50:
        return 'mature'
    if age >= 50:
        return 'grandpa'

df["elderly"] = df["age"].map(lambda x: elderly_function(x))
# Works with apply as well:
df["elderly"] = df["age"].apply(lambda x: elderly_function(x))

A numpy-based solution may be faster and is probably preferable if your data frame is considerably large.
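
As a rough sketch of what such a vectorized numpy version could look like, np.digitize can turn each age into a bin index, which is then used to look up a label. The `edges` and `labels` arrays below are names introduced for this illustration only:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({"age": [42, 52, 36, 24, 73]})

# Upper edges of the first four ranges; anything >= 50 lands in the last slot.
edges = np.array([10, 20, 30, 50])
labels = np.array(["baby", "kid", "young", "mature", "grandpa"])

# With the default right=False, np.digitize returns i such that
# edges[i-1] <= x < edges[i]: 0 for x < 10, 1 for 10 <= x < 20, ..., 4 for x >= 50.
idx = np.digitize(df["age"].to_numpy(), edges)
df["elderly"] = labels[idx]
print(df)
```

This avoids calling a Python function once per row, which is where map and apply spend most of their time on large frames.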

- Fernando Rocha Urbano

- Alex · Accepted Answer

You want to use np.select:

conditions = [
    (df["age"].lt(10)),
    (df["age"].ge(10) & df["age"].lt(20)),
    (df["age"].ge(20) & df["age"].lt(30)),
    (df["age"].ge(30) & df["age"].lt(50)),
    (df["age"].ge(50)),
]
choices = ["baby", "kid", "young", "mature", "grandpa"]

df["elderly"] = np.select(conditions, choices)

# Results in:
#      name  age  preTestScore  postTestScore  elderly
#  0  Jason   42             4             25   mature
#  1  Molly   52            24             94  grandpa
#  2   Tina   36            31             57   mature
#  3   Jake   24             2             62    young
#  4    Amy   73             3             70  grandpa

The conditions and choices lists must be the same length.
There is also a default parameter, which is used when all conditions evaluate to False.
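
A minimal sketch of the default parameter, using hypothetical data with a missing age (not part of the question's data frame) so that one row matches no condition. Since np.select picks the first True condition for each row, the conditions here only need upper bounds:

```python
import numpy as np
import pandas as pd

# Hypothetical data: one age is missing, so no condition is True for that row.
df = pd.DataFrame({"age": [42, 5, np.nan, 24, 73]})

conditions = [
    df["age"].lt(10),
    df["age"].lt(20),
    df["age"].lt(30),
    df["age"].lt(50),
    df["age"].ge(50),
]
choices = ["baby", "kid", "young", "mature", "grandpa"]

# Comparisons against NaN are all False, so that row receives the default.
df["elderly"] = np.select(conditions, choices, default="unknown")
print(df)
```

Passing a string default also keeps it the same type as the choices. As far as I know, recent NumPy releases may raise a TypeError when string choices are combined with the implicit default of 0, so setting default explicitly is the safer habit.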