Mapping a pandas DataFrame column to a dictionary

Question

3

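I have a DataFrame with a high-cardinality categorical variable (many unique values). I want to recode that variable to a small set of values (the most frequent ones) and replace everything else with a catch-all category ("other"). A simple example:

These are the two values that should stay unchanged: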

top_values = ['apple', 'orange']

I picked them based on their frequency in the following DataFrame column:

{'fruits': {0: 'apple',
1: 'apple',
2: 'orange',
3: 'orange',
4: 'banana',
5: 'grape'}}

That column should be recoded as follows:

{'fruits': {0: 'apple',
1: 'apple',
2: 'orange',
3: 'orange',
4: 'other',
5: 'other'}}

How can I do this? (The DataFrame has millions of records.)

- Nick

2 Answers

1

df['newCol'] = df['fruits'].apply(lambda x: x if x in top_values else 'other')

- Venkatachalam


- jpp · Accepted Answer

There are at least a couple of ways to do this:

df['fruits'] = df['fruits'].where(df['fruits'].isin(top_values), 'other')

df.loc[~df['fruits'].isin(top_values), 'fruits'] = 'other'
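As a minimal sketch, both approaches can be run on the example data from the question. Deriving `top_values` with `value_counts().head(2)` is an assumption about how the asker chose them, and the names `mask`, `via_where`, and `recoded` are only illustrative.

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame(
    {'fruits': ['apple', 'apple', 'orange', 'orange', 'banana', 'grape']}
)

# Assumption: "top" means the 2 most frequent values in the column
top_values = df['fruits'].value_counts().head(2).index.tolist()

# Boolean mask: True where the value should be kept
mask = df['fruits'].isin(top_values)

# Approach 1: Series.where keeps values where the mask is True
# and substitutes 'other' everywhere else
via_where = df['fruits'].where(mask, 'other')

# Approach 2: assign 'other' to the rows outside the mask via .loc
# (done on a copy here so the original frame stays intact)
recoded = df.copy()
recoded.loc[~mask, 'fruits'] = 'other'

print(via_where.tolist())
print(recoded['fruits'].tolist())
```

Both are vectorized, so they should scale to millions of rows much better than a row-wise `apply`.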

Once this is done, you may want to convert the series to the category dtype:

df['fruits'] = df['fruits'].astype('category')

Converting before the replacement is unlikely to help, because your input series has high cardinality.
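A short sketch of that ordering, using the question's example data: recoding first leaves only the kept values plus 'other' as categories, whereas converting the raw column would create one category per distinct value. The variable names here are illustrative only.

```python
import pandas as pd

fruits = pd.Series(
    ['apple', 'apple', 'orange', 'orange', 'banana', 'grape'],
    name='fruits',
)
top_values = ['apple', 'orange']

# Converting the raw column creates a category for every distinct value
raw_categories = fruits.astype('category').cat.categories.tolist()

# Recode first, then convert: only the kept values plus 'other' remain
recoded = fruits.where(fruits.isin(top_values), 'other').astype('category')
recoded_categories = recoded.cat.categories.tolist()

print(len(raw_categories), raw_categories)
print(len(recoded_categories), recoded_categories)
```

On real data the difference is the full cardinality of the column versus `len(top_values) + 1` categories.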