Create a value_counts column in a Pandas DataFrame

Question

84

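I want to count the unique values in one column of my Pandas DataFrame and then add those counts as a new column in the original DataFrame. I've tried a couple of different things. I created a Pandas Series and calculated the counts with the value_counts method. I then tried to merge these values back into my original DataFrame, but the keys I want to merge on are in the index (ix/loc).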

Color Value
Red   100
Red   150
Blue  50

I would like to return something like the following:

Color Value Counts
Red   100   2
Red   150   2 
Blue  50    1

- user2592989

2

This has been a popular question recently. See this question, which is almost identical to your situation. - bdiamante

8 Answers

21

Another option:

z = df['Color'].value_counts()

z1 = z.to_dict() #converts to dictionary

df['Count_Column'] = df['Color'].map(z1)

This option gives you a column of repeated counts, where each row holds the frequency of its value in the 'Color' column.
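As a minimal sketch of the mapping step described above, the following uses the sample data from the question. The column names Counts_from_dict and Counts_from_series are made up for illustration; the point is only that Series.map accepts either the dictionary or the value_counts Series itself as the lookup.

```python
import pandas as pd

# Sample data taken from the question.
df = pd.DataFrame({"Color": ["Red", "Red", "Blue"], "Value": [100, 150, 50]})

# value_counts() returns a Series indexed by the unique colors.
counts = df["Color"].value_counts()

# Look up each row's color in the counts, once via a dict and once via the Series.
df["Counts_from_dict"] = df["Color"].map(counts.to_dict())      # hypothetical column name
df["Counts_from_series"] = df["Color"].map(counts)              # hypothetical column name

print(df)
```

Both columns should hold the same values, so the to_dict() conversion is optional.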

- ZakS

21

This can be simplified to: df['Count_Column'] = df['Color'].map(df['Color'].value_counts()). You can map with a Series (it does not have to be a dictionary). - sacuL

13

This answer uses Series.map with Series.value_counts. It was tested with Pandas 1.1.

df['counts'] = df['attribute'].map(df['attribute'].value_counts())

Credit: comment by sacuL

- Asclepius

5

df['Counts'] = df.Color.groupby(df.Color).transform('count')

You can do this with any Series: group it by itself and call transform('count'):

>>> series = pd.Series(['Red', 'Red', 'Blue'])
>>> series.groupby(series).transform('count')
0    2
1    2
2    1
dtype: int64

- 1''

3

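My initial thought was to use a list comprehension as shown below but, as was pointed out in the comments, this is slower than the groupby and transform method. I will leave this answer up to demonstrate WHAT NOT TO DO: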

In [94]: df = pd.DataFrame({'Color': 'Red Red Blue'.split(), 'Value': [100, 150, 50]})
In [95]: df['Counts'] = [sum(df['Color'] == df['Color'][i]) for i in range(len(df))]
In [96]: df
Out[100]: 
  Color  Value  Counts
0   Red    100       2
1   Red    150       2
2  Blue     50       1

[3 rows x 3 columns]

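@unutbu's method gets complicated for DataFrames with several columns, which makes this one simpler to code. If you are working with a small DataFrame, this is faster (see below); otherwise, you should NOT use it.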

In [97]: %timeit df = pd.DataFrame({'Color': 'Red Red Blue'.split(), 'Value': [100, 150, 50]}); df['Counts'] = df.groupby(['Color']).transform('count')
100 loops, best of 3: 2.87 ms per loop
In [98]: %timeit df = pd.DataFrame({'Color': 'Red Red Blue'.split(), 'Value': [100, 150, 50]}); df['Counts'] = [sum(df['Color'] == df['Color'][i]) for i in range(len(df))]
1000 loops, best of 3: 1.03 ms per loop

- Steven C. Howell

4

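A 3-row example is very misleading for timing. Try it on a larger DataFrame and you will see that the groupby approach is much faster: with df repeated 1000 times (df = pd.concat([df]*1000, ignore_index=True)), I got 3.6 ms (groupby) vs 29 s (list comprehension). Furthermore, I think the groupby approach is simpler. - joris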
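The comparison described in this comment can be reproduced with a rough sketch like the one below. The helper names with_groupby and with_list_comprehension are made up for illustration, and the frame is repeated only 100 times (not 1000) so that the slow variant finishes quickly. Timings depend on hardware and Pandas version, so none are shown here.

```python
import timeit

import pandas as pd

base = pd.DataFrame({"Color": "Red Red Blue".split(), "Value": [100, 150, 50]})
big = pd.concat([base] * 100, ignore_index=True)  # 300 rows


def with_groupby(frame):
    # Vectorized: one pass over the groups.
    return frame.groupby("Color")["Value"].transform("count")


def with_list_comprehension(frame):
    # One full column comparison per row, so the cost grows quadratically.
    colors = frame["Color"]
    return [int((colors == colors.iloc[i]).sum()) for i in range(len(frame))]


groupby_counts = with_groupby(big).tolist()
loop_counts = with_list_comprehension(big)
print("same result:", groupby_counts == loop_counts)

print("groupby:", timeit.timeit(lambda: with_groupby(big), number=5))
print("list comprehension:", timeit.timeit(lambda: with_list_comprehension(big), number=5))
```

Increasing the repetition factor should widen the gap between the two timings.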

0

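While there are already plenty of good answers here, I personally feel that using:

(given a DataFrame named df)

df['new_value_col'] = df.groupby('colname_to_count')['colname_to_count'].transform('count')

is one of the best and most straightforward options. I also want to offer another method that I have used successfully.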

import pandas as pd
import numpy as np

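df['new_value_col'] = df.apply(lambda row: np.sum(df['col_to_count'] == row['col_to_count']), axis=1)

Inside the lambda, the column being counted is compared against the current row's value, which yields a boolean Series; np.sum then counts how many times that value occurs.

I thought this might be useful. It's always good to have multiple options!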

- Jeff W

0

Use the nunique command with dropna to exclude NaN values. Also tested in Google Colab.

df = pd.DataFrame({'Color': ['Red', 'Red', 'Blue'], 'Value': [100, 150, 50]})
# number of distinct non-NaN Values within each Color
total_counts = df.groupby('Color')['Value'].nunique(dropna=True)
df['Counts'] = df['Color'].transform(lambda x: total_counts[x])
print(df)

To better understand nunique, read this blog.
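One caveat worth illustrating: nunique counts distinct values within each group, not rows, so it matches the row count only when the values in a group never repeat. The sketch below changes the sample data (an assumption for illustration) so that both Red rows share the same Value, which makes the two results differ.

```python
import pandas as pd

# Modified sample data: both Red rows have the same Value.
df = pd.DataFrame({"Color": ["Red", "Red", "Blue"], "Value": [100, 100, 50]})

# Distinct non-NaN Values per Color, mapped back onto each row.
distinct_per_color = df["Color"].map(df.groupby("Color")["Value"].nunique())

# Number of rows per Color, mapped back onto each row.
rows_per_color = df["Color"].map(df["Color"].value_counts())

print(distinct_per_color.tolist())
print(rows_per_color.tolist())
```

If the goal is the frequency of each Color, the value_counts or transform('count') approaches are the safer choice.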

- Nimra Tahir

0

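Create a column containing the counts of repeated values. The values are temporary calculations computed from other columns. Very fast. Credit to @ZakS.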

sum_A_B = df['A']+df['B']
sum_A_B_dict = sum_A_B.value_counts().to_dict()
df['sum_A_B'] = sum_A_B.map(sum_A_B_dict)
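To make the idea concrete, here is a small sketch with made-up data for columns A and B (the values and the column name sum_A_B_count are assumptions for illustration). It maps with the value_counts Series directly, which should behave the same as going through a dictionary.

```python
import pandas as pd

# Hypothetical data: the first three rows all sum to 4, the last sums to 5.
df = pd.DataFrame({"A": [1, 2, 3, 0], "B": [3, 2, 1, 5]})

# Temporary derived values that are never stored as their own column.
sum_a_b = df["A"] + df["B"]

# For each row, how many rows share the same A + B total.
df["sum_A_B_count"] = sum_a_b.map(sum_a_b.value_counts())

print(df)
```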

- BSalita

- unutbu · Accepted Answer

df['Counts'] = df.groupby(['Color'])['Value'].transform('count')

For example,

In [102]: df = pd.DataFrame({'Color': 'Red Red Blue'.split(), 'Value': [100, 150, 50]})

In [103]: df
Out[103]: 
  Color  Value
0   Red    100
1   Red    150
2  Blue     50

In [104]: df['Counts'] = df.groupby(['Color'])['Value'].transform('count')

In [105]: df
Out[105]: 
  Color  Value  Counts
0   Red    100       2
1   Red    150       2
2  Blue     50       1

Note that transform('count') ignores NaNs. If you want to count NaNs, use transform(len).
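The difference between the two can be seen in a short sketch. It assumes the sample data with one Value replaced by NaN; the variable names are illustrative only.

```python
import numpy as np
import pandas as pd

# Sample data with a missing Value in one of the Red rows.
df = pd.DataFrame({"Color": ["Red", "Red", "Blue"], "Value": [100, np.nan, 50]})

grouped = df.groupby("Color")["Value"]

# 'count' skips the NaN, so Red is counted once.
non_null_counts = grouped.transform("count")

# len counts every row in the group, NaN included, so Red is counted twice.
row_counts = grouped.transform(len)

print(non_null_counts.tolist())
print(row_counts.tolist())
```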

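To the anonymous editor: if you are getting an error when using transform('count'), it may be because your version of Pandas is too old. The above works with Pandas version 0.15 or newer.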