How to count the frequency of selected values in a Python pandas DataFrame

Question

Tags: python, dataframe, frequency

3

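I have a DataFrame with two columns: one holds names and the other holds string values.

My goal is to count, per name, how often selected string values occur.

I tried pandas.pivot_table and pandas.DataFrame.groupby, but I want to build a brand-new DataFrame rather than an aggregation.

For example, I have this DataFrame: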

import pandas as pd
import numpy as np

data = np.array([['John', 'x'], ['John', 'x'], ['John', 'x'], ['John', 'y'], ['John', 'y'], ['John', 'a'], 
                 ['Will', 'x'], ['Will', 'z']])

df = pd.DataFrame(data, columns=['name','str_value'])
df

This gives:

   name      str_value
0  John              x
1  John              x
2  John              x
3  John              y
4  John              y
5  John              a
6  Will              x
7  Will              z

The expected result would be:

   name        x        y        z
0  John        3        2        0 
1  Will        1        0        1

And also:

   name        x        y        z
0  John     True     True    False 
1  Will     True    False     True

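I want to keep only x, y, and z, and return False where the count is 0 or NaN and True otherwise.

Edit: Thanks for the answers. These approaches work well, but the output carries an extra "str_value" label: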

str_value     x      y      z
name
John       True   True  False
Will       True  False   True

Is there a way to remove it so that "name", "x", "y", and "z" are all on the same level? With .reset_index() I get:

str_value  name     x      y      z
0          John  True   True  False
1          Will  True  False   True

Is my index now named "str_value"? Can I rename or remove it?

- user9995348

3 Answers

2

In addition to the other excellent answers, you can chain groupby, unstack, and astype(bool) into a one-liner:

df1 = df.loc[df.str_value != 'a']  # remove 'a' as requested
df2 = df1.groupby(["name", "str_value"])["str_value"].count().unstack().fillna(False).astype(bool)
print(df2)
# str_value     x      y      z
# name
# John       True   True  False
# Will       True  False   True
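The same groupby/unstack chain can also produce the integer count table from the question. The following is a minimal sketch, not part of the original answer: it assumes the wanted values are listed in a hypothetical `wanted` variable and uses the `fill_value` argument of `unstack` so that missing combinations become 0 instead of NaN.

```python
import pandas as pd

# Same data as in the question, built directly as a DataFrame.
df = pd.DataFrame({
    "name": ["John"] * 6 + ["Will"] * 2,
    "str_value": ["x", "x", "x", "y", "y", "a", "x", "z"],
})

# Hypothetical list of values to keep (illustrative name).
wanted = ["x", "y", "z"]
filtered = df[df["str_value"].isin(wanted)]

# fill_value=0 keeps the counts as integers rather than floats with NaN.
counts = filtered.groupby(["name", "str_value"]).size().unstack(fill_value=0)

# Any non-zero count becomes True.
flags = counts.astype(bool)

print(counts)
print(flags)
```

Note that a wanted value that never appears in the data would not get a column this way; `counts.reindex(columns=wanted, fill_value=0)` is one way to add it.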

- Umar.H

1

You can try:

df.groupby(["name", "str_value"]).size().unstack()[['x', 'y', 'z']].gt(0)

Explanation:

Use groupby and size to count the occurrences of each (name, str_value) pair:

print(df.groupby(["name", "str_value"]).size())
# John  a            1
#       x            3
#       y            2
# Will  x            1
#       z            1
# dtype: int64

Use unstack to move the str_value level into columns:

print(df.groupby(["name", "str_value"]).size().unstack())
# str_value    a    x    y    z
# name
# John       1.0  3.0  2.0  NaN
# Will       NaN  1.0  NaN  1.0

Select the desired columns:

print(df.groupby(["name", "str_value"]).size().unstack()[['x', 'y', 'z']])
# str_value    x    y    z
# name
# John       3.0  2.0  NaN
# Will       1.0  NaN  1.0

Compare against 0 with gt to get booleans (NaN compares as False):

result = df.groupby(["name", "str_value"]).size().unstack()[['x', 'y', 'z']].gt(0)
print(result)
# str_value     x      y      z
# name
# John       True   True  False
# Will       True  False   True
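Regarding the follow-up in the question: "str_value" here is the name of the columns axis, not of the index, which is why it survives `.reset_index()`. A minimal sketch of one way to drop it, assuming the same data as in the question, is to clear that axis name with `rename_axis`:

```python
import pandas as pd

# Same data as in the question, built directly as a DataFrame.
df = pd.DataFrame({
    "name": ["John"] * 6 + ["Will"] * 2,
    "str_value": ["x", "x", "x", "y", "y", "a", "x", "z"],
})

result = df.groupby(["name", "str_value"]).size().unstack()[["x", "y", "z"]].gt(0)

# reset_index turns "name" into a regular column;
# rename_axis(None, axis=1) removes the "str_value" label from the columns axis.
flat = result.reset_index().rename_axis(None, axis=1)

print(flat)
```

Setting `flat.columns.name = None` should have the same effect if an in-place change is preferred.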

- Alexandre B.

- Pierre V. · Accepted Answer

Combine groupby and pivot:

total = df.groupby(["name", "str_value"]).size().reset_index(level=1, name="total")
counts = total.pivot(columns="str_value", values="total").fillna(0).drop(columns=["a"])
bools = counts > 0.0
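For reference, here is a self-contained sketch of the groupby/pivot steps run end to end on the question's data. The `astype(int)` cast is an addition, not part of the answer; it is there only so the counts print as integers, as in the expected table.

```python
import pandas as pd

# Same data as in the question, built directly as a DataFrame.
df = pd.DataFrame({
    "name": ["John"] * 6 + ["Will"] * 2,
    "str_value": ["x", "x", "x", "y", "y", "a", "x", "z"],
})

# One row per (name, str_value) pair, with "name" kept as the index.
total = df.groupby(["name", "str_value"]).size().reset_index(level=1, name="total")

# Spread str_value into columns; missing pairs become NaN, then 0.
counts = (
    total.pivot(columns="str_value", values="total")
    .fillna(0)
    .astype(int)          # added cast: pivot leaves floats once NaN appears
    .drop(columns=["a"])  # drop the value that was not requested
)
bools = counts > 0

print(counts)
print(bools)
```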