Pandas: Group by a combination of two columns in pandas 0.23.4

Question

3

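I'm fairly new to Python. I came across an SO question about grouping by a combination of two columns in pandas: "Pandas: group by a combination of two columns". Unfortunately, the accepted answer no longer works in pandas 0.23.4. The goal of that post was to find the combinations of the grouping variables and build a dictionary of the values. In other words, the group-by should ignore the order of the two grouping columns, so that ('b', 'a') falls into the same group as ('a', 'b').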

Here is the accepted answer:

import pandas as pd
from collections import Counter

d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])

d[['x', 'y']] = d[['x', 'y']].apply(sorted, axis=1)
x = d.groupby(['x', 'y']).agg(Counter)
print(x)

Here, ...apply(sorted) raises the following exception:

    raise ValueError('Must have equal len keys and value '
ValueError: Must have equal len keys and value when setting with an iterable

Here is my pandas version:

> pd.__version__
Out: '0.23.4'

After reading https://pandas.pydata.org/pandas-docs/stable/generated/pandas.DataFrame.sort_values.html, this is what I tried:

d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])

d=d.sort_values(by=['x','y'],axis=1).reset_index(drop=True)
x = d.groupby(['x', 'y']).agg(Counter)
print(x)

Unfortunately, this also raises an error:

line 1382, in _get_label_or_level_values
    raise KeyError(key)
KeyError: 'x'

Expected output:

        score           count
x   y                     
a   b   {1: 1, 3: 2}      2
    c   {2: 1}            1

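Could someone please help? It would also be great if you could show how to count the number of keys() in the score column, ideally with a vectorized solution.

I'm using Python 3.6.7.

Thanks a lot.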

- watchtower

3 Answers

1

Use:

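a = d[['x', 'y']].values  # NumPy array holding the two key columns
a.sort(axis=1)            # sort each row in place, so ('b', 'a') becomes ('a', 'b')
d[['x', 'y']] = a         # write the sorted keys back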
x = d.groupby(['x', 'y']).agg(Counter)
print(x)

Output

            score
x y              
a b  {1: 1, 3: 2}
  c        {2: 1}

- Vivek Kalyanarangan

1

Passing result_type='broadcast' to .apply() solves the problem.

>>> d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
             columns=['x', 'y', 'score'])
>>> d[['x', 'y']] = d[['x', 'y']].apply(sorted, axis=1, result_type='broadcast')
>>> x = d.groupby(['x', 'y']).agg(Counter)
>>> print(x)

            score
x y              
a b  {1: 1, 3: 2}
  c        {2: 1}

Note the difference with and without result_type='broadcast':

>>> d[['x', 'y']].apply(sorted, axis=1)

0    [a, b]
1    [a, c]
2    [a, b]
3    [a, b]
dtype: object

>>> d[['x', 'y']].apply(sorted, axis=1, result_type='broadcast')

   x  y
0  a  b
1  a  c
2  a  b
3  a  b

As you can see, result_type='broadcast' splits (broadcasts) the lists returned by .apply() into their respective columns, which allows the result to be assigned to d[['x', 'y']].
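
As a rough side-by-side sketch (not part of the original answer), the snippet below contrasts the default behaviour with result_type='expand' and result_type='broadcast'. It assumes a pandas version that supports the result_type parameter (0.23 or later); the variable names are purely illustrative.

```python
import pandas as pd

d = pd.DataFrame([('a', 'b', 1), ('a', 'c', 2), ('b', 'a', 3), ('b', 'a', 3)],
                 columns=['x', 'y', 'score'])
keys = d[['x', 'y']]

# Default: each row's sorted list becomes a single element of a Series.
as_lists = keys.apply(sorted, axis=1)

# 'expand': each list is split into columns labelled 0 and 1.
expanded = keys.apply(sorted, axis=1, result_type='expand')

# 'broadcast': the result keeps the shape and column labels of the input.
broadcast = keys.apply(sorted, axis=1, result_type='broadcast')

print(type(as_lists).__name__)
print(list(expanded.columns))
print(list(broadcast.columns))
```

Only the 'broadcast' result keeps the labels x and y, which is why it can be assigned straight back to d[['x', 'y']].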

- TrebledJ

1

Good to know :-) - TrebledJ

@Trebuchet: Thanks. This worked for me too, but note that the pandas docs state: "broadcast: Deprecated since version 0.23.0: This argument will be removed in a future version, replaced by result_type='broadcast'." - watchtower

- jezrael · Accepted Answer

The problem is that sorted returns a list, so it needs to be converted to a Series:

d[['x', 'y']] = d[['x', 'y']].apply(lambda x: pd.Series(sorted(x)), axis=1)

A faster approach, though, is to use numpy.sort with the DataFrame constructor, because apply loops under the hood:

import numpy as np

d = pd.DataFrame([('a','b',1), ('a','c', 2), ('b','a',3), ('b','a',3)],
                 columns=['x', 'y', 'score'])

d[['x', 'y']] = pd.DataFrame(np.sort(d[['x', 'y']], axis=1), index=d.index)
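
As a small sanity check (a sketch, not part of the original answer), the two approaches can be compared on the sample data. The names via_apply and via_numpy are illustrative only.

```python
import numpy as np
import pandas as pd

d = pd.DataFrame([('a', 'b', 1), ('a', 'c', 2), ('b', 'a', 3), ('b', 'a', 3)],
                 columns=['x', 'y', 'score'])

# Row-wise apply: one Python-level call to sorted() per row.
via_apply = d[['x', 'y']].apply(lambda row: pd.Series(sorted(row)), axis=1)

# Vectorized: a single np.sort call over the underlying 2-D array.
via_numpy = pd.DataFrame(np.sort(d[['x', 'y']].values, axis=1), index=d.index)

# Both produce the same sorted key pairs, row by row.
print(via_apply.values.tolist() == via_numpy.values.tolist())
```

On a four-row frame the speed difference is negligible; it should only start to matter as the number of rows grows, since apply calls the function once per row.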

Then select the column to aggregate and pass a list of aggregation functions, for example nunique to count the number of unique values:

x = d.groupby(['x', 'y'])['score'].agg([Counter, 'nunique'])
print(x)
          Counter  nunique
x y                       
a b  {1: 1, 3: 2}        2
  c        {2: 1}        1

Or use DataFrameGroupBy.size to count the rows in each group:

x = d.groupby(['x', 'y'])['score'].agg([Counter, 'size'])
print(x)
          Counter  size
x y                    
a b  {1: 1, 3: 2}     3
  c        {2: 1}     1
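
To make the link between the two counts and the Counter explicit, here is a minimal sketch (an illustration, not part of the original answer). It builds the Counter for each group by iterating over the groups; the counters dictionary is a hypothetical helper introduced only for this comparison.

```python
from collections import Counter

import numpy as np
import pandas as pd

d = pd.DataFrame([('a', 'b', 1), ('a', 'c', 2), ('b', 'a', 3), ('b', 'a', 3)],
                 columns=['x', 'y', 'score'])
d[['x', 'y']] = np.sort(d[['x', 'y']].values, axis=1)

grouped = d.groupby(['x', 'y'])['score']
stats = grouped.agg(['nunique', 'size'])

# Build one Counter per group explicitly, keyed by the (x, y) tuple.
counters = {key: Counter(scores) for key, scores in grouped}

for key, counter in counters.items():
    # len(counter) is the number of keys; sum(counter.values()) is the row count.
    print(key, dict(counter), len(counter), sum(counter.values()))
```

Here nunique matches the number of keys in each Counter, which is the count shown in the expected output of the question, while size matches the sum of the Counter's values.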