Pandas: how to count occurrences of a categorical feature grouped by ID

5
Suppose I have this DataFrame df:
My_ID   My_CAT
  1       A
  2       B  
  3       C
  1       A  
  1       B 
  2       D 

For each distinct My_ID, I want to know how many times each My_CAT value occurs.

I need the answer as a dense array, like this:

My_ID   A    B    C   D
  1     2    1    0   0
  2     0    1    0   1
  3     0    0    1   0

I tried:

df.groupby(['My_ID','My_CAT']).count()

However, although the data is grouped the way I need, the occurrences are not counted.
2 Answers

5

Use crosstab (less typing, but slower):

df = pd.crosstab(df['My_ID'], df['My_CAT'])
print (df)
My_CAT  A  B  C  D
My_ID             
1       2  1  0  0
2       0  1  0  1
3       0  0  1  0

A faster solution uses groupby + the aggregating function size + unstack:
df = df.groupby(['My_ID','My_CAT']).size().unstack(fill_value=0)
print (df)
My_CAT  A  B  C  D
My_ID             
1       2  1  0  0
2       0  1  0  1
3       0  0  1  0

Finally, reset the index and drop the columns name:

df = df.reset_index().rename_axis(None, axis=1)
print (df)
   My_ID  A  B  C  D
0      1  2  1  0  0
1      2  0  1  0  1
2      3  0  0  1  0
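For reference, the steps above can be run end to end on the question's data. This is a minimal sketch: the DataFrame construction and the variable names `counts`, `result` and `xt` are assumptions added for illustration, since the snippets above reuse the name `df` at every step.

```python
import pandas as pd

# Sample data from the question.
df = pd.DataFrame({
    'My_ID': [1, 2, 3, 1, 1, 2],
    'My_CAT': ['A', 'B', 'C', 'A', 'B', 'D'],
})

# Count rows per (My_ID, My_CAT) pair, then pivot My_CAT into columns.
counts = df.groupby(['My_ID', 'My_CAT']).size().unstack(fill_value=0)

# Flatten to the layout requested in the question.
result = counts.reset_index().rename_axis(None, axis=1)
print(result)

# crosstab should give the same table of counts.
xt = pd.crosstab(df['My_ID'], df['My_CAT'])
print((counts.to_numpy() == xt.to_numpy()).all())
```

Keeping the intermediate results under separate names avoids overwriting the original `df`, which makes it easier to try both approaches in the same session.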

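Note:

See also: What is the difference between size and count in pandas? In short, size counts every row in a group, NaN included, whereas count counts only the non-null values in each column.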
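The difference can be seen on a tiny example. This is a sketch under stated assumptions: the `demo` frame and its `key` and `val` columns are hypothetical and not part of the question.

```python
import numpy as np
import pandas as pd

# Hypothetical data: group 'x' has one missing value in 'val'.
demo = pd.DataFrame({'key': ['x', 'x', 'y'],
                     'val': [1.0, np.nan, 3.0]})
grouped = demo.groupby('key')
sizes = grouped.size()            # rows per group, NaN included
counts = grouped['val'].count()   # non-null 'val' entries per group
print(sizes)
print(counts)

# With the question's data, both columns are grouping keys, so count()
# has no remaining column to report on.
df = pd.DataFrame({'My_ID': [1, 2, 3, 1, 1, 2],
                   'My_CAT': ['A', 'B', 'C', 'A', 'B', 'D']})
empty = df.groupby(['My_ID', 'My_CAT']).count()
print(empty.shape)
```

This is likely why the attempt in the question showed the groups but no numbers: count() reports per remaining column, and none is left once both columns are used as keys.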

Timings (larger data):

np.random.seed(123)
N = 100000
L = list('abcdefghijklmno')
df = pd.DataFrame({'My_CAT': np.random.choice(L, N),
                   'My_ID':np.random.randint(1000,size=N)})
print (df)

In [79]: %timeit pd.crosstab(df['My_ID'], df['My_CAT'])
10 loops, best of 3: 96.7 ms per loop

In [80]: %timeit df.groupby(['My_ID','My_CAT']).size().unstack(fill_value=0)
100 loops, best of 3: 14.2 ms per loop

In [81]: %timeit pd.get_dummies(df.My_CAT).groupby(df.My_ID).sum()
10 loops, best of 3: 25.5 ms per loop

In [82]: %timeit df.groupby('My_ID').My_CAT.value_counts().unstack(fill_value=0)
10 loops, best of 3: 25.4 ms per loop

In [136]: %timeit xtab_df(df, 'My_ID', 'My_CAT')
100 loops, best of 3: 4.23 ms per loop

In [137]: %timeit xtab(df, 'My_ID', 'My_CAT')
100 loops, best of 3: 4.61 ms per loop

2

pd.get_dummies with groupby

pd.get_dummies(df.My_CAT).groupby(df.My_ID).sum().reset_index()

   My_ID  A  B  C  D
0      1  2  1  0  0
1      2  0  1  0  1
2      3  0  0  1  0
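To see why this works, it may help to look at the intermediate indicator table. The sketch below assumes the question's data and passes `dtype=int` to get_dummies so the indicators print as 0/1; recent pandas versions otherwise default to booleans, which sum to the same counts.

```python
import pandas as pd

df = pd.DataFrame({
    'My_ID': [1, 2, 3, 1, 1, 2],
    'My_CAT': ['A', 'B', 'C', 'A', 'B', 'D'],
})

# One indicator column per category, one row per original row.
dummies = pd.get_dummies(df['My_CAT'], dtype=int)
print(dummies)

# Summing the indicators within each My_ID yields the counts.
counts = dummies.groupby(df['My_ID']).sum()
print(counts)
```

Each row of `dummies` contains a single 1, so the per-group sum is the number of rows in that group with the given category.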

groupby with value_counts

df.groupby('My_ID').My_CAT.value_counts() \
  .unstack(fill_value=0).rename_axis(None, axis=1).reset_index()

   My_ID  A  B  C  D
0      1  2  1  0  0
1      2  0  1  0  1
2      3  0  0  1  0

factorize with numba
This is my experimental proposal.

from numba import njit
import pandas as pd
import numpy as np

@njit
def xtab_array(f1, f2, m, n):
    # m x n table of zeros; v[i, j] counts rows with codes (i, j)
    v = np.zeros((m, n), dtype=np.int64)
    for i in range(f1.size):
        v[f1[i], f2[i]] += 1
    return v

def xtab_df(df, c1, c2):
    # integer codes plus unique values for each column
    f1, u1 = pd.factorize(df[c1].values)
    f2, u2 = pd.factorize(df[c2].values)
    v = xtab_array(f1, f2, u1.size, u2.size)
    # label the first column with c1 rather than a hard-coded name
    return pd.DataFrame(
        np.column_stack([u1, v]), columns=[c1] + u2.tolist()
    )

xtab_df(df, 'My_ID', 'My_CAT')

   My_ID  A  B  C  D
0      1  2  1  0  0
1      2  0  1  0  1
2      3  0  0  1  0
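For readers without numba, the two ingredients can be looked at separately. The sketch below shows what pd.factorize returns for the question's data and then swaps the compiled loop for np.add.at. That substitution is an assumption made for illustration, not part of the proposal above, and it is not claimed to be as fast.

```python
import numpy as np
import pandas as pd

ids = np.array([1, 2, 3, 1, 1, 2])
cats = np.array(['A', 'B', 'C', 'A', 'B', 'D'], dtype=object)

# factorize maps each value to an integer code (in order of first
# appearance) and returns the unique values alongside the codes.
f1, u1 = pd.factorize(ids)
f2, u2 = pd.factorize(cats)
print(f1, u1)
print(f2, u2)

# Same accumulation as the loop, done with np.add.at, which handles
# repeated (row, column) index pairs correctly.
v = np.zeros((u1.size, u2.size), dtype=np.int64)
np.add.at(v, (f1, f2), 1)
print(v)
```

A plain `v[f1, f2] += 1` would not work here, because repeated index pairs would be incremented only once.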

Pure numpy

def xtab(df, c1, c2):
    f1, u1 = pd.factorize(df[c1].values)
    f2, u2 = pd.factorize(df[c2].values)
    n, m = u1.size, u2.size
    # flatten (row, col) codes to i * m + j; minlength pads unseen trailing
    # cells and keeps the counts as integers
    v = np.bincount(f1 * m + f2, minlength=n * m).reshape(n, m)
    return pd.DataFrame(
        np.column_stack([u1, v]), columns=[c1] + u2.tolist()
    )

xtab(df, 'My_ID', 'My_CAT')

   My_ID  A  B  C  D
0      1  2  1  0  0
1      2  0  1  0  1
2      3  0  0  1  0
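The index arithmetic behind the bincount approach can be illustrated in isolation. In this sketch the code arrays are written out by hand, on the assumption that they match what pd.factorize produces for the question's data, and the `minlength` argument of np.bincount is used to pad cells that never occur.

```python
import numpy as np

# Integer codes for rows (n = 3 IDs) and columns (m = 4 categories).
f1 = np.array([0, 1, 2, 0, 0, 1])
f2 = np.array([0, 1, 2, 0, 1, 3])
n, m = 3, 4

# Row-major flattening: cell (i, j) lives at position i * m + j.
flat = f1 * m + f2
print(flat)

# Count each flat position, padding to n * m so the reshape is safe.
v = np.bincount(flat, minlength=n * m).reshape(n, m)
print(v)
```

Without padding, bincount returns an array only as long as the largest flat index plus one, so the reshape would fail whenever the last cells of the table are empty.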

Timings
Small data

%timeit pd.crosstab(df['My_ID'], df['My_CAT'])
%timeit df.groupby(['My_ID','My_CAT']).size().unstack(fill_value=0)
%timeit pd.get_dummies(df.My_CAT).groupby(df.My_ID).sum()
%timeit df.groupby('My_ID').My_CAT.value_counts().unstack(fill_value=0)
%timeit xtab_df(df, 'My_ID', 'My_CAT')
%timeit xtab(df, 'My_ID', 'My_CAT')

100 loops, best of 3: 5.21 ms per loop
1000 loops, best of 3: 1.23 ms per loop
1000 loops, best of 3: 1.2 ms per loop
1000 loops, best of 3: 1.23 ms per loop
1000 loops, best of 3: 280 µs per loop
1000 loops, best of 3: 298 µs per loop
Larger data from @jezrael

np.random.seed(123)
N = 100000
L = list('abcdefghijklmno')
df = pd.DataFrame({'My_CAT': np.random.choice(L, N),
                   'My_ID':np.random.randint(1000,size=N)})

%timeit pd.crosstab(df['My_ID'], df['My_CAT'])
%timeit df.groupby(['My_ID','My_CAT']).size().unstack(fill_value=0)
%timeit pd.get_dummies(df.My_CAT).groupby(df.My_ID).sum()
%timeit df.groupby('My_ID').My_CAT.value_counts().unstack(fill_value=0)
%timeit xtab_df(df, 'My_ID', 'My_CAT')
%timeit xtab(df, 'My_ID', 'My_CAT')

10 loops, best of 3: 82.6 ms per loop
100 loops, best of 3: 10.7 ms per loop
100 loops, best of 3: 15.6 ms per loop
10 loops, best of 3: 19.9 ms per loop
100 loops, best of 3: 3.01 ms per loop
100 loops, best of 3: 3.22 ms per loop
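The `%timeit` lines above only work inside IPython. A rough equivalent with the standard-library timeit module might look like the sketch below. The `candidates` dictionary, its labels and the repeat counts are assumptions chosen for illustration, and absolute numbers will differ by machine and pandas version.

```python
import timeit

import numpy as np
import pandas as pd

np.random.seed(123)
N = 100000
L = list('abcdefghijklmno')
df = pd.DataFrame({'My_CAT': np.random.choice(L, N),
                   'My_ID': np.random.randint(1000, size=N)})

# Candidate implementations, keyed by an illustrative label.
candidates = {
    'crosstab': lambda: pd.crosstab(df['My_ID'], df['My_CAT']),
    'size + unstack': lambda: (
        df.groupby(['My_ID', 'My_CAT']).size().unstack(fill_value=0)),
    'value_counts': lambda: (
        df.groupby('My_ID')['My_CAT'].value_counts().unstack(fill_value=0)),
}

# Best of 3 repeats, 5 calls each, reported per call in milliseconds.
timings = {}
for name, fn in candidates.items():
    best = min(timeit.repeat(fn, number=5, repeat=3)) / 5
    timings[name] = best * 1e3
    print(f'{name}: {timings[name]:.2f} ms per loop')
```

Taking the minimum over repeats follows the "best of 3" convention used in the output above.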

Content provided by Stack Overflow.