Fastest way to count values based on multiple columns in Pandas

Question

3

Given a data frame like this:

import numpy as np
import pandas as pd

df = pd.DataFrame({'id' : ['1', '2', '3', '4', '5'],
                   'c1' : ['a', 'a', 'a', 'a', 'a'],
                   'c2' : ['a', 'a', 'b', 'c', 'c']})
df
Out[5]: 
  id c1 c2
0  1  a  a
1  2  a  a
2  3  a  b
3  4  a  c
4  5  a  c

I want to add a new column holding the count of each (c1, c2) value combination. My current code is:

df['count'] = df.groupby(['c1', 'c2'], dropna=False)['id'].transform('count')

df['result'] = np.where(df['count'] > 1, True, False)

df
Out[7]: 
  id c1 c2  count  result
0  1  a  a      2    True
1  2  a  a      2    True
2  3  a  b      1   False
3  4  a  c      2    True
4  5  a  c      2    True

Is there a faster way?

- Henri Marteville

3 Answers

1

Try

from collections import Counter
c = Counter(zip(df['c1'],df['c2']))
pd.DataFrame.from_dict(c,orient='index', columns = ['count'])

According to my %%timeit, it appears to be about an order of magnitude faster...

- ntg

How would you create a new DataFrame column from this? - ignoring_gravity

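Changed it to produce a result data frame... In any case, unless you are doing something extreme, I would suggest going with the approach @Andreas proposed. Either way, avoid duplicating information. - ntg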
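
One possible way to write the Counter result back onto the original frame as a column, as a minimal sketch. The list comprehension lookup is an assumption of this sketch and not part of the answer above, and it assumes c1 and c2 contain no missing values.

```python
from collections import Counter

import pandas as pd

df = pd.DataFrame({'id': ['1', '2', '3', '4', '5'],
                   'c1': ['a', 'a', 'a', 'a', 'a'],
                   'c2': ['a', 'a', 'b', 'c', 'c']})

# Build one (c1, c2) tuple per row and count how often each tuple occurs.
pairs = list(zip(df['c1'], df['c2']))
counts = Counter(pairs)

# Look up each row's tuple so the new column lines up with the rows of df.
df['count'] = [counts[pair] for pair in pairs]
df['result'] = df['count'] > 1
```

This keeps the row order of df and avoids building a separate frame that would then have to be merged back.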

1

You can try this:

cols = ['c1', 'c2']
df = df.merge(df[cols].value_counts().rename('count'), on=cols, how='left')
df['result'] = df['count'].gt(1)

- Andreas

- ignoring_gravity · Accepted Answer

I don't think there is a faster way, but np.where is unnecessary: you can use df['count'] > 1 directly, which is easier to read :)

This won't change the execution speed, though. I think you are already pretty fast. Is it not fast enough?

In [13]: %%timeit
    ...: df['result'] = np.where(df['count'] > 1, True, False)
    ...: 
    ...: 
225 µs ± 16.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [14]: %%timeit
    ...: df['result'] = df['count'] > 1
    ...: 
    ...: 
225 µs ± 11.4 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
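
If only the boolean column is needed and not the count itself, one option not mentioned in this thread is DataFrame.duplicated with keep=False, which marks every row whose (c1, c2) combination appears more than once. This is a sketch only; it has not been timed against the snippets above.

```python
import pandas as pd

df = pd.DataFrame({'id': ['1', '2', '3', '4', '5'],
                   'c1': ['a', 'a', 'a', 'a', 'a'],
                   'c2': ['a', 'a', 'b', 'c', 'c']})

# keep=False flags all members of a duplicated group, not just the repeats,
# which matches the meaning of "count > 1" in the question.
df['result'] = df.duplicated(subset=['c1', 'c2'], keep=False)
```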

Edit

Based on the other answer, it seems Counter can provide a speedup:

In [27]: %%timeit
    ...: combined = df['c1'].astype(str)+df['c2'].astype(str)
    ...: df['count'] = combined.map(Counter(combined))
    ...: 
    ...: 
374 µs ± 6.8 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)

In [28]: %%timeit
    ...: df['count'] = df.groupby(['c1', 'c2'], dropna=False)['id'].transform('count')
    ...: 
    ...: 
781 µs ± 38.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
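
The timings above are on the five-row example frame, where fixed per-call overhead probably dominates, so the ranking may differ on real data. A rough sketch for repeating the comparison on a larger frame follows. The size n, the random data, and the helper names with_groupby and with_counter are hypothetical choices for illustration. No numbers are quoted here because they depend on the machine and the pandas version.

```python
import timeit
from collections import Counter

import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 100_000  # hypothetical size; adjust to resemble the real data

big = pd.DataFrame({
    'id': np.arange(n).astype(str),
    'c1': rng.choice(list('abc'), size=n),
    'c2': rng.choice(list('abc'), size=n),
})

def with_groupby(frame):
    # The approach from the question.
    return frame.groupby(['c1', 'c2'], dropna=False)['id'].transform('count')

def with_counter(frame):
    # Counter over (c1, c2) tuples, looked up again row by row.
    pairs = list(zip(frame['c1'], frame['c2']))
    counts = Counter(pairs)
    return pd.Series([counts[pair] for pair in pairs], index=frame.index)

# Check that both approaches agree before comparing their speed.
assert (with_groupby(big).to_numpy() == with_counter(big).to_numpy()).all()

for fn in (with_groupby, with_counter):
    seconds = timeit.timeit(lambda: fn(big), number=5) / 5
    print(f"{fn.__name__}: {seconds * 1e3:.2f} ms per call")
```

Using tuples as keys, instead of concatenating the two columns as strings, avoids treating pairs such as ('ab', 'c') and ('a', 'bc') as the same key.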