Python:计算Pandas系列中值的累积出现次数

7

我有一个类似这样的DataFrame:

    fruit
0  orange
1  orange
2  orange
3    pear
4  orange
5   apple
6   apple
7    pear
8    pear
9  orange

我想添加一列来统计每个值的累积出现次数,即:
    fruit  cum_count
0  orange          1
1  orange          2
2  orange          3
3    pear          1
4  orange          4
5   apple          1
6   apple          2
7    pear          2
8    pear          3
9  orange          5

目前我是这样做的:

df['cum_count'] = [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]

当我处理10行数据时,这种方法是可行的,但当我尝试用数百万行数据做同样的事情时,会花费很长时间。有没有更高效的方法来处理这个问题?

2个回答

6
您可以使用 groupbycumcount 进行分组与计数:
df['cum_count'] = df.groupby('fruit').cumcount() + 1

In [16]: df
Out[16]:
    fruit  cum_count
0  orange          1
1  orange          2
2  orange          3
3    pear          1
4  orange          4
5   apple          1
6   apple          2
7    pear          2
8    pear          3
9  orange          5

时间控制

In [8]: %timeit [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
100 loops, best of 3: 3.76 ms per loop

In [9]: %timeit df.groupby('fruit').cumcount() + 1
1000 loops, best of 3: 926 µs per loop

所以它的速度快了4倍。

抱歉,为什么要在你的时序中使用我的解决方案?请添加你自己的解决方案或者移除时序。 - jezrael
@jezrael 对不起,是我复制错了。 - Anton Protopopov

3
也许更好的方法是使用groupbycumcount,并指定列名。这是一种更有效的方式。
df['cum_count'] = df.groupby('fruit' )['fruit'].cumcount() + 1
print df

    fruit  cum_count
0  orange          1
1  orange          2
2  orange          3
3    pear          1
4  orange          4
5   apple          1
6   apple          2
7    pear          2
8    pear          3
9  orange          5

比较 len(df) = 10,我的解决方案是最快的:

In [3]: %timeit df.groupby('fruit')['fruit'].cumcount() + 1
The slowest run took 11.67 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 299 µs per loop

In [4]: %timeit df.groupby('fruit').cumcount() + 1
The slowest run took 12.78 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 921 µs per loop

In [5]: %timeit [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
The slowest run took 4.47 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 2.72 ms per loop

比较 len(df) = 10k

In [7]: %timeit df.groupby('fruit')['fruit'].cumcount() + 1
The slowest run took 4.65 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 845 µs per loop

In [8]: %timeit df.groupby('fruit').cumcount() + 1
The slowest run took 5.59 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 1.59 ms per loop

In [9]: %timeit [(df.fruit[0:i+1] == x).sum() for i, x in df.fruit.iteritems()]
1 loops, best of 3: 5.12 s per loop

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接