Pandas quantile function is very slow

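I want to compute quantiles/percentiles on a Pandas DataFrame. However, the function is very slow. I repeated the calculation with NumPy and found that Pandas takes almost 10,000 times longer!
Does anyone know why? Should I compute the quantiles with NumPy and then build a new DataFrame, instead of using Pandas?
See the code below: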
import time

import numpy as np
import pandas as pd

# Quantiles to compute, as fractions in [0, 1]
q = np.array([0.1, 0.4, 0.6, 0.9])
data = np.random.randn(10000, 4)
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])

# Time the row-wise quantile computation in Pandas
time1 = time.time()
pandas_quantiles = df.quantile(q, axis=1)
time2 = time.time()
print('Pandas took %0.3f ms' % ((time2 - time1) * 1000.0))

# Time the equivalent NumPy call (np.percentile takes percentages, hence q * 100)
time1 = time.time()
numpy_quantiles = np.percentile(data, q * 100, axis=1)
time2 = time.time()
print('Numpy took %0.3f ms' % ((time2 - time1) * 1000.0))

# Check that both approaches return identical values
print((pandas_quantiles.values == numpy_quantiles).all())
# Output:
# Pandas took 15337.531 ms
# Numpy took 1.653 ms
# True

The current implementation is quite inefficient. Please file an issue here with a reproducible example. A pull request with a fix is welcome! - Jeff
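
For Pandas versions affected by the slowdown, the workaround raised in the question (compute with NumPy, then wrap the result in a new DataFrame) might look like the sketch below. The helper name `row_quantiles` is hypothetical and not part of Pandas. The sketch assumes all columns are numeric and contain no NaN values, because `np.percentile` propagates NaN while `DataFrame.quantile` skips it.

```python
import numpy as np
import pandas as pd


def row_quantiles(df, q):
    """Hypothetical helper: per-row quantiles computed with NumPy.

    The result is laid out like df.quantile(q, axis=1): one row per
    quantile, one column per original row label.
    Assumes numeric data without NaN values.
    """
    q = np.asarray(q, dtype=float)
    # np.percentile expects percentages in [0, 100], not fractions in [0, 1]
    values = np.percentile(df.values, q * 100, axis=1)
    return pd.DataFrame(values, index=q, columns=df.index)


df = pd.DataFrame(np.random.randn(10000, 4), columns=['a', 'b', 'c', 'd'])
result = row_quantiles(df, [0.1, 0.4, 0.6, 0.9])
print(result.shape)  # (4, 10000)
```

Setting the quantiles as the index and the original row labels as the columns keeps the output aligned with what `df.quantile(q, axis=1)` returns, so the two can be swapped without changing downstream code.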
1 Answer


This has been fixed in recent versions of Pandas running on Python 3.

On small arrays, Pandas is now less than twice as slow as NumPy, and on large arrays the difference is about 5%.

With Pandas 0.24.1 and Python 3, I got the following output:

import time

import numpy as np
import pandas as pd

# Quantiles to compute, as fractions in [0, 1]
q = np.array([0.1, 0.4, 0.6, 0.9])
data = np.random.randn(10000, 4)
df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])

# Time the row-wise quantile computation in Pandas
time1 = time.time()
pandas_quantiles = df.quantile(q, axis=1)
time2 = time.time()
print('Pandas took %0.3f ms' % ((time2 - time1) * 1000.0))

# Time the equivalent NumPy call (np.percentile takes percentages, hence q * 100)
time1 = time.time()
numpy_quantiles = np.percentile(data, q * 100, axis=1)
time2 = time.time()
print('Numpy took %0.3f ms' % ((time2 - time1) * 1000.0))

# Check that both approaches return identical values
print((pandas_quantiles.values == numpy_quantiles).all())
# Output:
# Pandas took 3.415 ms
# Numpy took 2.040 ms
# True
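
A single `time.time()` measurement can be noisy at the millisecond scale, so the comparison may be easier to reproduce with `timeit`. The following is only a sketch: the helper `best_ms`, the repeat counts, and the two array sizes are arbitrary choices for illustration, and the numbers it prints will vary by machine and by Pandas/NumPy version.

```python
import timeit

import numpy as np
import pandas as pd

q = np.array([0.1, 0.4, 0.6, 0.9])


def best_ms(func, repeat=5, number=3):
    """Hypothetical helper: best-of-`repeat` time per call, in milliseconds."""
    return min(timeit.repeat(func, repeat=repeat, number=number)) / number * 1000.0


timings = {}
for n_rows in (100, 10000):  # sizes picked arbitrarily for illustration
    data = np.random.randn(n_rows, 4)
    df = pd.DataFrame(data, columns=['a', 'b', 'c', 'd'])
    pandas_ms = best_ms(lambda: df.quantile(q, axis=1))
    numpy_ms = best_ms(lambda: np.percentile(data, q * 100, axis=1))
    timings[n_rows] = (pandas_ms, numpy_ms)
    print('%6d rows: Pandas %0.3f ms, NumPy %0.3f ms' % (n_rows, pandas_ms, numpy_ms))
```

Taking the minimum over several repeats reduces the influence of one-off effects such as import caching or background load, which matters when the two timings are within a factor of two of each other.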
