我有一组数据框,每个数据框有4列和1,000,000行。对于每一行,我想运行一个超几何检验,以这些列的4个值作为输入,并返回一个p值(使用超几何分布的累积概率密度函数)。
我尝试了两种基于SciPy的实现(如下),但都无法很好地扩展。是否有其他方法可以更有效地实现我所做的事情?我有一个在R中编写的工作解决方案(在底部),但不幸的是,代码必须用Python编写,因为它将用于从Postgres DB加载数据的Airflow任务,目前没有适用于R的Postgres hook。
较慢的SciPy实现
样本数据是这样创建的(使用10,000行而不是完整的52 * 1,000,000行):
import numpy as np
import pandas as pd
from scipy.stats import hypergeom
from timeit import default_timer as timer
n_rows = 10000
n_total = 1000
max_good = 400
max_sample = 200
s = 100
df = pd.DataFrame({
'ngood': np.random.hypergeometric(ngood=max_good, nbad=n_total - max_good,
nsample=s, size=n_rows),
'nsamp': np.random.hypergeometric(ngood=max_sample, nbad=n_total - max_sample,
nsample=s, size=n_rows)
})
df = df.assign(kgood=np.array([
np.random.hypergeometric(ngood=ngood, nbad=n_total - ngood,
nsample=nsamp, size=1)
for ngood, nsamp
in zip(df.ngood, df.nsamp)
]))
基于for推导式的慢速实现:
start = timer()
res = [
hypergeom.cdf(k=ngood_found, M=n_total, n=ngood, N=nsamp)
for ngood_found, ngood, nsamp
in zip(df.kgood, df.ngood, df.nsamp)
]
end = timer()
print(res[0:10])
print("Elapsed time: %fs" % (end - start))
[0.44247900002512713, 0.71587318053768023, 0.97215178135616498]
Elapsed time: 2.405838s
基于NumPy向量化的缓慢实现:
vectorized_test = np.vectorize(hypergeom.cdf, otypes=[np.float], excluded='M')
start = timer()
res = vectorized_test(k=df.kgood.values, M=n_total,
n=df.ngood.values, N=df.nsamp.values)
end = timer()
print(res[0:10])
print("Elapsed time: %fs" % (end - start))
[ 0.442479 0.71587318 0.97215178]
Elapsed time: 2.518952s
快速的R实现
以上计算可以在毫秒内完成,这表明phyper
在C级别上进行了向量化处理,而不是像我所知道的那样基本上是Python循环的numpy向量化。
library(microbenchmark)
n_rows <- 10000
n_total <- 1000
max_good <- 400
max_sample <- 200
s <- 100
df <- data.frame(
ngood = rhyper(nn=n_rows, m=max_good, n=n_total - max_good, k=s),
nsamp = rhyper(nn=n_rows, m=max_sample, n=n_total - max_sample, k=s)
)
df$kgood <- rhyper(nn=n_rows, m=df$ngood, n=n_total - df$ngood, k=df$nsamp)
microbenchmark(
res <- phyper(q = df$k, m = df$ngood, n = n_total - df$ngood, k=df$nsamp)
)
Unit: milliseconds
expr
phyper(q = df$k, m = df$ngood, n = n_total - df$ngood, k = df$nsamp)
min lq mean median uq max neval
2.984852 3.00838 3.350509 3.134745 3.439138 5.462694 100