Numba guvectorize with target='parallel' is slower than with target='cpu'

Question


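I have been trying to optimize a piece of Python code that involves computations on large multi-dimensional arrays, but I am getting puzzling results with Numba. My machine is a mid-2015 MBP with a 2.5 GHz quad-core i7, running OS X 10.10.5 and Python 2.7.11. Consider the following code: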

 import numpy as np
 from numba import jit, vectorize, guvectorize
 import numexpr as ne
 import timeit

 def add_two_2ds_naive(A,B,res):
     for i in range(A.shape[0]):
         for j in range(B.shape[1]):
             res[i,j] = A[i,j]+B[i,j]

 @jit
 def add_two_2ds_jit(A,B,res):
     for i in range(A.shape[0]):
         for j in range(B.shape[1]):
             res[i,j] = A[i,j]+B[i,j]

 @guvectorize(['float64[:,:],float64[:,:],float64[:,:]'],
    '(n,m),(n,m)->(n,m)',target='cpu')
 def add_two_2ds_cpu(A,B,res):
     for i in range(A.shape[0]):
         for j in range(B.shape[1]):
             res[i,j] = A[i,j]+B[i,j]

 @guvectorize(['(float64[:,:],float64[:,:],float64[:,:])'],
    '(n,m),(n,m)->(n,m)',target='parallel')
 def add_two_2ds_parallel(A,B,res):
     for i in range(A.shape[0]):
         for j in range(B.shape[1]):
             res[i,j] = A[i,j]+B[i,j]

 def add_two_2ds_numexpr(A,B,res):
     # Note: this rebinds the local name only; the caller's res is not modified.
     res = ne.evaluate('A+B')

 if __name__=="__main__":
     np.random.seed(69)
     A = np.random.rand(10000,100)
     B = np.random.rand(10000,100)
     res = np.zeros((10000,100))

I can now run timeit on the various functions:

%timeit add_two_2ds_jit(A,B,res)
1000 loops, best of 3: 1.16 ms per loop

%timeit add_two_2ds_cpu(A,B,res)
1000 loops, best of 3: 1.19 ms per loop

%timeit add_two_2ds_parallel(A,B,res)
100 loops, best of 3: 6.9 ms per loop

%timeit add_two_2ds_numexpr(A,B,res)
1000 loops, best of 3: 1.62 ms per loop

It seems that 'parallel' is not even using the majority of a single core: top shows Python at about 40% CPU for 'parallel', about 100% for 'cpu', and about 300% for numexpr.

- Brian Pollack

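But the whole point of guvectorize is that the operation you define gets applied over any _extra_ dimensions (that is the part that is done in parallel). The code you write is not itself parallelized. So if you changed A, B and res to have shape (10000,100,100), the 100 different iterations from the third dimension would run in parallel. - DavidW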

Thanks, I see that I misunderstood the usage. - Brian Pollack
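The distinction between core dimensions and extra (loop) dimensions raised in the comment above can be checked without Numba. The sketch below is only an illustration: it uses NumPy's `np.vectorize` with a gufunc-style signature, which follows the same core-dimension convention, and simply counts how often the inner function is called. The names `add_2d`, `add_gufunc_like` and `count_calls` are made up for this example. Note that under this convention the core dimensions are matched against the trailing axes, so the extra axis is the leading one.

```python
import numpy as np

calls = {"count": 0}

def add_2d(a, b):
    # Core operation: receives one (n, m) block per call.
    calls["count"] += 1
    return a + b

# np.vectorize with a signature is pure Python and not parallel; it is
# used here only to count calls under the gufunc dimension rules.
add_gufunc_like = np.vectorize(add_2d, signature="(n,m),(n,m)->(n,m)")

def count_calls(shape):
    # Reset the counter, run once, and report the number of inner calls.
    calls["count"] = 0
    x = np.ones(shape)
    out = add_gufunc_like(x, x)
    assert out.shape == shape
    return calls["count"]

calls_2d = count_calls((6, 4))     # input matches the core signature exactly
calls_3d = count_calls((5, 6, 4))  # one extra leading axis of length 5
print(calls_2d, calls_3d)
```

With these shapes the counter should report one call for the 2-D input and five for the 3-D input; only the latter gives a parallel target separate pieces of work to distribute.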

1 Answer


- Stan Seibert · Accepted Answer

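There are two issues with your @guvectorize implementations. The first is that you are doing all the looping inside the @guvectorize kernel, so there is actually nothing left for the Numba parallel target to parallelize. Both @vectorize and @guvectorize parallelize over the broadcast dimensions of a ufunc/gufunc. Since the signature of your gufunc is 2D and your inputs are also 2D, there is only a single call to the inner function, which explains why you saw no more than 100% CPU usage.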

The best way to write the function above is as a regular ufunc:

@vectorize('(float64, float64)', target='parallel')
def add_ufunc(a, b):
    return a + b

On my system, I then see the following speeds:

%timeit add_two_2ds_jit(A,B,res)
1000 loops, best of 3: 1.87 ms per loop

%timeit add_two_2ds_cpu(A,B,res)
1000 loops, best of 3: 1.81 ms per loop

%timeit add_two_2ds_parallel(A,B,res)
The slowest run took 11.82 times longer than the fastest. This could mean that an intermediate result is being cached 
100 loops, best of 3: 2.43 ms per loop

%timeit add_two_2ds_numexpr(A,B,res)
100 loops, best of 3: 2.79 ms per loop

%timeit add_ufunc(A, B, res)
The slowest run took 9.24 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 2.03 ms per loop

This is a system very similar to your OS X machine, but running OS X 10.11.

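Although Numba's parallel ufunc now beats numexpr (and I see add_ufunc using about 280% CPU), it does not beat the simple single-threaded CPU case. I suspect the bottleneck is memory (or cache) bandwidth, but I have not done the measurements to check this.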

Generally speaking, you will see much more benefit from the parallel ufunc target if you do more math operations per memory element (a cosine, say).