Numba function slower than C++, and loop reordering causes a 10x slowdown

Question

python performance numba

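The following code simulates extracting binary words from different locations within a set of images.

The Numba-wrapped function wordcalc in the code below has 2 problems:

1. It is 3 times slower than a similar implementation in C++.

2. Most strangely, if you switch the order of the "ibase" and "ibit" for loops, speed drops by a factor of 10 (!). This does not happen in the C++ implementation, which is unaffected by the loop order.

I'm using Numba 0.18.2 from WinPython 2.7.

What could be causing this?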

import numpy as np
import numba as nb

imDim = 80
numInsts = 10**4
numInstsSub = numInsts // 4  # integer division (not used below)
bitsNum = 13

Xs = np.random.rand(numInsts, imDim**2)
iInstInds = np.arange(0, numInsts, 4)  # every 4th instance
baseInds = np.arange(imDim**2 - imDim*20 + 1)
ofst1 = np.random.randint(0, imDim*20, bitsNum)
ofst2 = np.random.randint(0, imDim*20, bitsNum)

@nb.jit(nopython=True)
def wordcalc(Xs, iInstInds, baseInds, ofst, bitsNum, newXz):
    count = 0
    for i in iInstInds:
        Xi = Xs[i]        
        for ibit in range(bitsNum):
            for ibase in range(baseInds.shape[0]):                    
                u = Xi[baseInds[ibase] + ofst[0, ibit]] > Xi[baseInds[ibase] + ofst[1, ibit]]
                newXz[count, ibase] = newXz[count, ibase] | np.uint16(u * (2**ibit))
        count += 1
    return newXz

ret = wordcalc(Xs, iInstInds, baseInds, np.array([ofst1, ofst2]), bitsNum, np.zeros((iInstInds.size, baseInds.size), dtype=np.uint16))

- Leo

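My guess is that the performance difference when the loop order changes is related to cache memory. - lakshayg

@LakshayGarg That was my thought too, but the C++ implementation is completely insensitive to it. - Leo

It's very unlikely, but maybe the compiler is smart enough to do the optimization for you. Which compiler are you using? - lakshayg

@LakshayGarg Visual Studio 2012 - Leo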

1 Answer

- DavidW · Accepted Answer

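I get a 4x speed-up by changing np.uint16(u * (2**ibit)) to np.uint16(u << ibit), i.e. replacing the power of 2 with a bit shift, which is equivalent for integers. It seems reasonably likely that your C++ compiler makes this substitution itself.

Swapping the order of the two loops makes only a small difference for me, both with your original version (5%) and with my optimized version (15%), so I don't think I can make a useful comment on that.

If you really want to compare Numba and C++, you can look at the compiled Numba function by setting os.environ['NUMBA_DUMP_ASSEMBLY'] = '1' before importing Numba. Reading the resulting assembly is clearly quite involved, though.

For reference, I'm using Numba 0.19.1.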
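As a rough sketch of the bit-shift idea applied to the whole function: the version below is a variation rather than the exact one-line change, since it also hoists the shifted mask out of the inner loop and sets the bit with a branch. The name wordcalc_shift, the mask hoisting, and the plain-Python fallback when Numba is not installed are all assumptions made for illustration; timings will differ between Numba versions.

```python
import numpy as np

try:
    import numba as nb
    jit = nb.njit  # equivalent to nb.jit(nopython=True)
except ImportError:
    # Assumption for illustration: fall back to plain Python so the
    # sketch still runs (slowly) when Numba is not available.
    def jit(func):
        return func


@jit
def wordcalc_shift(Xs, iInstInds, baseInds, ofst, bitsNum, newXz):
    # Hypothetical variant of wordcalc: the power of two 2**ibit is
    # replaced by the shift 1 << ibit, computed once per bit.
    count = 0
    for i in iInstInds:
        Xi = Xs[i]
        for ibit in range(bitsNum):
            mask = np.uint16(1 << ibit)  # same value as 2**ibit
            o1 = ofst[0, ibit]
            o2 = ofst[1, ibit]
            for ibase in range(baseInds.shape[0]):
                b = baseInds[ibase]
                # Set bit ibit of the word when the comparison is true.
                if Xi[b + o1] > Xi[b + o2]:
                    newXz[count, ibase] |= mask
        count += 1
    return newXz
```

It is called exactly like the original wordcalc, with a zero-initialised uint16 output array as the last argument.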