为什么np.hypot和np.subtract.outer比原始广播快很多？使用Numba在并行计算中加速numpy用于距离矩阵计算

Question

为什么np.hypot和np.subtract.outer比原始广播快很多？使用Numba在并行计算中加速numpy用于距离矩阵计算

pythonnumpyvectorizationnumpy-ndarraydistance-matrix

6

我有两个大的2D点集，需要计算距离矩阵。我希望它快速且使用python实现，因此我使用了numpy。最近我学到了numpy广播，并使用了它，而不是在python中循环，numpy会在C中进行计算。

我一直认为广播就是我所需要的，直到我看到其他方法比香草广播更好用，我有两种计算距离矩阵的方式，但我不明白为什么其中一种比另一种更好。

我在这里查找了https://github.com/numpy/numpy/issues/14761，并得到了矛盾的结果。

以下是两种计算距离矩阵的方法：

3、4和8、9都可以计算距离矩阵，但使用subtract.outer的3+4要比使用香草广播的8快得多，使用hypot的6要比简单的9快得多。我没有在Python循环中尝试，因为我认为它永远无法完成。

我想知道：

1. 是否有更快的方法来计算距离矩阵（也许是scikit-learn或scipy）？

2. hypot和subtract.outer为什么如此快？

为方便起见，我还附上了运行整个过程的代码片段，并更改了seed以防止缓存重用。

### Cell 1
import numpy as np

np.random.seed(858442)

### Cell 2
%%time
obs = np.random.random((50000, 2))
interp = np.random.random((30000, 2))

CPU times: user 2.02 ms, sys: 1.4 ms, total: 3.42 ms
Wall time: 1.84 ms

### Cell 3
%%time
d0 = np.subtract.outer(obs[:,0], interp[:,0])

CPU times: user 2.46 s, sys: 1.97 s, total: 4.42 s
Wall time: 4.42 s

### Cell 4
%%time
d1 = np.subtract.outer(obs[:,1], interp[:,1])

CPU times: user 3.1 s, sys: 2.7 s, total: 5.8 s
Wall time: 8.34 s

### Cell 5
%%time
h = np.hypot(d0, d1)

CPU times: user 12.7 s, sys: 24.6 s, total: 37.3 s
Wall time: 1min 6s

### Cell 6
np.random.seed(773228)

### Cell 7
%%time
obs = np.random.random((50000, 2))
interp = np.random.random((30000, 2))

CPU times: user 1.84 ms, sys: 1.56 ms, total: 3.4 ms
Wall time: 2.03 ms

### Cell 8
%%time
d = obs[:, np.newaxis, :] - interp
d0, d1 = d[:, :, 0], d[:, :, 1]

CPU times: user 22.7 s, sys: 8.24 s, total: 30.9 s
Wall time: 33.2 s

### Cell 9
%%time
h = np.sqrt(d0**2 + d1**2)

CPU times: user 29.1 s, sys: 2min 12s, total: 2min 41s
Wall time: 6min 10s

感谢Jérôme Richard 在这里的更新

Stackoverflow从未让人失望
使用numba有一种更快的方法。
Numba拥有即时编译器，它将把Python代码片段转换为快速的机器码。第一次使用它比后续使用要慢一点，因为需要编译。但是，甚至对于(49000,12000)矩阵，即使在第一次使用njit并行功能时，也比hypot和subtract.outer快9倍。

各种方法的性能表现

确保每次运行脚本时使用不同的种子(seed)

import sys
import time

import numba as nb
import numpy as np

np.random.seed(int(sys.argv[1]))

d0 = np.random.random((49000, 2))
d1 = np.random.random((12000, 2))

def f1(d0, d1):
    print('Numba without parallel')
    res = np.empty((d0.shape[0], d1.shape[0]), dtype=d0.dtype)
    for i in nb.prange(d0.shape[0]):
        for j in range(d1.shape[0]):
            res[i, j] = np.sqrt((d0[i, 0] - d1[j, 0])**2 + (d0[i, 1] - d1[j, 1])**2)
    return res

# Add eager compilation, compiles before hand
@nb.njit((nb.float64[:, :], nb.float64[:, :]), parallel=True)
def f2(d0, d1):
    print('Numba with parallel')
    res = np.empty((d0.shape[0], d1.shape[0]), dtype=d0.dtype)
    for i in nb.prange(d0.shape[0]):
        for j in range(d1.shape[0]):
            res[i, j] = np.sqrt((d0[i, 0] - d1[j, 0])**2 + (d0[i, 1] - d1[j, 1])**2)
    return res

def f3(d0, d1):
    print('hypot + subtract.outer')
    np.hypot(
        np.subtract.outer(d0[:,0], d1[:,0]),
        np.subtract.outer(d0[:,1], d1[:,1])
    )

if __name__ == '__main__':
    s1 = time.time()
    eval(f'{sys.argv[2]}(d0, d1)')
    print(time.time() - s1)

(base) ~/xx@xx:~/xx$ python3 test.py 523432 f3
hypot + subtract.outer
9.79756784439087
(base) xx@xx:~/xx$ python3 test.py 213622 f2
Numba with parallel
0.3393140316009521

我将在进一步的开发和如果我找到更快的方法时更新此帖子

- eroot163pi

2个回答

0

感谢Jérôme Richard 这里的更新

Stackoverflow从未让人失望
有一种更快的方法，使用numba
它具有即时编译器，可以将Python代码转换为快速的机器码，第一次使用时会比后续使用稍慢，因为它需要编译。但即使对于(49000, 12000)矩阵，第一次使用njit并行计算也比hypot + subtract.outer快9倍

各种方法的性能

确保每次运行脚本时使用不同的种子

import sys
import time

import numba as nb
import numpy as np

np.random.seed(int(sys.argv[1]))

d0 = np.random.random((49000, 2))
d1 = np.random.random((12000, 2))

def f1(d0, d1):
    print('Numba without parallel')
    res = np.empty((d0.shape[0], d1.shape[0]), dtype=d0.dtype)
    for i in nb.prange(d0.shape[0]):
        for j in range(d1.shape[0]):
            res[i, j] = np.sqrt((d0[i, 0] - d1[j, 0])**2 + (d0[i, 1] - d1[j, 1])**2)
    return res

# Add eager compilation, compiles before hand
@nb.njit((nb.float64[:, :], nb.float64[:, :]), parallel=True)
def f2(d0, d1):
    print('Numba with parallel')
    res = np.empty((d0.shape[0], d1.shape[0]), dtype=d0.dtype)
    for i in nb.prange(d0.shape[0]):
        for j in range(d1.shape[0]):
            res[i, j] = np.sqrt((d0[i, 0] - d1[j, 0])**2 + (d0[i, 1] - d1[j, 1])**2)
    return res

def f3(d0, d1):
    print('hypot + subtract.outer')
    np.hypot(
        np.subtract.outer(d0[:,0], d1[:,0]),
        np.subtract.outer(d0[:,1], d1[:,1])
    )

if __name__ == '__main__':
    s1 = time.time()
    eval(f'{sys.argv[2]}(d0, d1)')
    print(time.time() - s1)

(base) ~/xx@xx:~/xx$ python3 test.py 523432 f3
hypot + subtract.outer
9.79756784439087
(base) xx@xx:~/xx$ python3 test.py 213622 f2
Numba with parallel
0.3393140316009521

我会在进一步的发展和如果我找到更快的方法时更新这篇文章。

- eroot163pi

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Jérôme Richard · Accepted Answer

首先，d0和d1各占用了50000 x 30000 x 8 = 12 GB的空间，非常大。请确保您拥有超过100 GB的内存，因为整个脚本需要使用这么多！这是一个巨大的内存量。如果您没有足够的内存，操作系统将使用存储设备（例如交换分区）来存储多余的数据，这会明显变慢。实际上，Cell-4比Cell-3慢没有任何原因，我猜想您已经没有足够的内存来完全在RAM中存储d1，而d0似乎可以（基本）适应内存。当两者都可以放入RAM中时，在我的机器上没有区别（可以反转操作顺序以进行检查）。这也解释了为什么后续操作往往会变得更慢。

话虽如此，单元格8+9也会变慢，因为它们创建了临时数组，并且需要更多的内存传输来计算结果，比单元格3+4+5还要多。实际上，表达式np.sqrt(d0**2 + d1**2)首先在内存中计算d0**2，结果是一个新的12 GB临时数组，然后计算d1**2，又得到另一个12 GB的临时数组，接着进行这两个临时数组的求和，产生另一个新的12 GB临时数组，最后计算平方根，又得到12 GB临时数组。这可能需要48 GB的内存，并需要4次读写内存绑定的传递。这不是有效的方法，不能有效地使用CPU/RAM（例如CPU缓存）。

有一种更快的实现方式，可以使用Numba的JIT在并行处理中一次完成整个计算。以下是一个例子：

import numba as nb
@nb.njit(parallel=True)
def distanceMatrix(a, b):
    res = np.empty((a.shape[0], b.shape[0]), dtype=a.dtype)
    for i in nb.prange(a.shape[0]):
        for j in range(b.shape[0]):
            res[i, j] = np.sqrt((a[i, 0] - b[j, 0])**2 + (a[i, 1] - b[j, 1])**2)
    return res

这个实现方式只使用了三倍少的内存(仅12 GB)，而且比使用subtract.outer的实现方式要快得多。确实，由于交换，单元格3+4+5需要几分钟，而这个实现方式只需要1.3秒!

结论是，访问内存和临时数组都是代价高昂的。当处理大型缓冲区时，需要避免在内存中进行多次传递，并在执行的计算不是简单的情况下（例如使用数组块）利用CPU缓存。

为什么np.hypot和np.subtract.outer比原始广播快很多？使用Numba在并行计算中加速numpy用于距离矩阵计算

感谢Jérôme Richard在这里的更新

各种方法的性能表现

我将在进一步的开发和如果我找到更快的方法时更新此帖子

感谢Jérôme Richard 这里的更新

各种方法的性能

感谢Jérôme Richard 在这里的更新