I tried to parallelize a simple loop using numba and prange. For some reason, the code gets slower as I use more threads. Why does this happen? (CPU: Ryzen 7 2700X, 8 cores, 16 threads, 3.7 GHz)
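import numpy as np
from numba import njit, prange, set_num_threads, get_num_threads

@njit(parallel=True, fastmath=True)
def test1():
    x = np.empty((10, 10))
    # Only the outer loop is parallel: its 10 iterations are split across the threads.
    for i in prange(10):
        for j in range(10):
            x[i, j] = i + j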
Number of threads : 1
897 ns ± 18.3 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 2
1.68 µs ± 262 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 3
2.4 µs ± 163 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 4
4.12 µs ± 294 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 5
4.62 µs ± 283 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 6
5.01 µs ± 145 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 7
5.52 µs ± 194 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 8
4.85 µs ± 140 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 9
6.47 µs ± 348 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 10
6.88 µs ± 120 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 11
7.1 µs ± 154 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 12
7.47 µs ± 159 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 13
7.91 µs ± 160 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 14
9.04 µs ± 472 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 15
9.74 µs ± 581 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
Number of threads : 16
11 µs ± 967 ns per loop (mean ± std. dev. of 10 runs, 100000 loops each)
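... x, whereas the parallelized version takes time x/10 + L, where L is the latency introduced by multithreading, assuming you have at least 10 cores (10 threads on 1 core will not finish a compute-bound task any faster than 1 thread). The multithreading latency is the time it takes to distribute the work to the threads and wait for them to finish. - Jérôme Richard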
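To put rough numbers on the x/10 + L model from the comment, here is a small sketch that plugs in a few of the timings from the question. The helper names (`estimated_overhead_ns`, `break_even_serial_ns`) are made up for illustration, and capping the worker count at 10 (the number of `prange` iterations) is my own assumption. The model is idealized: it ignores that the CPU has only 8 physical cores, so treat the results as order-of-magnitude estimates rather than measurements.

```python
# Timings copied from the question, in nanoseconds, keyed by thread count.
measured_ns = {1: 897, 2: 1680, 4: 4120, 8: 4850, 10: 6880, 16: 11000}

# prange(10) has only 10 iterations, so at most 10 threads get any work.
# (Assumption made for this sketch, not something stated in the question.)
MAX_USEFUL_WORKERS = 10


def estimated_overhead_ns(serial_ns, parallel_ns, workers):
    """Solve parallel = serial / workers + L for the latency L."""
    return parallel_ns - serial_ns / workers


def break_even_serial_ns(overhead_ns, workers):
    """Serial time x above which x / workers + L < x, i.e. threads pay off."""
    return overhead_ns * workers / (workers - 1)


serial_ns = measured_ns[1]
for threads, parallel_ns in measured_ns.items():
    if threads == 1:
        continue  # no threading overhead to estimate for the serial run
    workers = min(threads, MAX_USEFUL_WORKERS)
    overhead = estimated_overhead_ns(serial_ns, parallel_ns, workers)
    needed = break_even_serial_ns(overhead, workers)
    print(f"{threads:2d} threads: L ~ {overhead:7.0f} ns, "
          f"serial work must exceed ~ {needed:7.0f} ns to benefit")
```

With 10 threads, for example, the estimated L is about 6.8 µs, so under this model the serial loop would have to take more than roughly 7.5 µs before threading could help. The actual serial time is about 0.9 µs, which is why every added thread only adds overhead here.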