Python：并行运行嵌套循环，2D移动窗口

Question

Python：并行运行嵌套循环，2D移动窗口

python-3.xperformancenumpyparallel-processingnested-loops

4

我处理地形数据。对于一个特定的问题，我用Python编写了一个函数，它使用一个特定大小的移动窗口来遍历矩阵（高程图）。然后我必须对这个窗口执行分析，并将窗口中心的单元格设置为一个结果值。

我的最终输出是一个与原始矩阵相同大小的矩阵，根据我的分析进行了改变。这个问题需要11小时才能在一个小区域内运行，所以我认为并行化内循环会加速运行。或者，可能还有一种聪明的向量化解决方案...

请见下面的函数，DEM是一个2D numpy数组，w是窗口的大小。

def RMSH_det(DEM, w):
    import numpy as np
    from scipy import signal
    [nrows, ncols] = np.shape(DEM)

    #create an empty array to store result
    rms = DEM*np.nan

#    nw=(w*2)**2
#    x = np.arange(0,nw)

    for i in np.arange(w+1,nrows-w):


        for j in np.arange(w+1,ncols-w):

            d1 = np.int64(np.arange(i-w,i+w))
            d2 = np.int64(np.arange(j-w,j+w))

            win = DEM[d1[0]:d1[-1],d2[0]:d2[-1]]

            if np.max(np.isnan(win)) == 1:
                rms[i,j] = np.nan

            else:
                win = signal.detrend(win, type = 'linear')
                z = np.reshape(win,-1)
                nz = np.size(z)
                rootms = np.sqrt(1 / (nz - 1) * np.sum((z-np.mean(z))**2))
                rms[i,j] = rootms


    return(rms)

我在SO/SE上寻找解决问题的方案，发现了许多嵌套for循环并尝试并行运行它们的例子。我一直在努力调整我的代码以匹配这些示例，并希望得到一些帮助。解决此问题的方法将有助于我处理其他几个移动窗口函数。

到目前为止，我已经将内部循环移动到了自己的函数中，可以从外部循环中调用：

def inLoop(i, w, DEM,rms,ncols):
        for j in np.arange(w+1,ncols-w):

            d1 = np.int64(np.arange(i-w,i+w))
            d2 = np.int64(np.arange(j-w,j+w))

            win = DEM[d1[0]:d1[-1],d2[0]:d2[-1]]

            if np.max(np.isnan(win)) == 1:
                rms[i,j] = np.nan

            else:
                win = signal.detrend(win, type = 'linear')
                z = np.reshape(win,-1)
                nz = np.size(z)
                rootms = np.sqrt(1 / (nz - 1) * np.sum((z-np.mean(z))**2))
                rms[i,j] = rootms


        return(rms)

但我不确定如何正确编码调用带有必须输入内部循环的变量的池。请参见以下外部循环：

 for i in np.arange(w+1,nrows-w):
        number_of_workers = 8

        with Pool(number_of_workers) as p:
            #call the pool
            p.starmap(inLoop, [i, w, DEM, rms, ncols])

剩余的问题：

这段代码是否可以通过并行化进行优化？
如何成功地存储并行化嵌套循环的结果？

- morrismc

你能否具体说明一下（无法在此处发布MCVE -- 不可重现）？你的用例明确值是什么，如 [nrows, ncols]，dem.dtype，dem.flags 和 **w**，你的目标运行时间是多少小时，你的解决方案设计是什么？ - user3666197

1

[nrows, ncols] = 1355, 1165 dem.dtype = float32 dem.flags = C_CONTIGUOUS: True F_CONTIGUOUS: False OWNDATA: True WRITEABLE: True ALIGNED: True WRITEBACKIFCOPY: False UPDATEIFCOPY: False w = 10目前运行需要11小时，我希望在六小时内或更快地看到结果。 - morrismc

是否允许使用Numba解决方案？只需进行一些重写即可在不到10秒的时间内完成。 - max9111

@max9111，您能否检查下面代码中有关性能提示的注释，并调整您提出的代码，以免通过赋值“穿过”np.view()（仅限）破坏DEM中的原始数据，并且在此纠正后 - 您是否介意在更新的DEM比例上进行测试？ 13k1 x 11k4 x 4B float32 dtype？感谢您的重新考虑和numba编译的scipy.signal.detrend()技巧。 非常感谢您的付出，先生。 - user3666197

这个昨天已经被纠正了。我还添加了一个使用正规方程的更快但不太精确的解决方案。 - max9111

显示剩余2条评论

2个回答

1

问题 : 这个问题在一个小区域上运行需要11小时，... _{请持续关注，我们可以做到并且会在20分钟内解决!!}

鉴于提供了必要的解释，我感谢原始帖子的作者：

# DEM.shape = [nrows, ncols] = [ 1355, 1165 ]
# DEM.dtype = float32 
#    .flags = C_CONTIGUOUS    : True
#             F_CONTIGUOUS    : False
#             OWNDATA         : True
#             WRITEABLE       : True
#             ALIGNED         : True
#             WRITEBACKIFCOPY : False
#             UPDATEIFCOPY    : False

我尝试审查代码并设置一个更高效的模型，然后再加入流行且易于使用的numpy + numba工具，目前使用仅限于numpy的结果在样本为[100,100] DEM网格上工作，所需时间约为~ 6 [s]，核窗口宽度为w = 10。

同样，对于[200,200] DEM网格，所需时间不到~ 36 [s]，显然，缩放为~ O(N^2)。

对于[1000,1000] DEM网格，所需时间不到~ 1077 [s] ~ 17.6 [min]，哇！

正在进行关于[1000,1000] DEM网格的.jit测试，测试完成后将更新此帖子，并且一旦numba.jit()代码享受进一步加速的结果，将会更新。

到目前为止，还是相当有前途的，是吧？

如果你现在 @morrismc 在一个 [100,100] 的矩阵上测试你的原始代码，我们已经可以猜测在运行测试完成之前所达到的主要加速比范围。

>>> pass;    import numpy as np
>>> from zmq import Stopwatch; clk = Stopwatch()
>>>
>>> size =  100; demF32 = np.random.random( ( size, size ) ).astype( np.float32 ); resF32 = demF32.copy(); clk.start(); _ = RMSH_det( demF32, 10, resF32 ); t = clk.stop(); print( "{1:>13d} [us]\nNumOf_np.nan-s was {0:d}".format( _, t ) )
      6492192 [us]
NumOf_np.nan-s was 0

>>> size =  200; demF32 = np.random.random( ( size, size ) ).astype( np.float32 ); resF32 = demF32.copy(); clk.start(); _ = RMSH_det( demF32, 10, resF32 ); t = clk.stop(); print( "{1:>13d} [us]\nNumOf_np.nan-s was {0:d}".format( _, t ) )
     35650629 [us]
NumOf_np.nan-s was 0

>>> size = 1000; demF32 = np.random.random( ( size, size ) ).astype( np.float32 ); resF32 = demF32.copy(); clk.start(); _ = RMSH_det( demF32, 10, resF32 ); t = clk.stop(); print( "{1:>13d} [us]\nNumOf_np.nan-s was {0:d}".format( _, t ) )
   1058702889 [us]
NumOf_np.nan-s was 0

所有这些都在 scipy 1.2.1 上完成，因此没有从 1.3.1 可能的进一步加速中获益。

`numba.jit()` LLVM编译代码。哎呀，速度变慢了？

numba.jit()加速在[100,100] DEM网格上显示出比指定签名（因此没有产生任何临时分析成本）和nogil = True（“0.43.1+0.g8dabe7abe.dirty”不是最新的）更差的运行时间，大约慢了200 [ms]。

猜测这里没有更多可获得的收益，除非将游戏移入已编译的Cython领域，但只有几十分钟而不是几十小时，Alea Iacta Est-仅使用numpy智能向量化代码！

结论：

如果原始算法是正确的（源代码中留下了一些疑问，以便进行进一步的改进工作），任何尝试运行其他形式的[PARALLEL]代码执行流都不会有所帮助（内核窗口[w，w]是DEM网格内存布局的非常小且不连续的区域，内存I/O成本在这里占主导地位，一些更好的索引可能会提高缓存线重用率，但整体工作量远远超出预算，因为从~ 11 [hrs]到~ 6 [hrs]的目标已经超过成功达成，~ 20 [min]可用于[1300,1100] float32 DEM网格的运行时间。

由于在[DOC.me]，[TEST.me]和[PERF.me] QA阶段的所有附加教学价值，代码被保留为原样（非PEP-8），因此所有PEP主义者都要忍受O / P作者对全屏宽度布局的看法，以便了解为什么并改进代码，如果剥离注释，则会失去前进改进代码性能的方向。谢谢。

@jit( [ "int32( float32[:,:], int32, float32[:,:] )", ], nogil    = True )                  # numba.__version__ '0.43.1+0.g8dabe7abe.dirty'
def RMSH_det_jit( DEMf32, w, rmsRESULTf32 ):                            # pre-allocate rmsRESULTf32[:,:] externally
    #import numpy as np
    #from scipy import signal
    #
    # [nrows, ncols] = np.shape( DEM )                                  # avoid ~ [ 1355, 1165 ]
    #                                                                   # DEM.dtype = float32 
    #                                                                   #    .flags = C_CONTIGUOUS    : True
    #                                                                   #             F_CONTIGUOUS    : False
    #                                                                   #             OWNDATA         : True
    #                                                                   #             WRITEABLE       : True
    #                                                                   #             ALIGNED         : True
    #                                                                   #             WRITEBACKIFCOPY : False
    #                                                                   #             UPDATEIFCOPY    : False
    #
    rmsRESULTf32[:,:] = np.nan                                          #        .STO[:,:] np.nan-s, using in-place assignment into the by-ref passed, externally pre-allocated np.ndarray
    dtdWIN            = np.ones( ( 2 * w - 1,                           #        .ALLOC once, re-use 1M+ times
                                   2 * w - 1 ) )
    a_div_by_nz_minus1 = 1. / ( dtdWIN.size - 1  )                      #        .SET float CONST with about a ~1M+ re-use
    a_num_of_NaNs      = 0                                              #        .SET i4 bonus value, ret'd as a side-effect of the signature ... 
    # rms = DEM*np.nan                                                  # avoid ( pre-alloc rmsRESULTf32 ) externally create and pass a right-sized, empty array to store all results
    # nw  = ( w * 2 )**2
    # x   = np.arange( 0, nw )

    #                        11..1344
    #or     i in np.arange( w+1,           nrows-w ):                   # w ~ 10 -> [11:1344, 11:1154]
    for     i in np.arange( w+1, DEMf32.shape[0]-w ):                   #         ??? never touches DEM-row/column[0]?? or off-by-one indexing error ???
        fromI = i - w                                                   #        .UPD ALAP
        tillI = i + w - 1                                               #        .UPD ALAP upper bound index excluded ( this is how a code in [ np.arange(...)[0]:np.arange(...)[-1] ] works )
        #                    11..1154
        #or j in np.arange( w+1,           ncols-w ):
        for j in np.arange( w+1, DEMf32.shape[1]-w ):
            fromJ = j - w                                               #        .UPD ALAP
            tillJ = j + w - 1                                           #        .UPD ALAP upper bound index excluded ( this is how a code in [ np.arange(...)[0]:np.arange(...)[-1] ] works )
            #                       1..1334:21..1354                    #         ??? never touches first/last DEM-row/column??
            # d1 = np.int64( np.arange( i-w, i+w ) )                    # AVOID: 1M+ times allocated, yet never consumed, but their edge values
            # d2 = np.int64( np.arange( j-w, j+w ) )                    # AVOID: 1M+ times allocated, yet never consumed, but their edge values

            # win = DEM[ d1[0]:d1[-1],                                  # AVOID: while a .view-only, no need to 1M+ times instantiate a "kernel"-win(dow] ( this will create a np.view into the original DEM, not a copy ! )
            #            d2[0]:d2[-1]                                   # ?.or.?   NOT a .view-only, but a new .copy() instantiated, so as to call .detrend() w/o in-place modifying DEMf32 ???
            #            ]                                              # ?.or.?   NOT a .view-only, but a new .copy() instantiated, so as to call .detrend() w/o in-place modifying DEMf32 ???
            dtdWIN[:,:] = DEMf32[fromI:tillI, fromJ:tillJ]              #          NOT a .view-only, but a     .copy() re-populated into a just once and only once pre-allocated dtdWIN, via an in-place copy
            #f np.max( np.isnan(    win ) ) == 1:                       # AVOID: 1M+ times full-range scan, while any first np.nan decides the game and no need to scan "the rest"
            if np.any( np.isnan( dtdWIN ) ):                            #        "density" of np.nan-s determine, if this is a good idea to pre-store
               a_num_of_NaNs += 1                                       # .INC
               continue                                                 #        .NOP/LOOP from here, already pre-stored np.nan-s for this case
               # rms[i,j] = np.nan                                      # DUP ( already stored in initialisation ... )
            else:
               #in    = signal.detrend(    win, type = 'linear' )       # REALLY?: in-place modification of DEM-matrix ???
               dtdWIN = signal.detrend( dtdWIN, type = 'linear'   )     #    in scipy-v1.3.1+ can mod in-place,   overwrite_data = True ) # REMOVE OLS-fit-linear trend
               dtdWIN = signal.detrend( dtdWIN, type = 'constant' )     #    in scipy-v1.3.1+ can mod in-place,   overwrite_data = True ) # REMOVE mean
               #z  = np.reshape( win, -1 )                              # AVOID:~1M+ re-counting constant value, known from w directly
               #nz = np.size( z )                                       # AVOID:~1M+ re-counting constant value, known from w directly
               #rootms    = np.sqrt( 1 / ( nz - 1 ) * np.sum( ( z - np.mean( z ) )**2 ) )
               #rms[i,j]  = rootms
               rmsRESULTf32[i,j] = np.sqrt( a_div_by_nz_minus1          # .STO a "scaled"
                                          * np.dot(   dtdWIN,
                                                      dtdWIN.T
                                                      ).sum()
                                          # np.sum( ( dtdWIN            #         SUM of
                                          #       # - dtdWIN.mean()     #               mean-removed ( ALREADY done via scipy.signal.detrend( 'const' ) above )
                                          #           )**2              #               SQUARES
                                          #         )
                                            )                           #      ROOT
    return( a_num_of_NaNs )                                             # ret i4

- user3666197

我刚刚在一个随机的1000 x 1000网格上尝试了你的代码，函数运行速度显著提高（4.1分钟）。不确定为什么会加速，可能是由于我的原始矩阵的形状。此外，完整的网格大小为13078 x 11437，花费了11个小时。 - morrismc

好的，13k1 x 11k4 x 4B 开始进入分治区域以实现更快的处理和使用编译代码 - Cython + OpenMP 工具的组合将变得方便，但缓存行一致性增长会增加其影响力（或负面影响，如果在缓存行重用不良对齐排序中工作）。通过分裂瓷砖和 8 个 CPU 核心，比目标性能更好肯定是可以实现的，但 CPU/RAM 内存 I/O 通道代表了主要瓶颈。 - user3666197

所以我们回到了关于并行化以及如何通过现有代码或修改现有代码来实现这一目标的讨论。对此的任何见解都是受欢迎的。 - morrismc

@morrismc 参考重新编译 .detrend() 替代方案的建议，这将进一步加快所提出的 'numba.jit' 方式。如果需要更高级别的 OpenMP 并行性和 Cython 代码，则可以将 DEM 拆分为瓦片，以有效地利用所有 CPU 核心（请记住，RAM / CPU 内存 I/O 瓶颈仍将继续控制性能上限），而不是单线程、GIL 锁步执行（是的，纯序列，重新[串行化]代码执行）。有关 OpenMP / Cython 的详细信息，请参见 https://docs.cython.org/en/latest/src/userguide/parallelism.html#compiling - user3666197

我添加了一个更快但不太精确的算法，现在(13k,11k)的示例运行大约需要100秒。 - max9111

显示剩余3条评论

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- max9111 · Accepted Answer

使用 Numba 的解决方案

在某些情况下，如果您使用的所有函数都受支持，则很容易实现。在您的代码中，win = signal.detrend(win, type='linear') 是您需要在 Numba 中实现的部分，因为此函数不受支持。

在 Numba 中实现 detrend

如果您查看 detrend 的源代码，并提取与问题相关的部分，代码可能如下所示：

@nb.njit()
def detrend(w):
    Npts=w.shape[0]
    A=np.empty((Npts,2),dtype=w.dtype)
    for i in range(Npts):
        A[i,0]=1.*(i+1) / Npts
        A[i,1]=1.

    coef, resids, rank, s = np.linalg.lstsq(A, w.T)
    out=w.T- np.dot(A, coef)
    return out.T

我还为np.max(np.isnan(win)) == 1实现了一个更快的解决方案。

@nb.njit()
def isnan(win):
    for i in range(win.shape[0]):
        for j in range(win.shape[1]):
            if np.isnan(win[i,j]):
                return True
    return False

主要功能

由于我在这里使用了Numba，因此并行化非常简单，只需在外层循环上加上prange即可。

import numpy as np
import numba as nb

@nb.njit(parallel=True)
def RMSH_det_nb(DEM, w):
    [nrows, ncols] = np.shape(DEM)

    #create an empty array to store result
    rms = DEM*np.nan

    for i in nb.prange(w+1,nrows-w):
        for j in range(w+1,ncols-w):
            win = DEM[i-w:i+w-1,j-w:j+w-1]

            if isnan(win):
                rms[i,j] = np.nan
            else:
                win = detrend(win)
                z = win.flatten()
                nz = z.size
                rootms = np.sqrt(1 / (nz - 1) * np.sum((z-np.mean(z))**2))
                rms[i,j] = rootms

    return rms

时序（小例子）

w = 10
DEM=np.random.rand(100, 100).astype(np.float32)

res1=RMSH_det(DEM, w)
res2=RMSH_det_nb(DEM, w)
print(np.allclose(res1,res2,equal_nan=True))
#True

%timeit res1=RMSH_det(DEM, w)
#1.59 s ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res2=RMSH_det_nb(DEM, w) #approx. 55 times faster
#29 ms ± 1.85 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

大数组的时间计算

w = 10
DEM=np.random.rand(1355, 1165).astype(np.float32)
%timeit res2=RMSH_det_nb(DEM, w)
#6.63 s ± 21.1 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

[编辑] 使用正规方程实现

超定系统

这种方法的数值精度较低，但解决方案速度要快得多。

@nb.njit()
def isnan(win):
    for i in range(win.shape[0]):
        for j in range(win.shape[1]):
            if win[i,j]==np.nan:
                return True
    return False

@nb.njit()
def detrend(w):
    Npts=w.shape[0]
    A=np.empty((Npts,2),dtype=w.dtype)
    for i in range(Npts):
        A[i,0]=1.*(i+1) / Npts
        A[i,1]=1.

    coef, resids, rank, s = np.linalg.lstsq(A, w.T)
    out=w.T- np.dot(A, coef)
    return out.T

@nb.njit()
def detrend_2(w,T1,A):
    T2=np.dot(A.T,w.T)
    coef=np.linalg.solve(T1,T2)

    out=w.T- np.dot(A, coef)

    return out.T

@nb.njit(parallel=True)
def RMSH_det_nb_normal_eq(DEM,w):
    [nrows, ncols] = np.shape(DEM)

    #create an empty array to store result
    rms = DEM*np.nan

    Npts=w*2-1
    A=np.empty((Npts,2),dtype=DEM.dtype)
    for i in range(Npts):
        A[i,0]=1.*(i+1) / Npts
        A[i,1]=1.

    T1=np.dot(A.T,A)

    nz = Npts**2
    for i in nb.prange(w+1,nrows-w):
        for j in range(w+1,ncols-w):
            win = DEM[i-w:i+w-1,j-w:j+w-1]

            if isnan(win):
                rms[i,j] = np.nan
            else:
                win = detrend_2(win,T1,A)
                rootms = np.sqrt(1 / (nz - 1) * np.sum((win-np.mean(win))**2))
                rms[i,j] = rootms

    return rms

时间

w = 10
DEM=np.random.rand(100, 100).astype(np.float32)

res1=RMSH_det(DEM, w)
res2=RMSH_det_nb(DEM, w)
print(np.allclose(res1,res2,equal_nan=True))
#True

%timeit res1=RMSH_det(DEM, w)
#1.59 s ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res2=RMSH_det_nb_normal_eq(DEM,w)
#7.97 ms ± 89.4 µs per loop (mean ± std. dev. of 7 runs, 1 loop each)

使用正规方程的优化解决方案

为了避免昂贵的内存分配，会重复使用临时数组，并使用自定义矩阵乘法实现。这仅适用于非常小的矩阵，对于大多数情况来说，np.dot（sgeemm）将更快。

@nb.njit()
def matmult_2(A,B,out):
    for j in range(B.shape[1]):
        acc1=nb.float32(0)
        acc2=nb.float32(0)
        for k in range(B.shape[0]):
            acc1+=A[0,k]*B[k,j]
            acc2+=A[1,k]*B[k,j]
        out[0,j]=acc1
        out[1,j]=acc2
    return out

@nb.njit(fastmath=True)
def matmult_mod(A,B,w,out):
    for j in range(B.shape[1]):
        for i in range(A.shape[0]):
            acc=nb.float32(0)
            acc+=A[i,0]*B[0,j]+A[i,1]*B[1,j]
            out[j,i]=acc-w[j,i]
    return out

@nb.njit()
def detrend_2_opt(w,T1,A,Tempvar_1,Tempvar_2):
    T2=matmult_2(A.T,w.T,Tempvar_1)
    coef=np.linalg.solve(T1,T2)
    return matmult_mod(A, coef,w,Tempvar_2)

@nb.njit(parallel=True)
def RMSH_det_nb_normal_eq_opt(DEM,w):
    [nrows, ncols] = np.shape(DEM)

    #create an empty array to store result
    rms = DEM*np.nan

    Npts=w*2-1
    A=np.empty((Npts,2),dtype=DEM.dtype)
    for i in range(Npts):
        A[i,0]=1.*(i+1) / Npts
        A[i,1]=1.

    T1=np.dot(A.T,A)

    nz = Npts**2
    for i in nb.prange(w+1,nrows-w):
        Tempvar_1=np.empty((2,Npts),dtype=DEM.dtype)
        Tempvar_2=np.empty((Npts,Npts),dtype=DEM.dtype)
        for j in range(w+1,ncols-w):
            win = DEM[i-w:i+w-1,j-w:j+w-1]

            if isnan(win):
                rms[i,j] = np.nan
            else:
                win = detrend_2_opt(win,T1,A,Tempvar_1,Tempvar_2)
                rootms = np.sqrt(1 / (nz - 1) * np.sum((win-np.mean(win))**2))
                rms[i,j] = rootms

    return rms

时间

w = 10
DEM=np.random.rand(100, 100).astype(np.float32)

res1=RMSH_det(DEM, w)
res2=RMSH_det_nb_normal_eq_opt(DEM, w)
print(np.allclose(res1,res2,equal_nan=True))
#True

%timeit res1=RMSH_det(DEM, w)
#1.59 s ± 72 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
%timeit res2=RMSH_det_nb_normal_eq_opt(DEM,w)
#4.66 ms ± 87.2 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

isnan函数的使用时机

该函数是另一种完全不同的实现。如果NaN在数组的开头，它会更快，但即使不在，也能有一些加速。我用小数组(大约窗口大小)和@user3666197建议的大尺寸进行了基准测试。

case_1=np.full((20,20),np.nan)
case_2=np.full((20,20),0.)
case_2[10,10]=np.nan
case_3=np.full((20,20),0.)

case_4 = np.full( ( int( 1E4 ), int( 1E4 ) ),np.nan)
case_5 = np.ones( ( int( 1E4 ), int( 1E4 ) ) )

%timeit np.any(np.isnan(case_1))
%timeit np.any(np.isnan(case_2))
%timeit np.any(np.isnan(case_3))
%timeit np.any(np.isnan(case_4))
%timeit np.any(np.isnan(case_5))
#2.75 µs ± 73.1 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#2.75 µs ± 46.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#2.76 µs ± 32.9 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)
#81.3 ms ± 2.97 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
#86.7 ms ± 2.26 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

%timeit isnan(case_1)
%timeit isnan(case_2)
%timeit isnan(case_3)
%timeit isnan(case_4)
%timeit isnan(case_5)
#244 ns ± 5.02 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#357 ns ± 1.07 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#475 ns ± 9.28 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#235 ns ± 0.933 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
#58.8 ms ± 2.08 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Python：并行运行嵌套循环，2D移动窗口

使用 Numba 的解决方案

到目前为止，还是相当有前途的，是吧？

numba.jit() LLVM编译代码。哎呀，速度变慢了？

结论：

`numba.jit()` LLVM编译代码。哎呀，速度变慢了？