Numba and guvectorize for CUDA target: code running slower than expected


Details worth noting:

  • Large datasets: (10 million x 5), (200 x 10 million x 5)
  • Mostly using NumPy
  • Takes longer after every run
  • Using Spyder3
  • Windows 10 OS

The first attempt was using guvectorize with the following function. I am passing in a bunch of numpy arrays and attempting to use them to multiply across two of the arrays. This works if run with a target other than cuda. However, when switched to cuda it results in an unknown error:

File "C:\ProgramData\Anaconda3\lib\site-packages\numba\cuda\decorators.py", >line 82, in jitwrapper debug=debug)

TypeError: __init__() got an unexpected keyword argument 'debug'

In looking up this error, I ran into a lot of dead ends. My guess is that it should be a really simple fix that I am completely missing. It should also be noted that this error only occurs after running it once and it crashing due to memory overload.

os.environ["NUMBA_ENABLE_CUDASIM"] = "1"

os.environ["CUDA_VISIBLE_DEVICES"] = "10DE 1B06 63933842"
...

All of the arrays are numpy arrays.

@guvectorize(['void(int64, float64[:,:], float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'],
             '(),(m,o),(m,o),(n,m,o),(),() -> (n,m,o)',
             target='cuda', nopython=True)
def cVestDiscount (ed, orCV, vals, discount, n, rowCount, cv):
    for as_of_date in range(0,ed):
        for ID in range(0,rowCount):
            for num in range(0,n):
                cv[as_of_date][ID][num] = orCV[ID][num] * discount[as_of_date][ID][num]

Attempting to run the code with nvprofiler from the command line results in the following error:

Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory

I realized that I am using SLI-enabled graphics cards (both cards are identical, evga gtx 1080ti, and have the same device id), so I disabled SLI and added the "CUDA_VISIBLE_DEVICES" line to try restricting to the other card, but the results are the same.

I can still run the code with nvprof, but the cuda function is slow compared to njit(parallel=True) and prange. By using a smaller data size we can run the code, but it is still slower than target='parallel' and target='cpu'.

Why is cuda so slow, and what do these errors mean?

Thanks for the help!
Edit:
Here is a working example of the code.
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(int64, float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'], '(),(m,o),(n,m,o),(),() -> (n,m,o)', target='cuda', nopython=True)
def cVestDiscount (countRow, multBy, discount, n, countCol, cv):
    for as_of_date in range(0,countRow):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[as_of_date][ID][num] = multBy[ID][num] * discount[as_of_date][ID][num]

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))

I am able to run the code on cuda using a gtx 1080ti, however it is much slower than running it in parallel or on the cpu. I have looked at other posts regarding guvectorize, but none of them have helped me understand what is and isn't optimal to run in guvectorize. Is there any way to make this code "cuda worthy", or is simply doing multiplication across arrays too simple to see any benefit?


I suggest providing a [mcve]. - Robert Crovella
That nvprof output is a warning, not an error. - talonmies
If this is the only thing executed on the GPU (workflow: copy data to the GPU -> compute -> copy the results back), it is expected to be slower, because the memory copies take longer than the computation would take on the CPU. A simple scalar operation like this is completely limited by memory bandwidth. - max9111
2 Answers

To begin with, the basic operation you've shown is to take two matrices, transfer them to the GPU, do some elementwise multiplications to produce a 3rd array, and pass that 3rd array back to the host. It may be possible to make a numba/cuda guvectorize (or cuda.jit kernel) implementation that runs faster than a naive serial python implementation, but I doubt it would be possible to exceed the performance of well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing. This is because the arithmetic intensity per byte transferred between host and device is just too low. This operation is too simple.
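As a rough illustration of that last point, here is a back-of-envelope estimate (my own numbers, based on the 100 x 4000 x 5 float64 arrays used in the examples below, not part of the original answer):

elements = 100 * 4000 * 5                 # 2,000,000 multiplications, one per output element
bytes_moved = 3 * elements * 8            # two float64 inputs in, one float64 result out (~45.8 MB as nvprof reports it)
print(elements / bytes_moved)             # ~0.04 floating-point operations per byte moved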
Secondly, I believe it's important to start out with an understanding of what numba vectorize and guvectorize are intended to do. The basic principle is to write the ufunc definition from the standpoint of "what will one worker do?" and then allow numba to spin up multiple workers from that. The way you instruct numba to spin up multiple workers is to pass a data set that is larger than the signatures you have given. It should be noted that numba does not know how to parallelize a for-loop inside a ufunc definition. It gets parallel "strength" by taking your ufunc definition and running it among parallel workers, where each worker handles a "slice" of the data but runs your entire ufunc definition on that slice. As some additional reading, I've covered some of this ground here as well.
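To make the "one worker per slice" idea concrete, here is a minimal CPU-target sketch of my own (not part of the original answer); the function name and shapes are purely illustrative:

import numpy as np
from numba import guvectorize

# The signature '(n)->(n)' describes what ONE worker does: it receives a 1-D
# slice of length n and writes a 1-D result slice of length n. The for-loop
# below runs serially inside each worker.
@guvectorize(['void(float64[:], float64[:])'], '(n)->(n)')
def double_slice(x, res):
    for i in range(x.shape[0]):
        res[i] = 2.0 * x[i]

data = np.arange(500, dtype=np.float64).reshape(100, 5)
out = double_slice(data)   # the extra leading dimension (100) is what numba
                           # distributes across workers, one row per worker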
So a problem in your realization is that you have written a signature (and ufunc) which maps the entire input data set to a single worker (and a single ufunc invocation). As @talonmies showed, your underlying kernel is being spun up with a total of 64 threads/workers (which is far too small to be interesting on a GPU, even apart from the above statements about arithmetic intensity), but I suspect that 64 is really just numba's minimum threadblock size, and that in fact only 1 thread in that threadblock is doing any useful computational work. That one thread executes your entire ufunc, including all the for-loops, in a serial fashion.

That is obviously not what anyone would intend as rational use of vectorize or guvectorize.

So let's revisit what you are trying to do. Ultimately, your ufunc wants to multiply an input value from one array by an input value from another array and store the result in a third array, and we want to repeat that process many times. If all 3 arrays were the same size, we could actually realize this with vectorize and would not even have to resort to the more complicated guvectorize. Let's compare that approach to your original, focusing on the CUDA kernel execution. Here is a fully worked example, where t14.py is your original code run under the profiler, and t15.py is a vectorize version of it, acknowledging that we have changed the size of the multBy array to match cv and discount:
$ nvprof --print-gpu-trace python t14.py
==4145== NVPROF is profiling process 4145, command: python t14.py
Function: discount factor cumVest duration (seconds):1.24354910851
==4145== Profiling application: python t14.py
==4145== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
312.36ms  1.2160us                    -               -         -         -         -        8B  6.2742MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.81ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.52ms  5.8696ms                    -               -         -         -         -  15.259MB  2.5387GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.74ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.93ms     896ns                    -               -         -         -         -        8B  8.5149MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
321.40ms  1.22538s              (1 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [37]
1.54678s  7.1816ms                    -               -         -         -         -  15.259MB  2.0749GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer


@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t15.py
==4167== NVPROF is profiling process 4167, command: python t15.py
Function: discount factor cumVest duration (seconds):0.37507891655
==4167== Profiling application: python t15.py
==4167== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
193.92ms  6.2729ms                    -               -         -         -         -  15.259MB  2.3755GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
201.09ms  5.7101ms                    -               -         -         -         -  15.259MB  2.6096GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
364.92ms  842.49us          (15625 1 1)       (128 1 1)        13        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__vectorized_cVestDiscount$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>) [31]
365.77ms  7.1528ms                    -               -         -         -         -  15.259MB  2.0833GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

We see that your application reported a run-time of about 1.244 seconds, whereas the vectorize version reports a run-time of about 0.375 seconds. But both of these numbers include python overhead. If we look at the generated CUDA kernel durations in the profiler, the difference is even more stark. We see that the original kernel took about 1.225 seconds, whereas the vectorize kernel executes in about 842 microseconds (i.e. less than 1 millisecond). We also note that the compute kernel time is now much smaller than the time it takes to transfer the 3 arrays to/from the GPU (about 20 milliseconds in total), and we note that the kernel dimensions are now 15625 blocks of 128 threads each, for a total thread/worker count of 2000000, exactly matching the total number of multiply operations to be done, and substantially more than the paltry 64 threads (and possibly, really only 1 thread) doing work in the original code.
Given the simplicity of the above vectorize approach, if what you really want to do is this element-wise multiplication, then you might consider just replicating multBy so that it dimensionally matches the other two arrays, and be done with it.
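One way to do that replication is with plain numpy broadcasting; this is my own sketch, not code from the answer:

import numpy as np

multBy = np.float64(np.arange(20000).reshape(4000, 5))
discount = np.float64(np.arange(2000000).reshape(100, 4000, 5))

# Materialize a copy of multBy repeated along a new leading axis so that it
# matches discount's (100, 4000, 5) shape; the t15.py @vectorize ufunc can then
# be applied to the two same-shaped arrays directly.
multBy_full = np.ascontiguousarray(np.broadcast_to(multBy, discount.shape))
print(multBy_full.shape)   # (100, 4000, 5)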
But the question remains: how do we handle dissimilar input array sizes, as in the original problem? For that, I think we need to go to guvectorize (or, as @talonmies indicated, write your own @cuda.jit kernel, which is probably the best advice, notwithstanding the possibility that none of these approaches may overcome the overhead of transferring data to/from the device, as already mentioned).

In order to tackle this with guvectorize, we need to think more carefully about the "slicing" concept already mentioned. Let's re-write your guvectorize kernel so that it only operates on a "slice" of the overall data, and then allow the guvectorize launching function to spin up multiple workers to tackle it, one worker per slice.

In CUDA, we like to have lots of workers; you really can't have too many. This affects how we "slice" our arrays, so as to give multiple workers something to act on. If we were to slice along the 3rd (last) dimension, we would only have 5 slices to work with, so at most 5 workers. Likewise, if we slice along the first, or countRow, dimension, we will have 100 slices, so at most 100 workers. Ideally, we would slice along the 2nd, or countCol, dimension. However, for simplicity, I will slice along the first, or countRow, dimension. This is clearly non-optimal, but see the worked example below for how you might handle the slicing-by-second-dimension problem. Slicing by the first dimension means we remove the first for-loop from the guvectorize kernel and allow the ufunc system to parallelize along that dimension (based on the sizes of the arrays we pass). The code could look something like this:
$ cat t16.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (multBy, discount, n, countCol, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[ID][num] * discount[ID][num]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t16.py
==4275== NVPROF is profiling process 4275, command: python t16.py
Function: discount factor cumVest duration (seconds):0.0670170783997
==4275== Profiling application: python t16.py
==4275== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
307.05ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
307.79ms  5.9293ms                    -               -         -         -         -  15.259MB  2.5131GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
314.34ms  1.3440us                    -               -         -         -         -        8B  5.6766MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
314.54ms     896ns                    -               -         -         -         -        8B  8.5149MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
317.27ms  47.398ms              (2 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<double, int=3, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
364.67ms  7.3799ms                    -               -         -         -         -  15.259MB  2.0192GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

Observations:

  1. These code changes are associated with removing the countRow parameter, removing the first for-loop from the guvectorize kernel, and making the corresponding changes to the function signature. We also change the 3-dimensional types in the signature to 2-dimensional. We are, after all, taking a 2-dimensional "slice" of the 3-dimensional data and letting each worker operate on one slice.

  2. The kernel dimensions reported by the profiler are now 2 blocks instead of 1. This makes sense, because in the original realization there was really only 1 "slice" presented, so only 1 worker and therefore only 1 thread was needed (but numba spun up a single threadblock of 64 threads). In this realization there are 100 slices, and numba chose to spin up 2 threadblocks of 64 workers/threads each, to provide the 100 workers/threads needed.

  3. The kernel performance reported by the profiler, 47.4ms, is now somewhere in between the original (~1.224s) and the massively parallel vectorize version (~1ms). So going from 1 to 100 workers has sped things up considerably, but there are more performance gains possible. If you figure out how to slice on the countCol dimension, you can probably get closer to the vectorize version performance-wise (see below). Note that the gap between where we are here (~47ms) and the vectorize version (~1ms) is more than enough to pay for the additional transfer cost (~5ms or less) of moving a slightly larger multBy matrix to the device, in order to enable the vectorize simplicity.

A few additional comments on the python timing: I believe the exact behavior of how python compiles the necessary kernels for the original, the vectorize, and the improved guvectorize versions differs. If we modify the t15.py code to do a "warm-up" run, then at least the python timing becomes consistent, trend-wise, with the overall wall time and with the kernel-only timing:

$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer


@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
#warm-up run
cv = cVestDiscount(multBy, discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ time python t14.py
Function: discount factor cumVest duration (seconds):1.24376320839

real    0m2.522s
user    0m1.572s
sys     0m0.809s
$ time python t15.py
Function: discount factor cumVest duration (seconds):0.0228319168091

real    0m1.050s
user    0m0.473s
sys     0m0.445s
$ time python t16.py
Function: discount factor cumVest duration (seconds):0.0665760040283

real    0m1.252s
user    0m0.680s
sys     0m0.441s
$

Now, responding to a question in the comments that effectively asks: "How would I rework the problem to slice along the 4000 (countCol, or 'middle') dimension?" We can take guidance from the approach that worked for slicing along the first dimension. One possible approach is to rearrange the shape of the arrays so that the 4000 dimension is the first dimension, and then remove it, similar to what we did in the previous guvectorize treatment. Here is a worked example:
$ cat t17.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(int64, float64[:], float64[:,:], int64, float64[:,:])'], '(),(o),(m,o),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (countCol, multBy, discount, n, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[num] * discount[ID][num]

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(4000,100,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(4000,100,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ python t17.py
Function: discount factor cumVest duration (seconds):0.0266749858856
$ nvprof --print-gpu-trace python t17.py
==8544== NVPROF is profiling process 8544, command: python t17.py
Function: discount factor cumVest duration (seconds):0.0268459320068
==8544== Profiling application: python t17.py
==8544== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
304.92ms  1.1840us                    -               -         -         -         -        8B  6.4437MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
305.36ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
306.08ms  6.0208ms                    -               -         -         -         -  15.259MB  2.4749GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.44ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.59ms  8.9961ms             (63 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=2, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
322.59ms  7.2772ms                    -               -         -         -         -  15.259MB  2.0476GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

Fairly predictably, we observe that the execution time drops from ~47ms when we sliced into 100 workers down to ~9ms when we slice into 4000 workers. Similarly, we observe that numba chooses to spin up 63 blocks of 64 threads each, 4032 threads in total, to handle the 4000 workers needed for this "slicing".
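As a quick sanity check of those block counts (my own arithmetic, assuming numba's apparent default of 64 threads per block for guvectorize CUDA kernels):

import math

for workers in (100, 4000):
    blocks = math.ceil(workers / 64)
    print(workers, "workers ->", blocks, "blocks x 64 threads =", blocks * 64)
# 100 workers  -> 2 blocks  (128 threads),  matching the t16.py profile
# 4000 workers -> 63 blocks (4032 threads), matching the t17.py profile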
That is still not as fast as the ~1ms vectorize kernel (which has many more available parallel "slices" for workers to work on), but it is quite a bit faster than the ~1.2s kernel proposed in the original question. And the overall wall time of the python code is about 2x faster as well, even with all the python overhead included.
As a final observation, let's revisit the statement I made earlier (which is similar to statements made in the comments and in the other answer): "I doubt it would be possible to exceed the performance of well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing."
We now have convenient test cases in t16.py and t17.py that we can use to test this. For simplicity I'll choose t16.py. We can "convert it back to CPU code" simply by removing the target designation from the guvectorize ufunc:
$ cat t16a.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)')
def cVestDiscount (multBy, discount, n, countCol, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[ID][num] * discount[ID][num]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ time python t16a.py
Function: discount factor cumVest duration (seconds):0.00657796859741

real    0m0.528s
user    0m0.474s
sys     0m0.047s
$

So we can see that this CPU-only version runs the function in about 6 milliseconds, and it has none of the GPU "overhead" such as CUDA initialization and copying data to/from the GPU. The overall wall time is also our best measurement, at about 0.5s, versus about 1.0s for our best GPU case. So this particular problem, due to its low arithmetic intensity per byte of data transfer, is probably not well-suited to GPU computation.


Thank you so much for your answer, it really cleared up a lot of my confusion about Numba. When I run the vectorize code on my machine, I see my block size set to 64, where yours is 128 (for reference, my machine has a GTX 1080ti). Is there a limit, or is this perhaps something that can be fixed by changing an Nvidia setting or adding some code? Secondly, regarding guvectorize, I tried slicing the arrays along the second axis to bring the time down. - Bryce Booze
A block size of 64 vs. 128 should not be a major performance concern. It is not a card/hardware limit (the hardware limit is 1024). I wouldn't worry about it much. We are probably running different versions of numba. It is not a focus of my concern in any of this. - Robert Crovella
float64[:] while also removing the outermost for loop, but I got a ValueError: arg #1: outer dimension mismatch error. Do you think I am using the signature incorrectly by making it a one-dimensional array and removing the other for loop? - Bryce Booze
I eventually ran into the outer dimension mismatch issue as well. So I suspect that the simple decomposition along the lines you are currently attempting may not be workable with guvectorize. However, there is a clue in the fact that it does work when we remove only the countRow outer loop. So my approach was to reshape the array so that the 4000 dimension is the outer dimension, and then eliminate that one with guvectorize. See my edited answer, which contains what I think is a worked example (t17.py). - Robert Crovella
Thank you for all your help! I understand that cuda isn't the best option for this particular calculation, but I now have a much better understanding of these two decorators, which will definitely help in the future. - Bryce Booze

"

Profiling with Numba 0.38.1 and CUDA 8.0 makes it immediately apparent why the gufunc Numba emits and runs is so slow:

"
==24691== Profiling application: python slowvec.py
==24691== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
271.33ms  1.2800us                    -               -         -         -         -        8B  5.9605MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
271.65ms  14.591us                    -               -         -         -         -  156.25KB  10.213GB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
272.09ms  2.5868ms                    -               -         -         -         -  15.259MB  5.7605GB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
274.98ms     992ns                    -               -         -         -         -        8B  7.6909MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
275.17ms     640ns                    -               -         -         -         -        8B  11.921MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
276.33ms  657.28ms              (1 1 1)        (64 1 1)        40        0B        0B         -           -  GeForce GTX 970         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [38]
933.62ms  3.5128ms                    -               -         -         -         -  15.259MB  4.2419GB/s  GeForce GTX 970         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.

The resulting kernel launch runs the code with a single block of 64 threads. On a GPU which can theoretically run up to 2048 threads per MP, and which has 23 MPs, that means roughly 99.9% of your GPU's theoretical processing capacity is going unused. This looks like a ridiculous design choice on the part of the numba developers, and I would report it as a bug if it is hindering you (and it seems it is).
The obvious solution is to rewrite your function as a @cuda.jit function in the CUDA python kernel dialect and take explicit control of the execution parameters. That way you can at least ensure that the code is run with enough threads to potentially use all of the capacity of your hardware. It is still a very memory-bound operation, so what you can achieve in speed-up may well be limited to considerably less than the ratio of your GPU's memory bandwidth to your CPU's. Even in the best case, that might not be enough to amortize the cost of the host-to-device memory transfers, so there may be no performance gain at all.
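For reference, here is a minimal sketch of what such a @cuda.jit kernel with an explicit launch configuration might look like; this is my own untuned illustration using the array shapes from the question, not code from this answer:

import math
import numpy as np
from numba import cuda

@cuda.jit
def cvest_discount_kernel(multBy, discount, cv):
    # One thread per element of the flattened (100, 4000, 5) output array.
    idx = cuda.grid(1)
    total = cv.shape[0] * cv.shape[1] * cv.shape[2]
    if idx < total:
        i = idx // (cv.shape[1] * cv.shape[2])
        j = (idx // cv.shape[2]) % cv.shape[1]
        k = idx % cv.shape[2]
        cv[i, j, k] = multBy[j, k] * discount[i, j, k]

multBy = np.arange(20000, dtype=np.float64).reshape(4000, 5)
discount = np.arange(2000000, dtype=np.float64).reshape(100, 4000, 5)
cv = np.zeros_like(discount)

threads_per_block = 256
blocks = math.ceil(cv.size / threads_per_block)   # 7813 blocks for 2,000,000 elements
cvest_discount_kernel[blocks, threads_per_block](multBy, discount, cv)

Note that numba will still copy the host arrays to the device and back around the kernel launch, so the data-transfer cost discussed above is unchanged.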
In short, beware the perils of automatic compiler-generated parallelism...

As an addendum, I managed to work out how to get the PTX of the code numba emits, and suffice to say, it is absolutely terrible (and so long that I can't actually post all of it):

{
    .reg .pred  %p<9>;
    .reg .b32   %r<8>;
    .reg .f64   %fd<4>;
    .reg .b64   %rd<137>;


    ld.param.u64    %rd29, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_5];
    ld.param.u64    %rd31, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_11];
    ld.param.u64    %rd32, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_12];
    ld.param.u64    %rd34, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_14];
    ld.param.u64    %rd35, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_15];
    ld.param.u64    %rd36, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_16];
    ld.param.u64    %rd37, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_17];
    ld.param.u64    %rd38, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_22];
    ld.param.u64    %rd39, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_23];
    ld.param.u64    %rd40, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_24];
    ld.param.u64    %rd41, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_25];
    ld.param.u64    %rd42, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_26];
    ld.param.u64    %rd43, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_27];
    ld.param.u64    %rd44, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_28];
    ld.param.u64    %rd45, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_29];
    ld.param.u64    %rd46, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_30];
    ld.param.u64    %rd48, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_36];
    ld.param.u64    %rd51, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_43];
    ld.param.u64    %rd53, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_49];
    ld.param.u64    %rd54, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_50];
    ld.param.u64    %rd55, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_51];
    ld.param.u64    %rd56, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_52];
    ld.param.u64    %rd57, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_53];
    ld.param.u64    %rd58, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_54];
    ld.param.u64    %rd59, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_55];
    ld.param.u64    %rd60, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_56];
    ld.param.u64    %rd61, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_57];
    mov.u32     %r1, %tid.x;
    mov.u32     %r3, %ctaid.x;
    mov.u32     %r2, %ntid.x;
    mad.lo.s32  %r4, %r3, %r2, %r1;
    min.s64     %rd62, %rd32, %rd29;
    min.s64     %rd63, %rd39, %rd62;
    min.s64     %rd64, %rd48, %rd63;
    min.s64     %rd65, %rd51, %rd64;
    min.s64     %rd66, %rd54, %rd65;
    cvt.s64.s32 %rd1, %r4;
    setp.le.s64 %p2, %rd66, %rd1;
    @%p2 bra    BB0_8;

    ld.param.u64    %rd126, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_42];
    ld.param.u64    %rd125, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_44];
    ld.param.u64    %rd124, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_35];
    ld.param.u64    %rd123, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_37];
    ld.param.u64    %rd122, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_4];
    ld.param.u64    %rd121, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_6];
    cvt.u32.u64 %r5, %rd1;
    setp.lt.s32 %p1, %r5, 0;
    selp.b64    %rd67, %rd29, 0, %p1;
    add.s64     %rd68, %rd67, %rd1;
    mul.lo.s64  %rd69, %rd68, %rd121;
    add.s64     %rd70, %rd69, %rd122;
    selp.b64    %rd71, %rd48, 0, %p1;
    add.s64     %rd72, %rd71, %rd1;
    mul.lo.s64  %rd73, %rd72, %rd123;
    add.s64     %rd74, %rd73, %rd124;
    ld.u64  %rd2, [%rd74];
    selp.b64    %rd75, %rd51, 0, %p1;
    add.s64     %rd76, %rd75, %rd1;
    mul.lo.s64  %rd77, %rd76, %rd125;
    add.s64     %rd78, %rd77, %rd126;
    ld.u64  %rd3, [%rd78];
    ld.u64  %rd4, [%rd70];
    setp.lt.s64 %p3, %rd4, 1;
    @%p3 bra    BB0_8;

    ld.param.u64    %rd128, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_13];
    ld.param.u64    %rd127, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_12];
    selp.b64    %rd80, %rd127, 0, %p1;
    mov.u64     %rd79, 0;
    min.s64     %rd81, %rd128, %rd79;
    min.s64     %rd82, %rd34, %rd79;
    selp.b64    %rd83, %rd39, 0, %p1;
    min.s64     %rd84, %rd40, %rd79;
    min.s64     %rd85, %rd41, %rd79;
    min.s64     %rd86, %rd42, %rd79;
    selp.b64    %rd87, %rd54, 0, %p1;
    min.s64     %rd88, %rd55, %rd79;
    min.s64     %rd89, %rd56, %rd79;
    min.s64     %rd90, %rd57, %rd79;
    mul.lo.s64  %rd91, %rd90, %rd61;
    add.s64     %rd92, %rd53, %rd91;
    mul.lo.s64  %rd93, %rd89, %rd60;
    add.s64     %rd94, %rd92, %rd93;
    mul.lo.s64  %rd95, %rd88, %rd59;
    add.s64     %rd96, %rd94, %rd95;
    add.s64     %rd98, %rd87, %rd1;
    mul.lo.s64  %rd99, %rd58, %rd98;
    add.s64     %rd5, %rd96, %rd99;
    mul.lo.s64  %rd100, %rd86, %rd46;
    add.s64     %rd101, %rd38, %rd100;
    mul.lo.s64  %rd102, %rd85, %rd45;
    add.s64     %rd103, %rd101, %rd102;
    mul.lo.s64  %rd104, %rd84, %rd44;
    add.s64     %rd105, %rd103, %rd104;
    add.s64     %rd106, %rd83, %rd1;
    mul.lo.s64  %rd107, %rd43, %rd106;
    add.s64     %rd6, %rd105, %rd107;
    mul.lo.s64  %rd108, %rd82, %rd37;
    add.s64     %rd109, %rd31, %rd108;
    mul.lo.s64  %rd110, %rd81, %rd36;
    add.s64     %rd111, %rd109, %rd110;
    add.s64     %rd112, %rd80, %rd1;
    mul.lo.s64  %rd113, %rd35, %rd112;
    add.s64     %rd7, %rd111, %rd113;
    add.s64     %rd8, %rd2, 1;
    mov.u64     %rd131, %rd79;

BB0_3:
    mul.lo.s64  %rd115, %rd59, %rd131;
    add.s64     %rd10, %rd5, %rd115;
    mul.lo.s64  %rd116, %rd44, %rd131;
    add.s64     %rd11, %rd6, %rd116;
    setp.lt.s64 %p4, %rd3, 1;
    mov.u64     %rd130, %rd79;
    mov.u64     %rd132, %rd3;
    @%p4 bra    BB0_7;

BB0_4:
    mov.u64     %rd13, %rd132;
    mov.u64     %rd12, %rd130;
    mul.lo.s64  %rd117, %rd60, %rd12;
    add.s64     %rd136, %rd10, %rd117;
    mul.lo.s64  %rd118, %rd45, %rd12;
    add.s64     %rd135, %rd11, %rd118;
    mul.lo.s64  %rd119, %rd36, %rd12;
    add.s64     %rd134, %rd7, %rd119;
    setp.lt.s64 %p5, %rd2, 1;
    mov.u64     %rd133, %rd8;
    @%p5 bra    BB0_6;

BB0_5:
    mov.u64     %rd17, %rd133;
    ld.f64  %fd1, [%rd135];
    ld.f64  %fd2, [%rd134];
    mul.f64     %fd3, %fd2, %fd1;
    st.f64  [%rd136], %fd3;
    add.s64     %rd136, %rd136, %rd61;
    add.s64     %rd135, %rd135, %rd46;
    add.s64     %rd134, %rd134, %rd37;
    add.s64     %rd24, %rd17, -1;
    setp.gt.s64 %p6, %rd24, 1;
    mov.u64     %rd133, %rd24;
    @%p6 bra    BB0_5;

BB0_6:
    add.s64     %rd25, %rd13, -1;
    add.s64     %rd26, %rd12, 1;
    setp.gt.s64 %p7, %rd13, 1;
    mov.u64     %rd130, %rd26;
    mov.u64     %rd132, %rd25;
    @%p7 bra    BB0_4;

BB0_7:
    sub.s64     %rd120, %rd4, %rd131;
    add.s64     %rd131, %rd131, 1;
    setp.gt.s64 %p8, %rd120, 1;
    @%p8 bra    BB0_3;

BB0_8:
    ret;
}

All of those integer operations to perform exactly one double-precision multiply!


Thanks for your reply, I've been experimenting with the @cuda.jit signature but ran into issues with the number of thread blocks. How did you get the grid size and block size in your first profiling example? Thanks again for the quick reply. - Bryce Booze
@BryceB If you run nvprof with the --print-gpu-trace option, you will see the same output. But if you write your own @cuda.jit kernel, you are responsible for setting the grid and block sizes yourself. - talonmies
