Numba and guvectorize for CUDA target: code running slower than expected


Details worth noting:

  • Large datasets: (10 million x 5), (200 x 10 million x 5)
  • Mostly using NumPy
  • Takes longer after every run
  • Using Spyder3
  • Windows 10 OS

The first attempt was using guvectorize with the following function. I am passing in a bunch of numpy arrays and attempting to use them to multiply across two of the arrays. This works if run with a target other than cuda. However, when switched to cuda it results in an unknown error:

File "C:\ProgramData\Anaconda3\lib\site-packages\numba\cuda\decorators.py", >line 82, in jitwrapper debug=debug)

TypeError: __init__() got an unexpected keyword argument 'debug'

In looking up this error, I ran into a lot of dead ends. My guess is that it should be a really simple fix that I am completely missing. It should also be noted that this error only occurs after running it once and it crashing due to memory overload.

os.environ["NUMBA_ENABLE_CUDASIM"] = "1"

os.environ["CUDA_VISIBLE_DEVICES"] = "10DE 1B06 63933842"
...

All of the arrays are numpy arrays.

@guvectorize(['void(int64, float64[:,:], float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'],
             '(),(m,o),(m,o),(n,m,o),(),() -> (n,m,o)',
             target='cuda', nopython=True)
def cVestDiscount (ed, orCV, vals, discount, n, rowCount, cv):
    for as_of_date in range(0,ed):
        for ID in range(0,rowCount):
            for num in range(0,n):
                cv[as_of_date][ID][num] = orCV[ID][num] * discount[as_of_date][ID][num]

Attempting to run the code with nvprofiler from the command line results in the following error:

Warning: Unified Memory Profiling is not supported on the current configuration because a pair of devices without peer-to-peer support is detected on this multi-GPU setup. When peer mappings are not available, system falls back to using zero-copy memory. It can cause kernels, which access unified memory, to run slower. More details can be found at: http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#um-managed-memory

I realized that I am using SLI-enabled graphics cards (both cards are identical, evga gtx 1080ti, and have the same device id), so I disabled SLI and added the "CUDA_VISIBLE_DEVICES" line to try restricting to the other card, but the results are the same.

I can still run the code with nvprof, but the cuda function is slow compared to njit(parallel=True) and prange. By using a smaller data size we can run the code, but it is still slower than target='parallel' and target='cpu'.

Why is cuda so slow, and what do these errors mean?

Thanks for the help!
Edit:
Here is a working example of the code.
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(int64, float64[:,:], float64[:,:,:], int64, int64, float64[:,:,:])'], '(),(m,o),(n,m,o),(),() -> (n,m,o)', target='cuda', nopython=True)
def cVestDiscount (countRow, multBy, discount, n, countCol, cv):
    for as_of_date in range(0,countRow):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[as_of_date][ID][num] = multBy[ID][num] * discount[as_of_date][ID][num]

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))

I am able to run the code on cuda using a gtx 1080ti, however it is much slower than running it in parallel or on the cpu. I have looked at other posts regarding guvectorize, but none of them have helped me understand what is and isn't optimal to run in guvectorize. Is there any way to make this code "cuda worthy", or is simply doing multiplication across arrays too simple to see any benefit?


I suggest providing a [mcve]. - Robert Crovella
That nvprof output is a warning, not an error. - talonmies
If this is the only thing executed on the GPU (workflow: copy data to the GPU -> compute -> copy the results back), it is expected to be slower, because the memory copies take longer than the computation would take on the CPU. A simple scalar operation like this is completely limited by memory bandwidth. - max9111
2 Answers

To begin with, the basic operation you've shown is to take two matrices, transfer them to the GPU, do some elementwise multiplications to produce a 3rd array, and pass that 3rd array back to the host. It may be possible to make a numba/cuda guvectorize (or cuda.jit kernel) implementation that runs faster than a naive serial python implementation, but I doubt it would be possible to exceed the performance of well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing. This is because the arithmetic intensity per byte transferred between host and device is just too low. This operation is too simple.
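As a rough illustration of that last point, here is a back-of-envelope estimate (my own numbers, based on the 100 x 4000 x 5 float64 arrays used in the examples below, not part of the original answer):

elements = 100 * 4000 * 5                 # 2,000,000 multiplications, one per output element
bytes_moved = 3 * elements * 8            # two float64 inputs in, one float64 result out (~45.8 MB as nvprof reports it)
print(elements / bytes_moved)             # ~0.04 floating-point operations per byte moved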
Secondly, I believe it's important to start out with an understanding of what numba vectorize and guvectorize are intended to do. The basic principle is to write the ufunc definition from the standpoint of "what will one worker do?" and then allow numba to spin up multiple workers from that. The way you instruct numba to spin up multiple workers is to pass a data set that is larger than the signatures you have given. It should be noted that numba does not know how to parallelize a for-loop inside a ufunc definition. It gets parallel "strength" by taking your ufunc definition and running it among parallel workers, where each worker handles a "slice" of the data but runs your entire ufunc definition on that slice. As some additional reading, I've covered some of this ground here as well.
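To make the "one worker per slice" idea concrete, here is a minimal CPU-target sketch of my own (not part of the original answer); the function name and shapes are purely illustrative:

import numpy as np
from numba import guvectorize

# The signature '(n)->(n)' describes what ONE worker does: it receives a 1-D
# slice of length n and writes a 1-D result slice of length n. The for-loop
# below runs serially inside each worker.
@guvectorize(['void(float64[:], float64[:])'], '(n)->(n)')
def double_slice(x, res):
    for i in range(x.shape[0]):
        res[i] = 2.0 * x[i]

data = np.arange(500, dtype=np.float64).reshape(100, 5)
out = double_slice(data)   # the extra leading dimension (100) is what numba
                           # distributes across workers, one row per worker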
So a problem in your realization is that you have written a signature (and ufunc) which maps the entire input data set to a single worker (and a single ufunc invocation). As @talonmies showed, your underlying kernel is being spun up with a total of 64 threads/workers (which is far too small to be interesting on a GPU, even apart from the above statements about arithmetic intensity), but I suspect that 64 is really just numba's minimum threadblock size, and that in fact only 1 thread in that threadblock is doing any useful computational work. That one thread executes your entire ufunc, including all the for-loops, in a serial fashion.

That is obviously not what anyone would intend as rational use of vectorize or guvectorize.

So let's revisit what you are trying to do. Ultimately, your ufunc wants to multiply an input value from one array by an input value from another array and store the result in a third array, and we want to repeat that process many times. If all 3 arrays were the same size, we could actually realize this with vectorize and would not even have to resort to the more complicated guvectorize. Let's compare that approach to your original, focusing on the CUDA kernel execution. Here is a fully worked example, where t14.py is your original code run under the profiler, and t15.py is a vectorize version of it, acknowledging that we have changed the size of the multBy array to match cv and discount:
$ nvprof --print-gpu-trace python t14.py
==4145== NVPROF is profiling process 4145, command: python t14.py
Function: discount factor cumVest duration (seconds):1.24354910851
==4145== Profiling application: python t14.py
==4145== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
312.36ms  1.2160us                    -               -         -         -         -        8B  6.2742MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.81ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.52ms  5.8696ms                    -               -         -         -         -  15.259MB  2.5387GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.74ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
319.93ms     896ns                    -               -         -         -         -        8B  8.5149MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
321.40ms  1.22538s              (1 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [37]
1.54678s  7.1816ms                    -               -         -         -         -  15.259MB  2.0749GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer


@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t15.py
==4167== NVPROF is profiling process 4167, command: python t15.py
Function: discount factor cumVest duration (seconds):0.37507891655
==4167== Profiling application: python t15.py
==4167== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
193.92ms  6.2729ms                    -               -         -         -         -  15.259MB  2.3755GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
201.09ms  5.7101ms                    -               -         -         -         -  15.259MB  2.6096GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
364.92ms  842.49us          (15625 1 1)       (128 1 1)        13        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__vectorized_cVestDiscount$242(Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>, Array<double, int=1, A, mutable, aligned>) [31]
365.77ms  7.1528ms                    -               -         -         -         -  15.259MB  2.0833GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

We see that your application reported a run-time of about 1.244 seconds, whereas the vectorize version reports a run-time of about 0.375 seconds. But both of these numbers include python overhead. If we look at the generated CUDA kernel durations in the profiler, the difference is even more stark. We see that the original kernel took about 1.225 seconds, whereas the vectorize kernel executes in about 842 microseconds (i.e. less than 1 millisecond). We also note that the compute kernel time is now much smaller than the time it takes to transfer the 3 arrays to/from the GPU (about 20 milliseconds in total), and we note that the kernel dimensions are now 15625 blocks of 128 threads each, for a total thread/worker count of 2000000, exactly matching the total number of multiply operations to be done, and substantially more than the paltry 64 threads (and possibly, really only 1 thread) doing work in the original code.
Given the simplicity of the above vectorize approach, if what you really want to do is this element-wise multiplication, then you might consider just replicating multBy so that it dimensionally matches the other two arrays, and be done with it.
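One way to do that replication is with plain numpy broadcasting; this is my own sketch, not code from the answer:

import numpy as np

multBy = np.float64(np.arange(20000).reshape(4000, 5))
discount = np.float64(np.arange(2000000).reshape(100, 4000, 5))

# Materialize a copy of multBy repeated along a new leading axis so that it
# matches discount's (100, 4000, 5) shape; the t15.py @vectorize ufunc can then
# be applied to the two same-shaped arrays directly.
multBy_full = np.ascontiguousarray(np.broadcast_to(multBy, discount.shape))
print(multBy_full.shape)   # (100, 4000, 5)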
But the question remains: how do we handle dissimilar input array sizes, as in the original problem? For that, I think we need to go to guvectorize (or, as @talonmies indicated, write your own @cuda.jit kernel, which is probably the best advice, notwithstanding the possibility that none of these approaches may overcome the overhead of transferring data to/from the device, as already mentioned).

In order to tackle this with guvectorize, we need to think more carefully about the "slicing" concept already mentioned. Let's re-write your guvectorize kernel so that it only operates on a "slice" of the overall data, and then allow the guvectorize launching function to spin up multiple workers to tackle it, one worker per slice.

In CUDA, we like to have lots of workers; you really can't have too many. This affects how we "slice" our arrays, so as to give multiple workers something to act on. If we were to slice along the 3rd (last) dimension, we would only have 5 slices to work with, so at most 5 workers. Likewise, if we slice along the first, or countRow, dimension, we will have 100 slices, so at most 100 workers. Ideally, we would slice along the 2nd, or countCol, dimension. However, for simplicity, I will slice along the first, or countRow, dimension. This is clearly non-optimal, but see the worked example below for how you might handle the slicing-by-second-dimension problem. Slicing by the first dimension means we remove the first for-loop from the guvectorize kernel and allow the ufunc system to parallelize along that dimension (based on the sizes of the arrays we pass). The code could look something like this:
$ cat t16.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (multBy, discount, n, countCol, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[ID][num] * discount[ID][num]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ nvprof --print-gpu-trace python t16.py
==4275== NVPROF is profiling process 4275, command: python t16.py
Function: discount factor cumVest duration (seconds):0.0670170783997
==4275== Profiling application: python t16.py
==4275== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
307.05ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
307.79ms  5.9293ms                    -               -         -         -         -  15.259MB  2.5131GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
314.34ms  1.3440us                    -               -         -         -         -        8B  5.6766MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
314.54ms     896ns                    -               -         -         -         -        8B  8.5149MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
317.27ms  47.398ms              (2 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<double, int=3, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
364.67ms  7.3799ms                    -               -         -         -         -  15.259MB  2.0192GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

Observations:

  1. These code changes are associated with removing the countRow parameter, removing the first for-loop from the guvectorize kernel, and making the corresponding changes to the function signature. We also change the 3-dimensional types in the signature to 2-dimensional. We are, after all, taking a 2-dimensional "slice" of the 3-dimensional data and letting each worker operate on one slice.

  2. The kernel dimensions reported by the profiler are now 2 blocks instead of 1. This makes sense, because in the original realization there was really only 1 "slice" presented, so only 1 worker and therefore only 1 thread was needed (but numba spun up a single threadblock of 64 threads). In this realization there are 100 slices, and numba chose to spin up 2 threadblocks of 64 workers/threads each, to provide the 100 workers/threads needed.

  3. The kernel performance reported by the profiler, 47.4ms, is now somewhere in between the original (~1.224s) and the massively parallel vectorize version (~1ms). So going from 1 to 100 workers has sped things up considerably, but there are more performance gains possible. If you figure out how to slice on the countCol dimension, you can probably get closer to the vectorize version performance-wise (see below). Note that the gap between where we are here (~47ms) and the vectorize version (~1ms) is more than enough to pay for the additional transfer cost (~5ms or less) of moving a slightly larger multBy matrix to the device, in order to enable the vectorize simplicity.

A few additional comments on the python timing: I believe the exact behavior of how python compiles the necessary kernels for the original, the vectorize, and the improved guvectorize versions differs. If we modify the t15.py code to do a "warm-up" run, then at least the python timing becomes consistent, trend-wise, with the overall wall time and with the kernel-only timing:

$ cat t15.py
import numpy as np
from numba import guvectorize,vectorize
import time
from timeit import default_timer as timer


@vectorize(['float64(float64, float64)'], target='cuda')
def cVestDiscount (a, b):
    return a * b

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
multBy = np.full_like(discount, 1)
cv = np.empty_like(discount)
#warm-up run
cv = cVestDiscount(multBy, discount)
func_start = timer()
cv = cVestDiscount(multBy, discount)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ time python t14.py
Function: discount factor cumVest duration (seconds):1.24376320839

real    0m2.522s
user    0m1.572s
sys     0m0.809s
$ time python t15.py
Function: discount factor cumVest duration (seconds):0.0228319168091

real    0m1.050s
user    0m0.473s
sys     0m0.445s
$ time python t16.py
Function: discount factor cumVest duration (seconds):0.0665760040283

real    0m1.252s
user    0m0.680s
sys     0m0.441s
$

Now, responding to a question in the comments that effectively asks: "How would I rework the problem to slice along the 4000 (countCol, or 'middle') dimension?" We can take guidance from the approach that worked for slicing along the first dimension. One possible approach is to rearrange the shape of the arrays so that the 4000 dimension is the first dimension, and then remove it, similar to what we did in the previous guvectorize treatment. Here is a worked example:
$ cat t17.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(int64, float64[:], float64[:,:], int64, float64[:,:])'], '(),(o),(m,o),() -> (m,o)', target='cuda', nopython=True)
def cVestDiscount (countCol, multBy, discount, n, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[num] * discount[ID][num]

countRow = np.int64(100)
multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(4000,100,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(4000,100,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(countRow, multBy, discount, n, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
[bob@cluster2 python]$ python t17.py
Function: discount factor cumVest duration (seconds):0.0266749858856
$ nvprof --print-gpu-trace python t17.py
==8544== NVPROF is profiling process 8544, command: python t17.py
Function: discount factor cumVest duration (seconds):0.0268459320068
==8544== Profiling application: python t17.py
==8544== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput  SrcMemType  DstMemType           Device   Context    Stream  Name
304.92ms  1.1840us                    -               -         -         -         -        8B  6.4437MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
305.36ms  27.392us                    -               -         -         -         -  156.25KB  5.4400GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
306.08ms  6.0208ms                    -               -         -         -         -  15.259MB  2.4749GB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
312.44ms  1.0880us                    -               -         -         -         -        8B  7.0123MB/s    Pageable      Device  Quadro K2000 (0         1         7  [CUDA memcpy HtoD]
313.59ms  8.9961ms             (63 1 1)        (64 1 1)        63        0B        0B         -           -           -           -  Quadro K2000 (0         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=2, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>) [35]
322.59ms  7.2772ms                    -               -         -         -         -  15.259MB  2.0476GB/s      Device    Pageable  Quadro K2000 (0         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.
SrcMemType: The type of source memory accessed by memory operation/copy
DstMemType: The type of destination memory accessed by memory operation/copy
$

Fairly predictably, we observe that the execution time drops from ~47ms when we sliced into 100 workers down to ~9ms when we slice into 4000 workers. Similarly, we observe that numba chooses to spin up 63 blocks of 64 threads each, 4032 threads in total, to handle the 4000 workers needed for this "slicing".
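As a quick sanity check of those block counts (my own arithmetic, assuming numba's apparent default of 64 threads per block for guvectorize CUDA kernels):

import math

for workers in (100, 4000):
    blocks = math.ceil(workers / 64)
    print(workers, "workers ->", blocks, "blocks x 64 threads =", blocks * 64)
# 100 workers  -> 2 blocks  (128 threads),  matching the t16.py profile
# 4000 workers -> 63 blocks (4032 threads), matching the t17.py profile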
That is still not as fast as the ~1ms vectorize kernel (which has many more available parallel "slices" for workers to work on), but it is quite a bit faster than the ~1.2s kernel proposed in the original question. And the overall wall time of the python code is about 2x faster as well, even with all the python overhead included.
As a final observation, let's revisit the statement I made earlier (which is similar to statements made in the comments and in the other answer): "I doubt it would be possible to exceed the performance of well-written host code (e.g. using some parallelization method, such as guvectorize) to do the same thing."
We now have convenient test cases in t16.py and t17.py that we can use to test this. For simplicity I'll choose t16.py. We can "convert it back to CPU code" simply by removing the target designation from the guvectorize ufunc:
$ cat t16a.py
import numpy as np
from numba import guvectorize
import time
from timeit import default_timer as timer


@guvectorize(['void(float64[:,:], float64[:,:], int64, int64, float64[:,:])'], '(m,o),(m,o),(),() -> (m,o)')
def cVestDiscount (multBy, discount, n, countCol, cv):
        for ID in range(0,countCol):
            for num in range(0,n):
                cv[ID][num] = multBy[ID][num] * discount[ID][num]

multBy = np.float64(np.arange(20000).reshape(4000,5))
discount = np.float64(np.arange(2000000).reshape(100,4000,5))
n = np.int64(5)
countCol = np.int64(4000)
cv = np.zeros(shape=(100,4000,5), dtype=np.float64)
func_start = timer()
cv = cVestDiscount(multBy, discount, n, countCol, cv)
timing=timer()-func_start
print("Function: discount factor cumVest duration (seconds):" + str(timing))
$ time python t16a.py
Function: discount factor cumVest duration (seconds):0.00657796859741

real    0m0.528s
user    0m0.474s
sys     0m0.047s
$

So we can see that this CPU-only version runs the function in about 6 milliseconds, and it has none of the GPU "overhead" such as CUDA initialization and copying data to/from the GPU. The overall wall time is also our best measurement, at about 0.5s, versus about 1.0s for our best GPU case. So this particular problem, due to its low arithmetic intensity per byte of data transfer, is probably not well-suited to GPU computation.


Thank you so much for your answer, it really cleared up a lot of my confusion about Numba. When I run the vectorize code on my machine, I see my block size set to 64, where yours is 128 (for reference, my machine has a GTX 1080ti). Is there a limit, or is this perhaps something that can be fixed by changing an Nvidia setting or adding some code? Secondly, regarding guvectorize, I tried slicing the arrays along the second axis to bring the time down. - Bryce Booze
A block size of 64 vs. 128 should not be a major performance concern. It is not a card/hardware limit (the hardware limit is 1024). I wouldn't worry about it much. We are probably running different versions of numba. It is not a focus of my concern in any of this. - Robert Crovella
float64[:] while also removing the outermost for loop, but I got a ValueError: arg #1: outer dimension mismatch error. Do you think I am using the signature incorrectly by making it a one-dimensional array and removing the other for loop? - Bryce Booze
I eventually ran into the outer dimension mismatch issue as well. So I suspect that the simple decomposition along the lines you are currently attempting may not be workable with guvectorize. However, there is a clue in the fact that it does work when we remove only the countRow outer loop. So my approach was to reshape the array so that the 4000 dimension is the outer dimension, and then eliminate that one with guvectorize. See my edited answer, which contains what I think is a worked example (t17.py). - Robert Crovella
Thank you for all your help! I understand that cuda isn't the best option for this particular calculation, but I now have a much better understanding of these two decorators, which will definitely help in the future. - Bryce Booze

"

Profiling with Numba 0.38.1 and CUDA 8.0 makes it immediately apparent why the gufunc Numba emits and runs is so slow:

"
==24691== Profiling application: python slowvec.py
==24691== Profiling result:
   Start  Duration            Grid Size      Block Size     Regs*    SSMem*    DSMem*      Size  Throughput           Device   Context    Stream  Name
271.33ms  1.2800us                    -               -         -         -         -        8B  5.9605MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
271.65ms  14.591us                    -               -         -         -         -  156.25KB  10.213GB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
272.09ms  2.5868ms                    -               -         -         -         -  15.259MB  5.7605GB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
274.98ms     992ns                    -               -         -         -         -        8B  7.6909MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
275.17ms     640ns                    -               -         -         -         -        8B  11.921MB/s  GeForce GTX 970         1         7  [CUDA memcpy HtoD]
276.33ms  657.28ms              (1 1 1)        (64 1 1)        40        0B        0B         -           -  GeForce GTX 970         1         7  cudapy::__main__::__gufunc_cVestDiscount$242(Array<__int64, int=1, A, mutable, aligned>, Array<double, int=3, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<__int64, int=1, A, mutable, aligned>, Array<double, int=4, A, mutable, aligned>) [38]
933.62ms  3.5128ms                    -               -         -         -         -  15.259MB  4.2419GB/s  GeForce GTX 970         1         7  [CUDA memcpy DtoH]

Regs: Number of registers used per CUDA thread. This number includes registers used internally by the CUDA driver and/or tools and can be more than what the compiler shows.
SSMem: Static shared memory allocated per CUDA block.
DSMem: Dynamic shared memory allocated per CUDA block.

The resulting kernel launch runs the code with a single block of 64 threads. On a GPU which can theoretically run up to 2048 threads per MP, and which has 23 MPs, that means roughly 99.9% of your GPU's theoretical processing capacity is going unused. This looks like a ridiculous design choice on the part of the numba developers, and I would report it as a bug if it is hindering you (and it seems it is).
The obvious solution is to rewrite your function as a @cuda.jit function in the CUDA python kernel dialect and take explicit control of the execution parameters. That way you can at least ensure that the code is run with enough threads to potentially use all of the capacity of your hardware. It is still a very memory-bound operation, so what you can achieve in speed-up may well be limited to considerably less than the ratio of your GPU's memory bandwidth to your CPU's. Even in the best case, that might not be enough to amortize the cost of the host-to-device memory transfers, so there may be no performance gain at all.
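For reference, here is a minimal sketch of what such a @cuda.jit kernel with an explicit launch configuration might look like; this is my own untuned illustration using the array shapes from the question, not code from this answer:

import math
import numpy as np
from numba import cuda

@cuda.jit
def cvest_discount_kernel(multBy, discount, cv):
    # One thread per element of the flattened (100, 4000, 5) output array.
    idx = cuda.grid(1)
    total = cv.shape[0] * cv.shape[1] * cv.shape[2]
    if idx < total:
        i = idx // (cv.shape[1] * cv.shape[2])
        j = (idx // cv.shape[2]) % cv.shape[1]
        k = idx % cv.shape[2]
        cv[i, j, k] = multBy[j, k] * discount[i, j, k]

multBy = np.arange(20000, dtype=np.float64).reshape(4000, 5)
discount = np.arange(2000000, dtype=np.float64).reshape(100, 4000, 5)
cv = np.zeros_like(discount)

threads_per_block = 256
blocks = math.ceil(cv.size / threads_per_block)   # 7813 blocks for 2,000,000 elements
cvest_discount_kernel[blocks, threads_per_block](multBy, discount, cv)

Note that numba will still copy the host arrays to the device and back around the kernel launch, so the data-transfer cost discussed above is unchanged.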
In short, beware the perils of automatic compiler-generated parallelism...

As an addendum, I managed to work out how to get the PTX of the code numba emits, and suffice to say, it is absolutely terrible (and so long that I can't actually post all of it):

{
    .reg .pred  %p<9>;
    .reg .b32   %r<8>;
    .reg .f64   %fd<4>;
    .reg .b64   %rd<137>;


    ld.param.u64    %rd29, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_5];
    ld.param.u64    %rd31, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_11];
    ld.param.u64    %rd32, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_12];
    ld.param.u64    %rd34, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_14];
    ld.param.u64    %rd35, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_15];
    ld.param.u64    %rd36, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_16];
    ld.param.u64    %rd37, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_17];
    ld.param.u64    %rd38, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_22];
    ld.param.u64    %rd39, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_23];
    ld.param.u64    %rd40, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_24];
    ld.param.u64    %rd41, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_25];
    ld.param.u64    %rd42, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_26];
    ld.param.u64    %rd43, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_27];
    ld.param.u64    %rd44, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_28];
    ld.param.u64    %rd45, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_29];
    ld.param.u64    %rd46, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_30];
    ld.param.u64    %rd48, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_36];
    ld.param.u64    %rd51, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_43];
    ld.param.u64    %rd53, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_49];
    ld.param.u64    %rd54, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_50];
    ld.param.u64    %rd55, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_51];
    ld.param.u64    %rd56, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_52];
    ld.param.u64    %rd57, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_53];
    ld.param.u64    %rd58, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_54];
    ld.param.u64    %rd59, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_55];
    ld.param.u64    %rd60, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_56];
    ld.param.u64    %rd61, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_57];
    mov.u32     %r1, %tid.x;
    mov.u32     %r3, %ctaid.x;
    mov.u32     %r2, %ntid.x;
    mad.lo.s32  %r4, %r3, %r2, %r1;
    min.s64     %rd62, %rd32, %rd29;
    min.s64     %rd63, %rd39, %rd62;
    min.s64     %rd64, %rd48, %rd63;
    min.s64     %rd65, %rd51, %rd64;
    min.s64     %rd66, %rd54, %rd65;
    cvt.s64.s32 %rd1, %r4;
    setp.le.s64 %p2, %rd66, %rd1;
    @%p2 bra    BB0_8;

    ld.param.u64    %rd126, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_42];
    ld.param.u64    %rd125, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_44];
    ld.param.u64    %rd124, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_35];
    ld.param.u64    %rd123, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_37];
    ld.param.u64    %rd122, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_4];
    ld.param.u64    %rd121, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_6];
    cvt.u32.u64 %r5, %rd1;
    setp.lt.s32 %p1, %r5, 0;
    selp.b64    %rd67, %rd29, 0, %p1;
    add.s64     %rd68, %rd67, %rd1;
    mul.lo.s64  %rd69, %rd68, %rd121;
    add.s64     %rd70, %rd69, %rd122;
    selp.b64    %rd71, %rd48, 0, %p1;
    add.s64     %rd72, %rd71, %rd1;
    mul.lo.s64  %rd73, %rd72, %rd123;
    add.s64     %rd74, %rd73, %rd124;
    ld.u64  %rd2, [%rd74];
    selp.b64    %rd75, %rd51, 0, %p1;
    add.s64     %rd76, %rd75, %rd1;
    mul.lo.s64  %rd77, %rd76, %rd125;
    add.s64     %rd78, %rd77, %rd126;
    ld.u64  %rd3, [%rd78];
    ld.u64  %rd4, [%rd70];
    setp.lt.s64 %p3, %rd4, 1;
    @%p3 bra    BB0_8;

    ld.param.u64    %rd128, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_13];
    ld.param.u64    %rd127, [_ZN6cudapy8__main__26__gufunc_cVestDiscount$242E5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi3E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIxLi1E1A7mutable7alignedE5ArrayIdLi4E1A7mutable7alignedE_param_12];
    selp.b64    %rd80, %rd127, 0, %p1;
    mov.u64     %rd79, 0;
    min.s64     %rd81, %rd128, %rd79;
    min.s64     %rd82, %rd34, %rd79;
    selp.b64    %rd83, %rd39, 0, %p1;
    min.s64     %rd84, %rd40, %rd79;
    min.s64     %rd85, %rd41, %rd79;
    min.s64     %rd86, %rd42, %rd79;
    selp.b64    %rd87, %rd54, 0, %p1;
    min.s64     %rd88, %rd55, %rd79;
    min.s64     %rd89, %rd56, %rd79;
    min.s64     %rd90, %rd57, %rd79;
    mul.lo.s64  %rd91, %rd90, %rd61;
    add.s64     %rd92, %rd53, %rd91;
    mul.lo.s64  %rd93, %rd89, %rd60;
    add.s64     %rd94, %rd92, %rd93;
    mul.lo.s64  %rd95, %rd88, %rd59;
    add.s64     %rd96, %rd94, %rd95;
    add.s64     %rd98, %rd87, %rd1;
    mul.lo.s64  %rd99, %rd58, %rd98;
    add.s64     %rd5, %rd96, %rd99;
    mul.lo.s64  %rd100, %rd86, %rd46;
    add.s64     %rd101, %rd38, %rd100;
    mul.lo.s64  %rd102, %rd85, %rd45;
    add.s64     %rd103, %rd101, %rd102;
    mul.lo.s64  %rd104, %rd84, %rd44;
    add.s64     %rd105, %rd103, %rd104;
    add.s64     %rd106, %rd83, %rd1;
    mul.lo.s64  %rd107, %rd43, %rd106;
    add.s64     %rd6, %rd105, %rd107;
    mul.lo.s64  %rd108, %rd82, %rd37;
    add.s64     %rd109, %rd31, %rd108;
    mul.lo.s64  %rd110, %rd81, %rd36;
    add.s64     %rd111, %rd109, %rd110;
    add.s64     %rd112, %rd80, %rd1;
    mul.lo.s64  %rd113, %rd35, %rd112;
    add.s64     %rd7, %rd111, %rd113;
    add.s64     %rd8, %rd2, 1;
    mov.u64     %rd131, %rd79;

BB0_3:
    mul.lo.s64  %rd115, %rd59, %rd131;
    add.s64     %rd10, %rd5, %rd115;
    mul.lo.s64  %rd116, %rd44, %rd131;
    add.s64     %rd11, %rd6, %rd116;
    setp.lt.s64 %p4, %rd3, 1;
    mov.u64     %rd130, %rd79;
    mov.u64     %rd132, %rd3;
    @%p4 bra    BB0_7;

BB0_4:
    mov.u64     %rd13, %rd132;
    mov.u64     %rd12, %rd130;
    mul.lo.s64  %rd117, %rd60, %rd12;
    add.s64     %rd136, %rd10, %rd117;
    mul.lo.s64  %rd118, %rd45, %rd12;
    add.s64     %rd135, %rd11, %rd118;
    mul.lo.s64  %rd119, %rd36, %rd12;
    add.s64     %rd134, %rd7, %rd119;
    setp.lt.s64 %p5, %rd2, 1;
    mov.u64     %rd133, %rd8;
    @%p5 bra    BB0_6;

BB0_5:
    mov.u64     %rd17, %rd133;
    ld.f64  %fd1, [%rd135];
    ld.f64  %fd2, [%rd134];
    mul.f64     %fd3, %fd2, %fd1;
    st.f64  [%rd136], %fd3;
    add.s64     %rd136, %rd136, %rd61;
    add.s64     %rd135, %rd135, %rd46;
    add.s64     %rd134, %rd134, %rd37;
    add.s64     %rd24, %rd17, -1;
    setp.gt.s64 %p6, %rd24, 1;
    mov.u64     %rd133, %rd24;
    @%p6 bra    BB0_5;

BB0_6:
    add.s64     %rd25, %rd13, -1;
    add.s64     %rd26, %rd12, 1;
    setp.gt.s64 %p7, %rd13, 1;
    mov.u64     %rd130, %rd26;
    mov.u64     %rd132, %rd25;
    @%p7 bra    BB0_4;

BB0_7:
    sub.s64     %rd120, %rd4, %rd131;
    add.s64     %rd131, %rd131, 1;
    setp.gt.s64 %p8, %rd120, 1;
    @%p8 bra    BB0_3;

BB0_8:
    ret;
}

All of those integer operations to perform exactly one double-precision multiply!


Thanks for your reply, I've been experimenting with the @cuda.jit signature but ran into issues with the number of thread blocks. How did you get the grid size and block size in your first profiling example? Thanks again for the quick reply. - Bryce Booze
@BryceB If you run nvprof with the --print-gpu-trace option, you will see the same output. But if you write your own @cuda.jit kernel, you are responsible for setting the grid and block sizes yourself. - talonmies
