PyCUDA核函数

Question

PyCUDA核函数

3

我刚开始接触PyCUDA，并且正在查看PyCUDA网站上的一些示例。我试图找出某些代码行背后的逻辑，并希望有人能解释其中的思路。

以下代码片段来自PyCUDA网站。在函数定义中，我不理解

int idx = threadIdx.x + threadIdx.y*4;

上述代码行是用来计算数组索引的。为什么要将threadIdx.x和threadIdx.y相加，而且为什么要将threadIdx.y乘以4。

在GPU函数调用中，为什么块被定义为5,5,1。因为它是一个5x5元素的数组，所以按照我的理解，块大小应该是5,5而不是5,5,1。

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
a = numpy.random.randn(5,5)
a = a.astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
mod = SourceModule("""
__global__ void doubleMatrix(float *a)
{
int idx = threadIdx.x + threadIdx.y*4;
a[idx] *= 2;
}
""")
func = mod.get_function("doubleMatrix")
func(a_gpu, block=(5,5,1))
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print ("ORIGINAL MATRIX")
print a
print ("DOUBLED MATRIX AFTER PyCUDA EXECUTION")
print a_doubled

- user2808264

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- talonmies · Accepted Answer

你发布的示例似乎来自于（或被抄袭到）一本名为“Python并行编程食谱”的书，直到五分钟前我才听说过。老实说，如果我是那本书的作者，我会为包含如此粗糙、有缺陷的示例感到羞耻。

这里是对你发布内容的小修改及其输出：

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
a = numpy.random.randn(5,5)
a = a.astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
mod = SourceModule("""
__global__ void doubleMatrix(float *a)
{
     int idx = threadIdx.x + threadIdx.y*4;
     a[idx] *= 2.f;
}
""")
func = mod.get_function("doubleMatrix")
func(a_gpu, block=(5,5,1))
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled - 2.0*a

[警告：Python 2 语法]

In [2]: %run matdouble.py
[[ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.          0.          0.          0.          0.        ]
 [ 0.         -0.62060976  0.49836278 -1.60820103  1.71903515]]

即代码不像预期的那样工作，这可能是您困惑的根源。

正确处理存储在线性内存中的多维数组（如numpy数组）的方法在这个最近的答案中有描述。任何明智的程序员都会像您的示例中一样编写内核，如下所示：

__global__ void doubleMatrix(float *a, int lda)
{
     int idx = threadIdx.x + threadIdx.y * lda;
     a[idx] *= 2.f;
}

这样，数组的主维度将作为参数传递给内核（在本例中应该是5而不是4）。这样做会得到以下结果：

import pycuda.driver as cuda
import pycuda.autoinit
from pycuda.compiler import SourceModule
import numpy
a = numpy.random.randn(5,5)
a = a.astype(numpy.float32)
a_gpu = cuda.mem_alloc(a.nbytes)
cuda.memcpy_htod(a_gpu, a)
mod = SourceModule("""
__global__ void doubleMatrix(float *a, int lda)
{
     int idx = threadIdx.x + threadIdx.y * lda;
     a[idx] *= 2.f;
}
""")
func = mod.get_function("doubleMatrix")
lda = numpy.int32(a.shape[-1])
func(a_gpu, lda, block=(5,5,1))
a_doubled = numpy.empty_like(a)
cuda.memcpy_dtoh(a_doubled, a_gpu)
print a_doubled - 2.0*a

这会产生预期的结果：

In [3]: %run matdouble.py
[[ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.]]

在CUDA中，所有的块都隐式地有三个维度。块大小为(5,5)与块大小为(5,5,1)相同。最后一个维度可以忽略，因为它是1（即块中的所有线程都具有threadIdx.z = 1）。你不应该陷入将CUDA块或网格的维度与输入数组的维度混淆的陷阱。有时让它们相同很方便，但同样也没有必要甚至不建议这样做。对于这个例子，以BLAS风格编写的正确内核（假设行主存储顺序）可能是这样的：

__global__ void doubleMatrix(float *a, int m, int n, int lda)
{
     int col = threadIdx.x + blockIdx.x * blockDim.x;
     int row = threadIdx.y + blockDim.y * blockDim.y;

     for(; row < m; row += blockDim.y * gridDim.y) {
         for(; col < n; col += blockDim.x * gridDim.x) {
             int idx = col + row * lda;
             a[idx] *= 2.f;
         }
    }
}

[注意：此代码是在浏览器中编写，未被编译或测试]

在这里，任何合法的块和网格尺寸都可以正确处理任意大小的输入数组元素总数，使其适应于带符号32位整数。如果运行太多线程，则一些线程将不起作用。如果运行的线程太少，则某些线程将处理多个数组元素。如果您运行与输入数组具有相同维度的网格，则每个线程将恰好处理一个输入，正如您所学习的示例的意图一样。如果您想了解如何选择最合适的块和网格尺寸，请从这里开始阅读。