I am trying to understand how this sample code from CUDA SDK 8.0 works:
template <int BLOCK_SIZE> __global__ void
matrixMulCUDA(float *C, float *A, float *B, int wA, int wB)
{
    // Block index
    int bx = blockIdx.x;
    int by = blockIdx.y;

    // Thread index
    int tx = threadIdx.x;
    int ty = threadIdx.y;

    // Index of the first sub-matrix of A processed by the block
    int aBegin = wA * BLOCK_SIZE * by;

    // Index of the last sub-matrix of A processed by the block
    int aEnd = aBegin + wA - 1;

    // Step size used to iterate through the sub-matrices of A
    int aStep = BLOCK_SIZE;

    // Index of the first sub-matrix of B processed by the block
    int bBegin = BLOCK_SIZE * bx;

    // Step size used to iterate through the sub-matrices of B
    int bStep = BLOCK_SIZE * wB;
    ....
    ....
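This part of the kernel is quite tricky for me. I know that the matrices A and B are stored as flat arrays (float *), and I also understand the idea of using shared memory to compute the dot products.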
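My problem is that I do not understand the beginning of the code, specifically three variables (aBegin, aEnd and bBegin). Could someone give me an example diagram of a possible execution, to help me understand how the indexing works in this particular case? Thank you.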
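
To make the question concrete, here is a small host-side sketch (plain C++, no GPU needed) that repeats the index arithmetic from the kernel for one block and prints where each tile starts. The names `TileIndices`, `tileIndices` and `printTileWalk` are hypothetical helpers made up for this illustration. The loop over tiles is an assumption about what the elided part of the kernel does with these values, and the sketch also assumes row-major storage and that wA and wB are multiples of the block size.

```cpp
#include <cassert>
#include <cstdio>

// Hypothetical container for the five values computed at the top of the kernel.
struct TileIndices {
    int aBegin;  // flat index in A of the first tile used by this block
    int aEnd;    // flat index in A of the last element of that tile row's first line
    int aStep;   // distance between consecutive A tiles (moving right along a row)
    int bBegin;  // flat index in B of the first tile used by this block
    int bStep;   // distance between consecutive B tiles (moving down a column)
};

// Same arithmetic as in the kernel, with blockIdx.x / blockIdx.y passed in as bx / by.
TileIndices tileIndices(int blockSize, int bx, int by, int wA, int wB)
{
    // Assumption: matrix widths are multiples of the block size.
    assert(blockSize > 0 && wA % blockSize == 0 && wB % blockSize == 0);
    TileIndices t;
    t.aBegin = wA * blockSize * by;   // skip `by` bands of blockSize full rows of A
    t.aEnd   = t.aBegin + wA - 1;     // last element of the first row in that band
    t.aStep  = blockSize;             // next A tile is blockSize columns to the right
    t.bBegin = blockSize * bx;        // skip `bx` tiles along the first row of B
    t.bStep  = blockSize * wB;        // next B tile is blockSize rows further down
    return t;
}

// Prints the top-left element of every (A tile, B tile) pair a block would visit,
// assuming the elided kernel code walks the tiles with a loop of this shape.
void printTileWalk(int blockSize, int bx, int by, int wA, int wB)
{
    TileIndices t = tileIndices(blockSize, bx, by, wA, wB);
    std::printf("block (bx=%d, by=%d): aBegin=%d aEnd=%d bBegin=%d\n",
                bx, by, t.aBegin, t.aEnd, t.bBegin);
    for (int a = t.aBegin, b = t.bBegin; a <= t.aEnd; a += t.aStep, b += t.bStep) {
        std::printf("  A tile at A[%d] (row %d, col %d), B tile at B[%d] (row %d, col %d)\n",
                    a, a / wA, a % wA, b, b / wB, b % wB);
    }
}
```

For example, with a block size of 2, A of size 4x6 (wA = 6) and B of size 6x4 (wB = 4), calling `printTileWalk(2, 1, 1, 6, 4)` from a `main` function gives aBegin = 12, aEnd = 17 and bBegin = 2 for block (bx=1, by=1). The tile pairs then start at A[12], A[14], A[16] (row 2, columns 0, 2, 4) and B[2], B[10], B[18] (rows 0, 2, 4, column 2). In other words, the block moves left to right along one band of rows in A and top to bottom along one band of columns in B.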