New CUDA texture object: getting wrong data in the 2D case

Question

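In CUDA 5.0, NVIDIA added "texture objects" (cudaTextureObject_t) to make textures easier to use. Previously, textures had to be defined as global variables.

I followed NVIDIA's example for using cudaTextureObject_t. It works correctly in the 1D case. I then tried to extend the example to 2D pitched memory.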

#define WIDTH 6
#define HEIGHT 2
int width = WIDTH; int height = HEIGHT;
float h_buffer[12] = {1,2,3,4,5,6,7,8,9,10,11,12};
float* d_buffer;
size_t pitch;
cudaMallocPitch(&d_buffer, &pitch, sizeof(float)*width, height);
cudaMemcpy2D(d_buffer, pitch, &h_buffer, sizeof(float)*width, sizeof(float)*width, height, cudaMemcpyHostToDevice);
printf("pitch = %zu \n", pitch); // pitch is a size_t, so use %zu

//CUDA 5 texture objects: https://developer.nvidia.com/content/cuda-pro-tip-kepler-texture-objects-improve-performance-and-flexibility
cudaResourceDesc resDesc;
memset(&resDesc, 0, sizeof(resDesc));
resDesc.resType = cudaResourceTypePitch2D;
resDesc.res.pitch2D.devPtr = d_buffer;
resDesc.res.pitch2D.pitchInBytes =  pitch;
resDesc.res.pitch2D.width = width;
resDesc.res.pitch2D.height = height;
resDesc.res.pitch2D.desc.f = cudaChannelFormatKindFloat;
resDesc.res.pitch2D.desc.x = 32; // bits per channel 
resDesc.res.pitch2D.desc.y = 32; 
cudaTextureDesc texDesc;
memset(&texDesc, 0, sizeof(texDesc));
texDesc.readMode = cudaReadModeElementType;
cudaTextureObject_t tex;
cudaCreateTextureObject(&tex, &resDesc, &texDesc, NULL);

To check whether the data is actually accessible through the texture cache, I print a few elements in this kernel:

__global__ void printGpu_tex(cudaTextureObject_t tex) {
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    int tidy = blockIdx.y * blockDim.y + threadIdx.y;
    if(tidx < WIDTH && tidy < HEIGHT){
        float x = tex2D<float>(tex, tidy, tidx);
        printf("tex2D<float>(tex, %d, %d) = %f \n", tidy, tidx, x);
    }
}

I expected the output to be "1, 2, 3, ..., 12", but what actually gets printed is "1, 7, 7, 7, ..., 3, 9, ...":

tex2D<float>(tex, 0, 0) = 1.000000 
tex2D<float>(tex, 0, 1) = 7.000000 
tex2D<float>(tex, 0, 2) = 7.000000 
tex2D<float>(tex, 0, 3) = 7.000000 
tex2D<float>(tex, 0, 4) = 7.000000 
tex2D<float>(tex, 0, 5) = 7.000000 
tex2D<float>(tex, 1, 0) = 3.000000 
tex2D<float>(tex, 1, 1) = 9.000000 
tex2D<float>(tex, 1, 2) = 9.000000 
tex2D<float>(tex, 1, 3) = 9.000000 
tex2D<float>(tex, 1, 4) = 9.000000 
tex2D<float>(tex, 1, 5) = 9.000000

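To verify that the data in d_buffer is set up correctly, I also wrote a "print kernel" that reads the raw d_buffer array without going through the texture cache.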

__global__ void printGpu_vanilla(float* d_buffer, int pitch) {
    int tidx = blockIdx.x * blockDim.x + threadIdx.x;
    int tidy = blockIdx.y * blockDim.y + threadIdx.y;
    if(tidx < WIDTH && tidy < HEIGHT){
        float x = d_buffer[tidy*pitch + tidx];
        printf("d_buffer[%d][%d] = %f \n", tidy, tidx, x);
    }
}

The output looks fine (unlike the texture cache version):

d_buffer[0][0] = 1.000000
d_buffer[0][1] = 2.000000
d_buffer[0][2] = 3.000000
d_buffer[0][3] = 4.000000
d_buffer[0][4] = 5.000000
d_buffer[0][5] = 6.000000
d_buffer[1][0] = 7.000000
d_buffer[1][1] = 8.000000
d_buffer[1][2] = 9.000000
d_buffer[1][3] = 10.000000
d_buffer[1][4] = 11.000000
d_buffer[1][5] = 12.000000
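
One detail worth double-checking in printGpu_vanilla: cudaMallocPitch reports the pitch in bytes, so indexing a float* with tidy*pitch + tidx is only correct if the caller passes the pitch already divided by sizeof(float). The question does not show the kernel launch, so which of the two was done is unknown. The host-only C++ sketch below imitates a pitched allocation with an ordinary byte buffer and applies the row offset in bytes, which mirrors the addressing pattern given in the CUDA documentation for cudaMallocPitch. The helper names make_pitched and load_pitched and the 512-byte pitch are illustrative assumptions, not part of the original code.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Hypothetical helper: copy a tightly packed width x height float image into
// a byte buffer whose rows start every pitch_bytes bytes (like cudaMemcpy2D).
inline std::vector<unsigned char> make_pitched(const float* src, int width,
                                               int height, std::size_t pitch_bytes) {
    assert(pitch_bytes >= width * sizeof(float));
    std::vector<unsigned char> buf(pitch_bytes * height, 0);
    for (int r = 0; r < height; ++r)
        std::memcpy(buf.data() + r * pitch_bytes, src + r * width,
                    width * sizeof(float));
    return buf;
}

// Hypothetical helper: read element (row, col). The pitch is in BYTES, so the
// row offset is applied to a byte pointer before the column offset is added.
// Device-side equivalent of the same idea:
//   float* row_ptr = (float*)((char*)d_buffer + tidy * pitch);
//   float  x       = row_ptr[tidx];
inline float load_pitched(const unsigned char* base, std::size_t pitch_bytes,
                          int row, int col) {
    float v;
    std::memcpy(&v,
                base + static_cast<std::size_t>(row) * pitch_bytes
                     + static_cast<std::size_t>(col) * sizeof(float),
                sizeof v);
    return v;
}
```

With the 12 values from the question and an assumed pitch of 512 bytes, load_pitched(buf.data(), 512, 1, 0) reads 7, the first element of the second row.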

Any ideas about what is going wrong in the texture cache version?

- solvingPuzzles

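My guess is that part of the problem lies in the difference between cudaMallocPitch and cudaMallocArray. With the old texture API, cudaMallocArray was the typical approach. However, cudaMallocArray requires a cudaChannelFormatDesc, which seems to be obsolete in the new cudaTextureObject_t interface. - solvingPuzzles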

2 Answers

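Apart from the cudaChannelFormatDesc, there appears to be a logic issue in your code. It is not a serious problem, but it can be very misleading if you are not careful. If you want to follow the way CUDA organizes threads into blocks and grids and schedules them in warps (and if you want your code to stay consistent with the C++ notion of "row-major" order), it is best to treat x as the fastest-varying dimension, as in row-major order. Since your code has y varying faster than x, it would be more appropriate to swap the indices: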

float x = tex2D<float>(tex, tidx, tidy);
printf("tex2D<float>(tex, %d, %d) = %f \n", tidx, tidy, x);
...
printf("d_buffer[%d][%d] = %f \n", tidx, tidy, x);

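It is worth repeating that this is not a big problem, but it can be very confusing, especially when you want to integrate this kernel with other parts of your code.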
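
To make the index-order point concrete, here is a small host-only C++ sketch. In tex2D<float>(tex, x, y) the first coordinate is x (the column) and the second is y (the row), which matches a row-major layout where x varies fastest. The helper names row_major_index, read_x_fastest and read_y_fastest are illustrative assumptions, not part of the original code.

```cpp
#include <cassert>
#include <vector>

// Row-major layout: x (column) is the fastest-varying index.
inline int row_major_index(int x, int y, int width) { return y * width + x; }

// Visit a width x height row-major buffer with x in the inner loop.
inline std::vector<float> read_x_fastest(const float* data, int width, int height) {
    assert(width > 0 && height > 0);
    std::vector<float> out;
    for (int y = 0; y < height; ++y)
        for (int x = 0; x < width; ++x)
            out.push_back(data[row_major_index(x, y, width)]);
    return out;
}

// Visit the same buffer with y in the inner loop (column by column).
inline std::vector<float> read_y_fastest(const float* data, int width, int height) {
    assert(width > 0 && height > 0);
    std::vector<float> out;
    for (int x = 0; x < width; ++x)
        for (int y = 0; y < height; ++y)
            out.push_back(data[row_major_index(x, y, width)]);
    return out;
}
```

For the 6 x 2 buffer from the question, the x-fastest traversal yields 1, 2, 3, ..., 12, while the y-fastest traversal starts 1, 7, 2, 8, which is why mixing up the two coordinates is easy to misread as corrupted data.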

- Mohsen

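Downvoting without giving a reason is bad behavior! You could at least leave a comment pointing out the mistake. That way, people would know what is wrong (knowing what the problem is is part of science). - Mohsen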

Source: Stack Overflow.

- kunzmi · Accepted Answer

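The cudaChannelFormatDesc in resDesc.res.pitch2D.desc is wrong: y should be set to 0.

To set the format descriptor correctly, use the cudaCreateChannelDesc<>() function, e.g. resDesc.res.pitch2D.desc = cudaCreateChannelDesc<float>();, instead of filling it in manually.

resDesc.res.pitch2D.desc.y = 32 would be valid for a float2 texture.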
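
The effect of the extra 32 bits can be illustrated with a simplified host-side model in C++. It is only a sketch under stated assumptions, not a description of what the hardware actually does: each texel is taken to occupy (x + y + z + w) / 8 bytes, out-of-range coordinates are clamped (the documented fallback for unnormalized coordinates), and the fetch returns the first channel. The names ChannelBits, texel_bytes, make_pitched and fetch_first_channel are hypothetical, and the 512-byte pitch is an assumed value.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Simplified stand-in for cudaChannelFormatDesc: bits per channel.
struct ChannelBits { int x, y, z, w; };

// Bytes per texel implied by the descriptor (assumption of this model).
inline std::size_t texel_bytes(ChannelBits d) {
    return static_cast<std::size_t>(d.x + d.y + d.z + d.w) / 8;
}

// Hypothetical helper: lay out packed float rows with a byte pitch.
inline std::vector<unsigned char> make_pitched(const float* src, int width,
                                               int height, std::size_t pitch_bytes) {
    assert(pitch_bytes >= width * sizeof(float));
    std::vector<unsigned char> buf(pitch_bytes * height, 0);
    for (int r = 0; r < height; ++r)
        std::memcpy(buf.data() + r * pitch_bytes, src + r * width,
                    width * sizeof(float));
    return buf;
}

// Model of an unnormalized, point-sampled, clamped 2D fetch that returns the
// first channel of texel (x, y). x selects the column, y selects the row.
inline float fetch_first_channel(const unsigned char* base, std::size_t pitch_bytes,
                                 int width, int height, ChannelBits desc,
                                 int x, int y) {
    x = std::max(0, std::min(x, width - 1));
    y = std::max(0, std::min(y, height - 1));
    std::size_t offset = static_cast<std::size_t>(y) * pitch_bytes
                       + static_cast<std::size_t>(x) * texel_bytes(desc);
    assert(offset + sizeof(float) <= pitch_bytes * height);
    float v;
    std::memcpy(&v, base + offset, sizeof v);
    return v;
}
```

With ChannelBits{32, 32, 0, 0}, as in the question, every texel is 8 bytes wide, so texel 1 of the first row starts at the third float and the model returns 3 for (x = 1, y = 0) and 9 for (x = 1, y = 1). Together with the swapped coordinates in the kernel, which make every tidx >= 1 clamp to the second row, this reproduces the 1, 7, 7, ..., 3, 9, ... pattern printed in the question. With ChannelBits{32, 0, 0, 0}, which is what cudaCreateChannelDesc<float>() describes, the same model returns 2 for (x = 1, y = 0) and 12 for (x = 5, y = 1).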