Passing a struct to a CUDA kernel

Question

17

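I'm new to CUDA C and am trying to pass a typedef'd struct into a kernel. My approach worked fine when the struct contained only ints, but when I switched to floats I got nonsense numbers back. I assumed this had to do with alignment, and I tried adding __align__ to the type declaration, without success. Can someone show me an example of how this is done, or suggest an alternative approach? I'm trying to set things up so that I can easily add or remove fields without changing anything other than the struct and the kernel. My code: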

typedef struct __align__(8)
{
    float a, b;
} point;

__global__ void testKernel(point *p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    p[i].a = 1.1;
    p[i].b = 2.2;
}

int main(void)
{
        // set number of points 
    int numPoints    = 16,
        gpuBlockSize = 4,
        pointSize    = sizeof(point),
        numBytes     = numPoints * pointSize,
        gpuGridSize  = numPoints / gpuBlockSize;

        // allocate memory
    point *cpuPointArray = new point[numPoints],
          *gpuPointArray = new point[numPoints];
    cpuPointArray = (point*)malloc(numBytes);
    cudaMalloc((void**)&gpuPointArray, numBytes);

        // launch kernel
    testKernel<<<gpuGridSize,gpuBlockSize>>>(gpuPointArray);

        // retrieve the results
    cudaMemcpy(cpuPointArray, gpuPointArray, numBytes, cudaMemcpyDeviceToHost);
    printf("testKernel results:\n");
    for(int i = 0; i < numPoints; ++i)
    {
        printf("point.a: %d, point.b: %d\n",cpuPointArray[i].a,cpuPointArray[i].b);
    }

        // deallocate memory
    free(cpuPointArray);
    cudaFree(gpuPointArray);

    return 0;
}

- Paul

The *gpuPointArray = new... part looks wrong: you allocate memory on the host and then call cudaMalloc for the device on the same pointer. - Bart

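Don't I need to allocate memory before passing it to the kernel as an argument? If I leave out the cudaMalloc line, I get an "unspecified launch failure" error. I can also set gpuPointArray to NULL, but that doesn't seem to change my original results. - Paul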

4

Of course you need the cudaMalloc. You don't need the "new" before it, though. The same goes for cpuPointArray. Use malloc and free (you're writing a C program) and don't use new here (never mix new, malloc, delete, and free). - Bart

2 Answers

6

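Have a look at the vector_types.h header in your CUDA include directory; that should already give you some pointers.

The main problem here, however, is the %d in your printf calls. You are printing floats, not integers, so those should be %f.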

- Bart

OK, I've looked at vector_types.h and tried doing it their way: typedef struct __align__(2*sizeof(float)) point { ..., but I still get the same results. Is there anything else here I should be noticing? - Paul

4

By the way, change the %d in your printf to %f... does that make a difference? You are printing floats, not integers... - Bart

This answer would have been better as a comment. - Stephan Dollberg

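Thanks for the heads-up, @bamboon. This is an old question. It seems I never folded the actual answer from the comments into the answer itself. That's done now, which should make it more of an answer. - Bart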

Content provided by Stack Overflow.

- Paul · Accepted Answer

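Since there doesn't seem to be any decent documentation on how to do this, I thought I'd post the final, revised code here. It turns out the __align__ part was unnecessary as well; the actual problem was using %d in printf when trying to print floats.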

#include <stdlib.h>
#include <stdio.h>

typedef struct
{
    float a, b;
} point;

__global__ void testKernel(point *p)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    p[i].a = 1.1;
    p[i].b = 2.2;
}

int main(void)
{
        // set number of points 
    int numPoints    = 16,
        gpuBlockSize = 4,
        pointSize    = sizeof(point),
        numBytes     = numPoints * pointSize,
        gpuGridSize  = numPoints / gpuBlockSize;

        // allocate memory
    point *cpuPointArray,
          *gpuPointArray;
    cpuPointArray = (point*)malloc(numBytes);
    cudaMalloc((void**)&gpuPointArray, numBytes);

        // launch kernel
    testKernel<<<gpuGridSize,gpuBlockSize>>>(gpuPointArray);

        // retrieve the results
    cudaMemcpy(cpuPointArray, gpuPointArray, numBytes, cudaMemcpyDeviceToHost);
    printf("testKernel results:\n");
    for(int i = 0; i < numPoints; ++i)
    {
        printf("point.a: %f, point.b: %f\n",cpuPointArray[i].a,cpuPointArray[i].b);
    }

        // deallocate memory
    free(cpuPointArray);
    cudaFree(gpuPointArray);

    return 0;
}