C++ and CUDA: why does the code return different results every time?

Question

5

Update: I found the bug. The code I originally posted was very complicated, so I have simplified it and kept only the part where the problem occurs.

if (number >= dim * num_points)
    return;

But I actually have only num_points data points, and I want to use num_points threads, so the correct check is:

if (number >= num_points)
    return;

Thanks, everyone, for the help.

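I am porting some C++ code from the CPU to the GPU. The code is below. Sorry that it is long, but I think it is easier to spot the problem this way.

Each thread needs some intermediate results in matrix form, so I allocate device memory for them: d_dir2, d_R, d_Stick and d_PStick. The results were not what I expected, so to debug I tried to output one of the intermediate results, R, like this: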

if (k == 0)
 {
 results[tmp_int1 + i * dim + j] = R[tmp_int1 + i * dim + j];
 }

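Then, on the C++ side, I print results.

However, results holds different values on every run. Sometimes it contains the correct values of R, sometimes the values of PStick, sometimes a mix of R and PStick, and sometimes a mix of R and 0 (results is initialized to 0 at the start).

I am very confused about what is causing this. Any ideas? Thanks a lot :)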

__global__ void stickvote(const int dim, const int num_points, const int gridx, float Sigma, float* input, float* dir2, float* R, float* Stick, float* PStick, float* results) {
  float threshold = 4 * Sigma;
  float c = (- 16 * log(0.1f) * (sqrt(Sigma) - 1)) / 3.1415926f / 3.1415926f;

  int row = blockIdx.y * blockDim.y + threadIdx.y;
  int col = blockIdx.x * blockDim.x + threadIdx.x;
  int number = row * BLOCK_SIZE * gridx + col;

  if (number >= dim * num_points)  //// The bug is here!
    return;
}


extern "C" void KernelStickVote(int dim, int num_points, float Sigma, float* input, float* results) {
  const int totalpoints = num_points;
  const int totalpoints_input = (dim + 1)* (dim + 1) * num_points;
  const int totalpoints_output = dim * dim * num_points;
  size_t size_input = totalpoints_input * sizeof(float);
  size_t size_output = totalpoints_output * sizeof(float);

  float* d_input;
  cutilSafeCall(cudaMalloc((void**)&d_input, size_input));

  float* d_result;
  cutilSafeCall(cudaMalloc((void**)&d_result, size_output));

  // used to save dir, and calculate dir * dir'
  float* d_dir2;
  cutilSafeCall(cudaMalloc((void**)&d_dir2, dim * num_points * sizeof(float)));

  // used to save R: dim * dim * N
  float* d_R;
  cutilSafeCall(cudaMalloc((void**)&d_R, size_output));

  // used to save Stick: dim * dim * N
  float* d_Stick;
  cutilSafeCall(cudaMalloc((void**)&d_Stick, size_output));

  // used to save PStick: dim * dim * N
  float* d_PStick;
  cutilSafeCall(cudaMalloc((void**)&d_PStick, size_output));

  // Copy input data from host to device
  cudaMemcpy(d_input, input, size_input, cudaMemcpyHostToDevice);

  int totalblock = (totalpoints % BLOCKPOINTS==0 ? totalpoints/BLOCKPOINTS : (int(totalpoints/BLOCKPOINTS) + 1));
  int gridx = (65535 < totalblock ? 65535 : totalblock);
  int gridy = (totalblock % gridx == 0 ? totalblock/gridx : (int(totalblock/gridx)+1) );
  dim3 dimBlock(BLOCK_SIZE, BLOCK_SIZE);
  dim3 dimGrid(gridx, gridy);

  stickvote<<<dimGrid, dimBlock>>>(dim, num_points, gridx, Sigma, d_input, d_dir2, d_R, d_Stick, d_PStick, d_result);
  cudaMemcpy(results, d_result, size_output, cudaMemcpyDeviceToHost);

  cudaFree(d_input);
  cudaFree(d_result);
  cudaFree(d_dir2);
  cudaFree(d_R);
  cudaFree(d_Stick);
  cudaFree(d_PStick);
}

- user1834981
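The grid sizing in KernelStickVote rounds the block count up, so the launch generally creates more threads than there are data points, which is why the kernel needs a guard at all. Below is a minimal host-side sketch of that arithmetic. The values of BLOCK_SIZE and BLOCKPOINTS are assumptions (the post does not show how the macros are defined), and LaunchShape and launch_shape are hypothetical names introduced only for illustration.

```cpp
#include <algorithm>

// Hypothetical values: the post does not show how these macros are defined.
constexpr int BLOCK_SIZE = 16;                        // threads per block edge (assumed)
constexpr int BLOCKPOINTS = BLOCK_SIZE * BLOCK_SIZE;  // threads per block (assumed)

struct LaunchShape {
    int gridx;          // blocks along x
    int gridy;          // blocks along y
    long long threads;  // total threads the launch would create
};

// Same arithmetic as KernelStickVote, written as ceiling divisions.
// Assumes num_points > 0, as the posted code implicitly does.
LaunchShape launch_shape(int num_points) {
    const int totalblock = (num_points + BLOCKPOINTS - 1) / BLOCKPOINTS;
    const int gridx = std::min(65535, totalblock);
    const int gridy = (totalblock + gridx - 1) / gridx;
    return {gridx, gridy, 1LL * gridx * gridy * BLOCKPOINTS};
}
```

Under these assumed block dimensions, launch_shape(1000) gives a 4 x 1 grid with 1024 threads, so 24 threads have no data point of their own and must exit at the guard.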

3

Why do you think it is easier to detect the problem in this pile of code (nice as it looks, it is a lot) than in a nice SSCCE? - leftaroundabout

1

Your API error checking is incomplete. Are you sure the kernel is actually running? - talonmies

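@leftaroundabout: Thanks for your comment. I understand that it would be clearer in a concise form, but I am not sure which parts are safe, so if I cut something I might remove the real problem. Sorry for the trouble. - user1834981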

1

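Please check the return values of all API calls, especially the cudaMemcpy after the kernel launch, but all the others as well. As @leftaroundabout said, a shorter example would help; it should at least compile and run. - Tom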

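I checked again and found that the cudaMemcpy I used for dir2 was wrong. After fixing it, all the intermediate results can come out correct, but they are still sometimes right and sometimes wrong. Still confused. - user1834981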

1 Answer

- talonmies · Accepted Answer

The original poster simplified and debugged the code on their own and found that the guard statement in the kernel:

if (number >= dim * num_points)
    return;

was in fact incorrect, and should have been:

if (number >= num_points)
    return;

This was the source of the error.
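Why the wrong guard can corrupt results is easier to see with a small host-side count. The sketch below rests on an assumption the posted snippet suggests but does not show: that thread number writes the dim * dim block starting at element offset number * dim * dim of a buffer holding dim * dim * num_points floats (the size used for d_R, d_Stick and d_PStick). The function name out_of_bounds_threads is hypothetical.

```cpp
#include <cstddef>

// Counts the threads that pass a guard of the form `number >= limit` and
// would then write past the end of a buffer of dim*dim*num_points floats.
// Assumption (not shown in the post): thread `number` writes the dim*dim
// block that starts at element offset number*dim*dim.
std::size_t out_of_bounds_threads(std::size_t launched, std::size_t limit,
                                  std::size_t dim, std::size_t num_points) {
    const std::size_t buffer_elems = dim * dim * num_points;
    std::size_t bad = 0;
    for (std::size_t number = 0; number < launched; ++number) {
        if (number >= limit) continue;  // the guard under test: thread returns
        const std::size_t end = (number + 1) * dim * dim;  // one past its last element
        if (end > buffer_elems) ++bad;
    }
    return bad;
}
```

With dim = 3, num_points = 1000 and 1024 launched threads, the original limit dim * num_points = 3000 lets 24 threads write past the end of the buffer, while the corrected limit num_points = 1000 lets none. Out-of-bounds writes of this kind are undefined behavior and may land in neighboring allocations, which would be consistent with results sometimes showing values from PStick, though the post itself does not confirm that mechanism.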

This answer has been added as a community wiki entry to get this question off the unanswered queue.