CUDA块同步化

Question

CUDA块同步化

38

我有b个块，每个块有t个线程。

我可以使用

 __syncthreads()

同步处于特定块中的线程。例如：

__global__ void aFunction()
{
    for(i=0;i<10;i++)
    {
       //execute something
        __syncthreads();
    }
}

但我的问题是如何在所有块的所有线程中同步。我该怎么做？

- user570593

2个回答

3

合作组有一些要求，比如需要通过cudaLaunchCooperativeKernel来启动您的内核。这使得它对于简单项目来说不是一个好的解决方案。

一个简单的替代方法是使用位域的原子操作，像这样：

// A global var with 64 bits can track 64 blocks, 
// use an array if you need to track more blocks
__device__ uint64_t CompleteMask; 

//This is where we put in all the smarts
//from the CPU reference solver
__global__ void doWork() {
    atomicAnd(&CompleteMask, 0);
    //do lots of work

    const auto SollMask = (1 << gridDim.x) - 1;
    if (ThreadId() == 0) {
        while ((atomicOr(&CompleteMask, 1ULL << blockIdx.x)) != SollMask) { /*do nothing*/ }
    }
    if (ThreadId() == 0 && 0 == blockIdx.x) {
        printf("Print a single line for the entire process")
    }
}

因为每个块都被分配了自己在掩码中的位，它们永远不会相互干扰。如果你有超过64个块，可以使用一个数组来跟踪位，并使用atomicAdd来跟踪计数，像这样：

// A global var with 64 bits can track 64 blocks, 
// use an array if you need to track more blocks
__device__ int CompleteMask[2];
__device__ int CompleteSuperMask;

__global__ void doWork() {
    for (auto i = 0; i < 2; i++) { atomicAnd(&CompleteMask[i], 0); }
    atomicAnd(&CompleteSuperMask, 0);
    //do lots of work

    int SollMask[3];
    SollMask[0] = -1;
    SollMask[1] = (1 << (gridDim.x % 32)) - 1;
    SollMask[2] = (1 << (gridDim.x / 32)) - 1;

    const auto b = blockIdx.x / 32;
    while (atomicOr(&CompleteMask[b], (1U << (blockIdx.x % 32))) != SollMask[b]) { /*do nothing*/ }

    while (atomicOr(&CompleteSuperMask, (1U << b)) != SollMask[2]) { /*do nothing*/ }
    if (threadIdx.x == 0 && blockIdx.x == 0) {
        printf("Print a single line for the entire process");
    }
}

当块的数量少于或等于SM的数量时，它的工作效果相当不错。然而，增加块的数量会导致执行过程冻结。这是解决方案的一个限制。

- Johan

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- CygnusX1 · Accepted Answer

在CUDA 9中，NVIDIA引入了"协作组"的概念，允许您同步属于该组的所有线程。这样的组可以跨越网格中的所有线程。这样，您将能够同步所有块中的所有线程：

#include <cuda_runtime_api.h> 
#include <cuda.h> 
#include <cooperative_groups.h>

cooperative_groups::grid_group g = cooperative_groups::this_grid(); 
g.sync();

你需要一台带有Pascal架构（计算能力60）或更新架构的设备来同步网格。此外，还有更具体的要求，请参见: https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#grid-synchronization-cg 基本功能，例如将小于线程块的组同步到warp粒度以下，支持所有架构，而Pascal和Volta GPU可以启用新的全局和多GPU同步组。

来源: https://devblogs.nvidia.com/parallelforall/cuda-9-features-revealed/

在CUDA 9之前，没有本地的方法可以同步所有块中的所有线程。事实上，在CUDA中，一些块可能仅在某些其他块已经结束工作后启动，例如，如果运行它的GPU太弱以至于无法并行处理它们。

如果您确保不生成过多的块，则可以尝试使所有块之间互相同步，例如通过使用原子操作主动等待。然而，这种方法速度较慢，会占用您的GPU内存控制器，被视为“hack”应该避免。

因此，如果您不针对Pascal（或更新）架构，则我建议的最佳方法是在同步点简单地终止您的内核，然后启动一个新的内核来继续您的工作。在大多数情况下，实际上它的表现会更快（或至少与所提到的hack的速度相似）。