Is launching one thread per element always optimal on the GPU for data-independent problems?


I am writing a simple memcpy kernel to measure the memory bandwidth of my GTX 760M and to compare it against cudaMemcpy(). It looks like this:

template<unsigned int THREADS_PER_BLOCK>
__global__ static
void copy(void* src, void* dest, unsigned int size) {
    using vector_type = int2;
    vector_type* src2 = reinterpret_cast<vector_type*>(src);
    vector_type* dest2 = reinterpret_cast<vector_type*>(dest);

    //This copy kernel is only correct when size%sizeof(vector_type)==0
    auto numElements = size / sizeof(vector_type);

    for(auto id = THREADS_PER_BLOCK * blockIdx.x + threadIdx.x; id < numElements ; id += gridDim.x * THREADS_PER_BLOCK){
        dest2[id] = src2[id];
    }
}

I also calculated the number of blocks needed to reach 100% occupancy, as follows:
THREADS_PER_BLOCK = 256 
Multi-Processors: 4 
Max Threads per Multi Processor: 2048 
NUM_BLOCKS = 4 * 2048 / 256 = 32
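
For reference, the same figure can also be derived programmatically with the occupancy API instead of being hard-coded. A minimal sketch, assuming device 0 and the copy kernel above in the same file (the helper name fullOccupancyBlocks is made up for illustration):

#include <cuda_runtime.h>

template<unsigned int THREADS_PER_BLOCK>
unsigned int fullOccupancyBlocks() {
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // How many blocks of copy<THREADS_PER_BLOCK> can be resident on one SM,
    // taking register and shared memory limits into account as well.
    int blocksPerSM = 0;
    cudaOccupancyMaxActiveBlocksPerMultiprocessor(
        &blocksPerSM, copy<THREADS_PER_BLOCK>, THREADS_PER_BLOCK, 0);

    return static_cast<unsigned int>(blocksPerSM) * prop.multiProcessorCount;
}

On the GTX 760M (4 SMs, 2048 threads per SM) this should return 32 for 256-thread blocks, as long as the kernel is not limited by register or shared memory usage.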

However, my tests show that launching enough blocks so that every thread handles exactly one element consistently outperforms the "optimal" block count. Here are the timings for 400 MB of data:

bandwidth test by copying 400mb of data.
cudaMemcpy finished in 15.63ms. Bandwidth: 51.1838 GB/s
thrust::copy finished in 15.7218ms. Bandwidth: 50.8849 GB/s
my memcpy (195313 blocks) finished in 15.6208ms. Bandwidth: 51.2137 GB/s
my memcpy (32 blocks) finished in 16.8083ms. Bandwidth: 47.5956 GB/s
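
For completeness, a sketch of how such a measurement can be made with CUDA events (a minimal harness for illustration, not my exact benchmark code; error checking and warm-up runs omitted):

#include <cstdio>
#include <cuda_runtime.h>

int main() {
    const unsigned int size = 400u * 1000u * 1000u;  // 400 MB, divisible by sizeof(int2)
    const unsigned int threadsPerBlock = 256;

    void *src = nullptr, *dest = nullptr;
    cudaMalloc(&src, size);
    cudaMalloc(&dest, size);

    // One element per thread: enough blocks so that every int2 element gets its own thread.
    const unsigned int numElements = size / sizeof(int2);
    const unsigned int numBlocks = (numElements + threadsPerBlock - 1) / threadsPerBlock;

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    copy<threadsPerBlock><<<numBlocks, threadsPerBlock>>>(src, dest, size);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    // Each byte is read once and written once, hence the factor of 2.
    printf("my memcpy (%u blocks) finished in %.4fms. Bandwidth: %.4f GB/s\n",
           numBlocks, ms, 2.0 * size / (ms * 1e-3) / 1e9);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(src);
    cudaFree(dest);
    return 0;
}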

So my questions are:

Why is there a speed difference?

Are there any downsides to launching one thread per element when every element can be processed completely independently?


Is launching one thread per element always optimal for data-independent problems? - No. Imagine 1000 independent elements: launch 1000 threads on a 4-core machine and you will drown in thread/task switching overhead.
Just to be sure: did you take multiple measurements and look at the average of their runtimes/bandwidths? Benchmarks tend to vary quite a bit from run to run.
@BenSteffan Yes, the timings vary by less than +/-0.1 ms.
This question is about modern GPUs, which have essentially zero thread creation time and no thread switching/swapping.
2 Answers

Is launching one thread per element always optimal on the GPU for data-independent problems?
Not always. Let's consider 3 different implementations. In each case we'll assume we're dealing with a trivially parallelizable problem that involves, per thread, one element load, some "work", and one element store. In your copy example there is basically no work - just a load and a store.
  1. One element per thread. Each thread is doing 1 element load, the work, and 1 store. The GPU likes to have a lot of exposed parallel-issue-capable instructions per thread available, in order to hide latency. Your example consists of one load and one store per thread, ignoring other instructions like index arithmetic, etc. In your example GPU, you have 4 SMs, and each is capable of a maximum complement of 2048 threads (true for nearly all GPUs today), so the maximum in-flight complement is 8192 threads. So at most, 8192 loads can be issued to the memory pipe, then we're going to hit machine stalls until that data comes back from memory, so that the corresponding store instructions can be issued. In addition, for this case, we have overhead associated with retiring threadblocks and launching new threadblocks, since each block only handles 256 elements.

  2. Multiple elements per thread, not known at compile time. In this case, we have a loop. The compiler does not know the loop extent at compile time, so it may or may not unroll the loop. If it does not unroll the loop, then we have a load followed by a store per loop iteration. This doesn't give the compiler a good opportunity to reorder (independent) instructions, so the net effect may be the same as case 1 except that we have some additional overhead associated with processing the loop.

  3. Multiple elements per thread, known at compile time. You haven't really provided this example, but it is often the best scenario. In the parallelforall blog matrix transpose example, the writer of that (essentially copy) kernel chose to have each thread perform 8 elements of copy "work". The compiler then sees a loop:

      LOOP:  LD R0, in[idx];
             ST out[idx], R0;
             ...
             BRA  LOOP;
    

    which it can unroll (let's say) 8 times:

         LD R0, in[idx];
         ST out[idx], R0;
         LD R0, in[idx+1];
         ST out[idx+1], R0;
         LD R0, in[idx+2];
         ST out[idx+2], R0;
         LD R0, in[idx+3];
         ST out[idx+3], R0;
         LD R0, in[idx+4];
         ST out[idx+4], R0;
         LD R0, in[idx+5];
         ST out[idx+5], R0;
         LD R0, in[idx+6];
         ST out[idx+6], R0;
         LD R0, in[idx+7];
         ST out[idx+7], R0;
    

    and after that it can reorder the instructions, since the operations are independent:

         LD R0, in[idx];
         LD R1, in[idx+1];
         LD R2, in[idx+2];
         LD R3, in[idx+3];
         LD R4, in[idx+4];
         LD R5, in[idx+5];
         LD R6, in[idx+6];
         LD R7, in[idx+7];
         ST out[idx], R0;
         ST out[idx+1], R1;
         ST out[idx+2], R2;
         ST out[idx+3], R3;
         ST out[idx+4], R4;
         ST out[idx+5], R5;
         ST out[idx+6], R6;
         ST out[idx+7], R7;
    

    at the expense of some increased register pressure. The benefit here, as compared to the non-unrolled loop case, is that the first 8 LD instructions can all be issued - they are all independent. After issuing those, the thread will stall at the first ST instruction - until the corresponding data is actually returned from global memory. In the non-unrolled case, the machine can issue the first LD instruction, but immediately hits a dependent ST instruction, and so it may stall right there. The net of this is that in the first 2 scenarios, I was only able to have 8192 LD operations in flight to the memory subsystem, but in the 3rd case I was able to have 65536 LD instructions in flight. Does this provide a benefit? In some cases, it does. The benefit will vary depending on which GPU you are running on.

What we have done here, effectively, working with the compiler, is increase the number of instructions each thread can issue before it hits a stall. This is also referred to as increasing the exposed parallelism of this approach, via ILP. Whether it brings any benefit depends on your actual code, your actual GPU, and what else is going on in the GPU at the time. But increasing exposed parallelism with techniques like this is always a good strategy, because the ability to issue instructions is how the GPU hides the various latencies it has to deal with, so with this approach we have effectively improved the GPU's ability to hide latency.
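
To make case 3 concrete, here is a minimal sketch of what such a kernel could look like (my illustration, not the code from the blog post; ELEMENTS_PER_THREAD is a made-up template parameter, and the bounds check handles any tail that is not a full tile):

template<unsigned int THREADS_PER_BLOCK, unsigned int ELEMENTS_PER_THREAD>
__global__ static
void copy_unrolled(void* src, void* dest, unsigned int size) {
    using vector_type = int2;
    vector_type* src2 = reinterpret_cast<vector_type*>(src);
    vector_type* dest2 = reinterpret_cast<vector_type*>(dest);

    // Still only correct when size % sizeof(vector_type) == 0
    auto numElements = size / sizeof(vector_type);

    // Each block now covers a "tile" of THREADS_PER_BLOCK * ELEMENTS_PER_THREAD elements
    const auto tile = THREADS_PER_BLOCK * ELEMENTS_PER_THREAD;

    for (auto base = tile * blockIdx.x + threadIdx.x; base < numElements; base += gridDim.x * tile) {
        #pragma unroll
        for (unsigned int i = 0; i < ELEMENTS_PER_THREAD; ++i) {
            auto id = base + i * THREADS_PER_BLOCK;  // stride keeps accesses coalesced
            if (id < numElements)
                dest2[id] = src2[id];
        }
    }
}

With, say, ELEMENTS_PER_THREAD = 8, each thread now has 8 independent loads it can issue before stalling on the first dependent store, which is the effect described above; whether that actually pays off on a given GPU is something you would want to measure.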
"Why is there a speed difference?"
This is hard to answer without carefully analyzing the code. However, launching only enough threads to fully satisfy the GPU's instantaneous carrying capacity is often not a good strategy, possibly due to the "tail effect" or other kinds of inefficiency. Blocks may also be limited by other factors, such as register or shared memory usage. Careful analysis, and possibly a study of the generated machine code, is usually required to fully answer such a question. But loop overhead may measurably affect the comparison; this is essentially my case 2 versus my case 1.
(Note that the memory indices in my "pseudo" machine code examples are not what you would expect for a well-written grid-stride copy loop - they are just there to demonstrate the benefits of unrolling and of instruction reordering by the compiler.)

Hey, your third case seems like a very good idea when the elements actually need some real processing. For the memcpy example, processing multiple elements per thread does not seem to improve performance. Would you generally recommend processing a fixed number of elements per thread and adjusting the number of blocks based on the problem rather than the hardware?
Your one-element-per-thread case already saturates the memory bus, so exposing more parallelism brings no benefit in that situation. But that pattern does not necessarily hold for all GPUs; it will depend on the GPU and on factors such as the ratio of memory bus bandwidth to the thread/request complement. Your 760M does not have a particularly high ratio of bus bandwidth to thread complement. Most GPUs today seem to be reasonably balanced at one transaction per thread, but Fermi GPUs certainly needed more transactions in flight to saturate the memory bus.
In short, be careful about deriving general principles from observations on a single GPU type. That said, trying to expose more parallelism/work on the GPU is usually a good idea. Once you hit the limit on a particular GPU, exposing additional parallelism may simply stop paying off on that GPU.

One-sentence answer: when you create a thread for every element, you pay the thread setup cost - at the very least, copying the kernel arguments from constant memory into registers - and paying that for every single element is wasteful.
