Instruction-level parallelism (ILP) and out-of-order execution on NVIDIA GPUs

8
Do NVIDIA GPUs support out-of-order execution?
My first thought is that they don't, since they don't include such expensive hardware. However, when reading the CUDA programming guide, the guide suggests using instruction-level parallelism (ILP) to improve performance.
Isn't ILP a feature that only hardware supporting out-of-order execution can take advantage of? Or does NVIDIA's ILP simply mean instruction reordering at the compiler level, so that the order is still fixed at run time? In other words, is it just that the compiler and/or the programmer have to arrange the order of the instructions in such a way that ILP is achieved at run time through in-order execution?

6
Out-of-order processors are not required to exploit instruction-level parallelism. An in-order processor with superscalar execution can benefit just as well. - njuffa
2 Answers

6
Pipelining is a common ILP technique and is certainly implemented on NVIDIA's GPUs. I guess you agree that pipelining does not rely on out-of-order execution. Besides, NVIDIA GPUs have had multiple warp schedulers (2 or 4) since compute capability 2.0. If your code has 2 (or more) consecutive and independent instructions in a thread (or the compiler reorders it that way), the schedulers can exploit this ILP as well.
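As a minimal sketch of what "2 (or more) consecutive and independent instructions" looks like (my illustration, not part of the original answer; the kernel names and constants are arbitrary), compare a single dependency chain with two interleaved independent chains. An in-order, dual-issue scheduler can overlap the FMAs of the second kernel without any out-of-order hardware:

__global__ void dependentChain(float *d_out, float x) {
    float a = x;
    // Each FMA depends on the previous one: issue stalls on the pipeline latency.
    for (int i = 0; i < 256; i++) a = a * x + 1.0f;
    d_out[threadIdx.x + blockIdx.x * blockDim.x] = a;
}

__global__ void independentChains(float *d_out, float x) {
    float a = x, b = -x;               // two independent accumulators
    for (int i = 0; i < 128; i++) {
        a = a * x + 1.0f;              // chain A
        b = b * x + 1.0f;              // chain B: no dependency on chain A
    }
    d_out[threadIdx.x + blockIdx.x * blockDim.x] = a + b;
}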
Here is a detailed explanation of how a 2-wide warp scheduler and pipelining work together: How do nVIDIA CC 2.1 GPU warp schedulers issue 2 instructions at a time for a warp?

Also, check out Vasily Volkov's presentation at GTC 2010. He experimentally found out how ILP improves CUDA code performance: http://www.cs.berkeley.edu/~volkov/volkov10-GTC.pdf

As for out-of-order execution on GPUs, I don't think so. Hardware instruction reordering, speculative execution and all that stuff are too expensive to implement per SM, as you know. And thread-level parallelism can fill the gap left by the absence of out-of-order execution: when a true dependency is encountered, some other warp can kick in and fill the pipeline.

1
The code below reports an example of instruction-level parallelism (ILP).
In the example, the __global__ function simply performs an assignment between two arrays. For the ILP = 1 case, we have as many threads as the number N of array elements, so that each thread performs a single assignment. For the ILP = 2 case, instead, we have N/2 threads, each processing 2 elements. In general, for the ILP = k case, we have N/k threads, each processing k elements. For instance, with ILP = 2 and blockDim.x = 256, thread 0 of block 1 processes elements 512 and 768.
Besides the code, hereafter I also report the timings carried out on an NVIDIA GT920M (Kepler architecture) for different values of N and ILP. As can be seen:
  1. for large values of N, a memory bandwidth close to the GT920M card's maximum of 14.4 GB/s is reached;
  2. for any fixed value of N, changing the value of ILP does not change the performance.

Concerning point 2, I have also tested the same code on Maxwell and observed the same behavior (no performance change with ILP). To see changes in efficiency and performance with ILP for the Kepler architecture, please refer to the answer The efficiency and performance of ILP for the NVIDIA Kepler architecture, which also reports tests for the Fermi architecture.

The memory speed has been calculated by the following formula:

(2.f * 4.f * N * numITER) / (1e9 * timeTotal * 1e-3)

where

4.f * N * numITER

is the number of bytes read (or written),

2.f * 4.f * N * numITER

is the total number of bytes read and written, and

timeTotal * 1e-3

is the time in seconds (timeTotal is expressed in milliseconds).
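As a quick sanity check (my own arithmetic, not part of the original answer): for the largest run below, N = 16777216 and numITER = 100, so 2 * 4 * 16777216 * 100 ≈ 1.34e10 bytes are moved in total; the reported 12.9 GB/s therefore corresponds to timeTotal ≈ 1.34e10 / 12.9e9 * 1e3 ≈ 1040 ms, i.e. roughly 10 ms per kernel launch.

Code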
// --- GT920m - 14.4 GB/s
//     http://gpuboss.com/gpus/GeForce-GTX-280M-vs-GeForce-920M

#include<stdio.h>
#include<iostream>

#include "Utilities.cuh"
#include "TimingGPU.cuh"

#define BLOCKSIZE    32

#define DEBUG

/****************************************/
/* INSTRUCTION LEVEL PARALLELISM KERNEL */
/****************************************/
__global__ void ILPKernel(const int * __restrict__ d_a, int * __restrict__ d_b, const int ILP, const int N) {

    // Each thread copies ILP elements spaced blockDim.x apart, so that the ILP
    // loads/stores are mutually independent and warp accesses stay coalesced.
    const int tid = threadIdx.x + blockIdx.x * blockDim.x * ILP;

    if (tid >= N) return;

    // Note: only the first element is bounds-checked; the code assumes N is a
    // multiple of blockDim.x * ILP, as in all the runs reported below.
    for (int j = 0; j < ILP; j++) d_b[tid + j * blockDim.x] = d_a[tid + j * blockDim.x];

}

/********/
/* MAIN */
/********/
int main() {

    //const int N = 8192;
    const int N = 524288 * 32;
    //const int N = 1048576;
    //const int N = 262144;
    //const int N = 2048;

    const int numITER = 100;

    const int ILP = 16;

    TimingGPU timerGPU;

    int *h_a = (int *)malloc(N * sizeof(int));
    int *h_b = (int *)malloc(N * sizeof(int));

    for (int i = 0; i<N; i++) {
        h_a[i] = 2;
        h_b[i] = 1;
    }

    int *d_a; gpuErrchk(cudaMalloc(&d_a, N * sizeof(int)));
    int *d_b; gpuErrchk(cudaMalloc(&d_b, N * sizeof(int)));

    gpuErrchk(cudaMemcpy(d_a, h_a, N * sizeof(int), cudaMemcpyHostToDevice));
    gpuErrchk(cudaMemcpy(d_b, h_b, N * sizeof(int), cudaMemcpyHostToDevice));

    /**************/
    /* ILP KERNEL */
    /**************/
    float timeTotal = 0.f;
    for (int k = 0; k < numITER; k++) {
        timerGPU.StartCounter();
        ILPKernel<<<iDivUp(N / ILP, BLOCKSIZE), BLOCKSIZE>>>(d_a, d_b, ILP, N);
#ifdef DEBUG
        gpuErrchk(cudaPeekAtLastError());
        gpuErrchk(cudaDeviceSynchronize());
#endif
        timeTotal = timeTotal + timerGPU.GetCounter();
    }

    printf("Bandwidth = %f GB / s; Num blocks = %d\n", (2.f * 4.f * N * numITER) / (1e6 * timeTotal), iDivUp(N / ILP, BLOCKSIZE));
    gpuErrchk(cudaMemcpy(h_b, d_b, N * sizeof(int), cudaMemcpyDeviceToHost));
    for (int i = 0; i < N; i++) if (h_a[i] != h_b[i]) { printf("Error at i = %i for kernel0! Host = %i; Device = %i\n", i, h_a[i], h_b[i]); return 1; }

    return 0;

}
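
The code relies on Utilities.cuh and TimingGPU.cuh, the answerer's support files providing gpuErrchk, iDivUp and TimingGPU. For readers who want to compile the example standalone, a minimal sketch of those helpers could look as follows; these are my reconstructions of commonly used definitions, not the author's originals:

#include <cstdio>
#include <cstdlib>

// Standard CUDA error-checking macro (reconstruction, not the author's original).
#define gpuErrchk(ans) { gpuAssert((ans), __FILE__, __LINE__); }
inline void gpuAssert(cudaError_t code, const char *file, int line) {
    if (code != cudaSuccess) {
        fprintf(stderr, "GPUassert: %s %s %d\n", cudaGetErrorString(code), file, line);
        exit(code);
    }
}

// Integer division rounded upwards, used to compute the grid size.
inline int iDivUp(int a, int b) { return (a % b != 0) ? (a / b + 1) : (a / b); }

// Minimal event-based stand-in for the TimingGPU class (assumed interface:
// StartCounter() starts timing, GetCounter() returns elapsed milliseconds).
struct TimingGPU {
    cudaEvent_t start_, stop_;
    void StartCounter() {
        cudaEventCreate(&start_);
        cudaEventCreate(&stop_);
        cudaEventRecord(start_, 0);
    }
    float GetCounter() {
        cudaEventRecord(stop_, 0);
        cudaEventSynchronize(stop_);
        float ms = 0.f;
        cudaEventElapsedTime(&ms, start_, stop_);
        cudaEventDestroy(start_);
        cudaEventDestroy(stop_);
        return ms;
    }
};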

Performance

GT 920M
N = 512  - ILP = 1  - BLOCKSIZE = 512 (1 block  - each block processes 512 elements)  - Bandwidth = 0.092 GB / s

N = 1024 - ILP = 1  - BLOCKSIZE = 512 (2 blocks - each block processes 512 elements)  - Bandwidth = 0.15  GB / s

N = 2048 - ILP = 1  - BLOCKSIZE = 512 (4 blocks - each block processes 512 elements)  - Bandwidth = 0.37  GB / s
N = 2048 - ILP = 2  - BLOCKSIZE = 256 (4 blocks - each block processes 512 elements)  - Bandwidth = 0.36  GB / s
N = 2048 - ILP = 4  - BLOCKSIZE = 128 (4 blocks - each block processes 512 elements)  - Bandwidth = 0.35  GB / s
N = 2048 - ILP = 8  - BLOCKSIZE =  64 (4 blocks - each block processes 512 elements)  - Bandwidth = 0.26  GB / s
N = 2048 - ILP = 16 - BLOCKSIZE =  32 (4 blocks - each block processes 512 elements)  - Bandwidth = 0.31  GB / s

N = 4096 - ILP = 1  - BLOCKSIZE = 512 (8 blocks - each block processes 512 elements)  - Bandwidth = 0.53  GB / s
N = 4096 - ILP = 2  - BLOCKSIZE = 256 (8 blocks - each block processes 512 elements)  - Bandwidth = 0.61  GB / s
N = 4096 - ILP = 4  - BLOCKSIZE = 128 (8 blocks - each block processes 512 elements)  - Bandwidth = 0.74  GB / s
N = 4096 - ILP = 8  - BLOCKSIZE =  64 (8 blocks - each block processes 512 elements)  - Bandwidth = 0.74  GB / s
N = 4096 - ILP = 16 - BLOCKSIZE =  32 (8 blocks - each block processes 512 elements)  - Bandwidth = 0.56  GB / s

N = 8192 - ILP = 1  - BLOCKSIZE = 512 (16 blocks - each block processes 512 elements) - Bandwidth = 1.4  GB / s
N = 8192 - ILP = 2  - BLOCKSIZE = 256 (16 blocks - each block processes 512 elements) - Bandwidth = 1.1  GB / s
N = 8192 - ILP = 4  - BLOCKSIZE = 128 (16 blocks - each block processes 512 elements) - Bandwidth = 1.5  GB / s
N = 8192 - ILP = 8  - BLOCKSIZE =  64 (16 blocks - each block processes 512 elements) - Bandwidth = 1.4  GB / s
N = 8192 - ILP = 16 - BLOCKSIZE =  32 (16 blocks - each block processes 512 elements) - Bandwidth = 1.3  GB / s

...

N = 16777216 - ILP = 1  - BLOCKSIZE = 512 (32768 blocks - each block processes 512 elements) - Bandwidth = 12.9  GB / s
N = 16777216 - ILP = 2  - BLOCKSIZE = 256 (32768 blocks - each block processes 512 elements) - Bandwidth = 12.8  GB / s
N = 16777216 - ILP = 4  - BLOCKSIZE = 128 (32768 blocks - each block processes 512 elements) - Bandwidth = 12.8  GB / s
N = 16777216 - ILP = 8  - BLOCKSIZE =  64 (32768 blocks - each block processes 512 elements) - Bandwidth = 12.7  GB / s
N = 16777216 - ILP = 16 - BLOCKSIZE =  32 (32768 blocks - each block processes 512 elements) - Bandwidth = 12.6  GB / s
