CUDA: Is there a faster way of writing to global memory?

Question

3

I'm writing an n-body simulation. The basic operations are:

-Prepare CUDA memory
 loop {
    -Copy data to CUDA
    -Launch kernel
    -Copy data to host
    -Operations using data (drawing etc.)
 }

I noticed that, inside the kernel, writing data to global device memory takes almost 90% of the time. Here is the kernel:

 __global__ void calculateForcesCuda(float *deviceXpos, float *deviceYpos, float *deviceZpos,
                                    float *deviceXforces, float *deviceYforces, float *deviceZforces,
                                    float *deviceMasses, int particlesNumber) {
     int tid = threadIdx.x + blockIdx.x * blockDim.x;
     if (tid < particlesNumber) {  // valid indices are 0..particlesNumber-1
         float particleXpos = deviceXpos[tid];
         float particleYpos = deviceYpos[tid];
         float particleZpos = deviceZpos[tid];
         float xForce = 0.0f;
         float yForce = 0.0f;
         float zForce = 0.0f;
         for (int index=0; index<particlesNumber; index++) {
             if (tid != index) {
                 float otherXpos = deviceXpos[index];
                 float otherYpos = deviceYpos[index];
                 float otherZpos = deviceZpos[index];
                 float mass = deviceMasses[index];
                 float distx = particleXpos - otherXpos;
                 float disty = particleYpos - otherYpos;
                 float distz = particleZpos - otherZpos;
                 float distance = sqrt((distx*distx + disty*disty + distz*distz) + 0.01f);
                 xForce += 10.0f * mass / distance * (otherXpos - particleXpos);
                 yForce += 10.0f * mass / distance * (otherYpos - particleYpos);
                 zForce += 10.0f * mass / distance * (otherZpos - particleZpos);
             }
         }
         deviceXforces[tid] += xForce;
         deviceYforces[tid] += yForce;      
         deviceZforces[tid] += zForce;
     }
 }

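The device is a GTX 970. Execution time was about 8.0 seconds, and adding the flags -gencode arch=compute_52,code=sm_52 improved it to about 6.7 seconds. Then I commented out the code that writes to global device memory: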

deviceXforces[tid] += xForce;
deviceYforces[tid] += yForce;      
deviceZforces[tid] += zForce;

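With those lines commented out, total execution time drops to about 0.92 seconds, which would mean that writing to global device memory accounts for about 86% of the execution time. Is there a way to make these writes faster?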

- xlog

2

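You are misunderstanding what is happening. The memory writes are not the bottleneck in this code. Removing them just lets the compiler optimize away most of the code. - talonmies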

@talonmies Oh my, you are so right. So it is the computation itself that is slow. I'll leave this question up in case someone else makes the same mistake. - xlog
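To illustrate talonmies' point, here is a minimal sketch of a fairer way to measure the cost of the stores. It is not xlog's code: `busyKernel`, `timeBusyKernel`, and the `storeResults` flag are hypothetical names made up for this example. The store is guarded by a condition that depends on the computed value, so the compiler has to keep the arithmetic even when the store is switched off at run time.

```cuda
#include <cuda_runtime.h>

// Hypothetical stand-in for an expensive kernel. The store is guarded by a
// condition that depends on the computed value, so the compiler has to keep
// the loop even when storeResults is false at run time.
__global__ void busyKernel(float *out, int n, int iterations, bool storeResults) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid >= n) return;
    float acc = 0.0f;
    for (int i = 0; i < iterations; ++i) {
        acc += rsqrtf((float)(i + tid) + 1.0f);
    }
    // acc is a sum of positive terms, so "acc < 0" is never true in practice;
    // it only keeps the computation observable.
    if (storeResults || acc < 0.0f) {
        out[tid] = acc;
    }
}

// Times a single launch with CUDA events. Returns milliseconds, or -1 on error.
float timeBusyKernel(float *deviceOut, int n, int iterations, bool storeResults) {
    const int threads = 256;
    const int blocks = (n + threads - 1) / threads;
    cudaEvent_t start, stop;
    if (cudaEventCreate(&start) != cudaSuccess) return -1.0f;
    if (cudaEventCreate(&stop) != cudaSuccess) {
        cudaEventDestroy(start);
        return -1.0f;
    }
    cudaEventRecord(start);
    busyKernel<<<blocks, threads>>>(deviceOut, n, iterations, storeResults);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);
    float ms = -1.0f;
    if (cudaGetLastError() != cudaSuccess ||
        cudaEventElapsedTime(&ms, start, stop) != cudaSuccess) {
        ms = -1.0f;
    }
    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```

Calling `timeBusyKernel` once with `storeResults` set to true and once with false gives two timings in which the arithmetic is the same, so the difference is a more honest estimate of what the stores cost than commenting them out in the source.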

1

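I doubt the computation is the problem. The memory loads inside the loop will be the biggest issue. Start thinking about data reuse and cache performance. - talonmies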

2

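As talonmies says, the code is most likely memory bound. As an aside, though: performance-wise, the computation in this code would benefit from using the rnorm3d() function. - njuffa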
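A small sketch of what njuffa's aside could look like in practice, assuming a CUDA toolkit that provides `rnorm3df` (the single-precision counterpart of `rnorm3d`). The function and kernel names below are hypothetical. Note that `rnorm3df` has no softening term, so the variant that keeps the question's `0.01f` uses `rsqrtf` instead; both replace a `sqrt` followed by a division with a single reciprocal square root.

```cuda
#include <cuda_runtime.h>

// Softened reciprocal distance: the same quantity as 1 / distance in the
// question's kernel, including its 0.01f term, computed with one rsqrtf.
__device__ float softenedInverseDistance(float dx, float dy, float dz) {
    return rsqrtf(dx * dx + dy * dy + dz * dz + 0.01f);
}

// Reciprocal distance without softening, using rnorm3df from the CUDA math API.
__device__ float plainInverseDistance(float dx, float dy, float dz) {
    return rnorm3df(dx, dy, dz);
}

// Hypothetical demo kernel: evaluates both variants for n displacement vectors.
__global__ void inverseDistanceDemo(const float *dx, const float *dy, const float *dz,
                                    float *softened, float *plain, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {
        softened[tid] = softenedInverseDistance(dx[tid], dy[tid], dz[tid]);
        plain[tid] = plainInverseDistance(dx[tid], dy[tid], dz[tid]);
    }
}
```

In the force loop, the result would be multiplied by `10.0f * mass` once and then by each displacement component, rather than dividing by `distance` three times.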

1 Answer

- CygnusX1 · Accepted Answer

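In this kind of computation memory is usually a bottleneck, even if it does not take the 90% of the time that you measured. I would suggest two things.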

Load `device...[index]` into shared memory

Currently, all threads read the same deviceXpos[index], deviceYpos[index], deviceZpos[index] and deviceMasses[index]. Instead, you could load them into shared memory:

static const int blockSize = ....;

__shared__ float shXpos[blockSize];
__shared__ float shYpos[blockSize];
__shared__ float shZpos[blockSize];
__shared__ float shMasses[blockSize];
for (int mainIndex=0; mainIndex<particlesNumber; mainIndex+=blockSize) {
    __syncthreads(); //ensure computation from previous iteration has completed
    shXpos[threadIdx.x] = deviceXpos[mainIndex + threadIdx.x];
    shYpos[threadIdx.x] = deviceYpos[mainIndex + threadIdx.x];
    shZpos[threadIdx.x] = deviceZpos[mainIndex + threadIdx.x];
    shMasses[threadIdx.x] = deviceMasses[mainIndex + threadIdx.x];
    __syncthreads(); //ensure all data is read before computation starts
    for (int index=0; index<blockSize; ++index) {
        .... //your computation, using sh....[index] values
    }
}

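This should reduce the number of global memory reads, because each thread loads a different element once per tile and the whole block then reuses it from shared memory, instead of every thread reading the same values from global memory.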
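For reference, here is one way the fragment above might be filled in as a complete kernel. This is a sketch under assumptions, not the answerer's exact code: the name `calculateForcesTiled` and the tile size `kBlockSize = 256` are made up, the kernel must be launched with `blockDim.x == kBlockSize`, out-of-range tile slots are padded with zero mass so that `particlesNumber` need not be a multiple of the tile size, and `rsqrtf` replaces the `sqrt` plus division, which can change results slightly in the last bits.

```cuda
#include <cuda_runtime.h>

constexpr int kBlockSize = 256;  // assumed tile size; must equal blockDim.x

__global__ void calculateForcesTiled(const float *xPos, const float *yPos, const float *zPos,
                                     const float *masses,
                                     float *xForces, float *yForces, float *zForces,
                                     int particlesNumber) {
    __shared__ float shXpos[kBlockSize];
    __shared__ float shYpos[kBlockSize];
    __shared__ float shZpos[kBlockSize];
    __shared__ float shMasses[kBlockSize];

    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    // No early return: every thread must reach each __syncthreads().
    bool active = tid < particlesNumber;
    float px = active ? xPos[tid] : 0.0f;
    float py = active ? yPos[tid] : 0.0f;
    float pz = active ? zPos[tid] : 0.0f;
    float fx = 0.0f, fy = 0.0f, fz = 0.0f;

    for (int mainIndex = 0; mainIndex < particlesNumber; mainIndex += kBlockSize) {
        int src = mainIndex + threadIdx.x;
        bool valid = src < particlesNumber;
        __syncthreads();  // previous tile is no longer in use
        shXpos[threadIdx.x] = valid ? xPos[src] : 0.0f;
        shYpos[threadIdx.x] = valid ? yPos[src] : 0.0f;
        shZpos[threadIdx.x] = valid ? zPos[src] : 0.0f;
        shMasses[threadIdx.x] = valid ? masses[src] : 0.0f;  // zero mass adds no force
        __syncthreads();  // tile is fully loaded
        if (active) {
            for (int index = 0; index < kBlockSize; ++index) {
                if (mainIndex + index != tid) {
                    float dx = shXpos[index] - px;
                    float dy = shYpos[index] - py;
                    float dz = shZpos[index] - pz;
                    float scale = 10.0f * shMasses[index] *
                                  rsqrtf(dx * dx + dy * dy + dz * dz + 0.01f);
                    fx += scale * dx;
                    fy += scale * dy;
                    fz += scale * dz;
                }
            }
        }
    }
    if (active) {
        xForces[tid] += fx;
        yForces[tid] += fy;
        zForces[tid] += fz;
    }
}
```

A launch would look like `calculateForcesTiled<<<(n + kBlockSize - 1) / kBlockSize, kBlockSize>>>(...)`, with the pointer arguments in the order shown in the signature.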

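Be aware, however, that this suggestion may not be as effective if the driver already manages the L1 cache well for your kernel. It is still worth trying!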

Process more than one (receiving) particle at a time

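You may want to compute the motion of several particles at once. Instead of having a single {particleX/Y/Zpos, x/y/zForce} set representing one particle that receives the force, each thread can process a few receiving particles at the same time.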

This way, each source particle is loaded only once in the loop and is then applied to several receivers.

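This may significantly reduce memory pressure, but it also increases the register count. With too many registers per thread, you will not be able to launch as many threads.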
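A minimal sketch of the multiple-receiver idea, assuming two receivers per thread. The kernel name `calculateForcesMultiReceiver` and the constant `kReceivers` are hypothetical. Each thread handles particles `tid`, `tid + stride`, and so on, and every source particle is read from global memory once per thread and applied to all of that thread's receivers. The explicit self-interaction check is dropped here because a zero displacement contributes exactly zero force once the `0.01f` softening keeps the scale finite.

```cuda
#include <cuda_runtime.h>

constexpr int kReceivers = 2;  // assumed number of receiving particles per thread

__global__ void calculateForcesMultiReceiver(const float *xPos, const float *yPos,
                                             const float *zPos, const float *masses,
                                             float *xForces, float *yForces, float *zForces,
                                             int particlesNumber) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    int stride = blockDim.x * gridDim.x;

    float px[kReceivers], py[kReceivers], pz[kReceivers];
    float fx[kReceivers], fy[kReceivers], fz[kReceivers];
#pragma unroll
    for (int r = 0; r < kReceivers; ++r) {
        int receiver = tid + r * stride;
        bool valid = receiver < particlesNumber;
        px[r] = valid ? xPos[receiver] : 0.0f;
        py[r] = valid ? yPos[receiver] : 0.0f;
        pz[r] = valid ? zPos[receiver] : 0.0f;
        fx[r] = fy[r] = fz[r] = 0.0f;
    }

    for (int index = 0; index < particlesNumber; ++index) {
        // The source particle is loaded once and reused for every receiver.
        float sx = xPos[index];
        float sy = yPos[index];
        float sz = zPos[index];
        float scaledMass = 10.0f * masses[index];
#pragma unroll
        for (int r = 0; r < kReceivers; ++r) {
            float dx = sx - px[r];
            float dy = sy - py[r];
            float dz = sz - pz[r];
            float scale = scaledMass * rsqrtf(dx * dx + dy * dy + dz * dz + 0.01f);
            fx[r] += scale * dx;
            fy[r] += scale * dy;
            fz[r] += scale * dz;
        }
    }

#pragma unroll
    for (int r = 0; r < kReceivers; ++r) {
        int receiver = tid + r * stride;
        if (receiver < particlesNumber) {
            xForces[receiver] += fx[r];
            yForces[receiver] += fy[r];
            zForces[receiver] += fz[r];
        }
    }
}
```

The grid only needs enough threads to cover `particlesNumber / kReceivers` particles, rounded up. This sketch reads sources straight from global memory; it could also be combined with the shared-memory tiling suggested earlier.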

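Check how many registers your threads already use, and consult the CUDA occupancy calculator to see how many more you can afford. Perhaps lowering occupancy from 1 to 0.5 or 0.75 while processing more particles per thread will be beneficial? You will need to experiment, as the answer may differ from GPU to GPU.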
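The same numbers can also be queried from code through the CUDA runtime API. The sketch below assumes a placeholder kernel, `exampleKernel`, standing in for the real force kernel; the helper name `theoreticalOccupancy` is likewise hypothetical. It reports the registers per thread chosen by the compiler and the theoretical occupancy for a given block size.

```cuda
#include <cuda_runtime.h>

// Placeholder kernel; substitute the kernel you actually want to inspect.
__global__ void exampleKernel(float *data, int n) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < n) {
        data[tid] = 2.0f * data[tid];
    }
}

// Returns the theoretical occupancy (0..1) of exampleKernel for the given
// block size and stores its register count in *registersPerThread.
// Returns -1 on any CUDA error.
float theoreticalOccupancy(int blockSize, int *registersPerThread) {
    cudaFuncAttributes attr;
    if (cudaFuncGetAttributes(&attr, exampleKernel) != cudaSuccess) return -1.0f;
    *registersPerThread = attr.numRegs;

    int device = 0;
    cudaDeviceProp prop;
    if (cudaGetDevice(&device) != cudaSuccess) return -1.0f;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess) return -1.0f;

    int blocksPerSm = 0;
    if (cudaOccupancyMaxActiveBlocksPerMultiprocessor(&blocksPerSm, exampleKernel,
                                                      blockSize, 0) != cudaSuccess) {
        return -1.0f;
    }
    // Resident threads per multiprocessor divided by the hardware maximum.
    return (float)(blocksPerSm * blockSize) / (float)prop.maxThreadsPerMultiProcessor;
}
```

Register usage is also printed at compile time when nvcc is run with `-Xptxas -v`. Comparing the occupancy before and after adding more receivers per thread gives a starting point for the experiments suggested above.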
