I am writing an n-body simulation program. The basic flow is:
- Prepare CUDA memory
loop {
- Copy data to CUDA
- Launch kernel
- Copy data to host
- Operations using data (drawing etc.)
}
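The loop above can be sketched with CUDA runtime API calls. This is a minimal illustration, not the actual program: only one of the seven arrays is shown, and the function names, block size, and drawing step are placeholders.

```cuda
#include <cuda_runtime.h>

// Host-side sketch of the loop above (illustrative names only).
void runSimulation(int particlesNumber, int steps) {
    size_t bytes = particlesNumber * sizeof(float);

    // "Prepare CUDA memory" -- one array shown; the real program also
    // allocates y/z positions, x/y/z forces, and masses.
    float *deviceXpos;
    cudaMalloc(&deviceXpos, bytes);
    float *hostXpos = new float[particlesNumber];

    int blockSize = 256;  // assumption, not stated in the question
    int gridSize = (particlesNumber + blockSize - 1) / blockSize;

    for (int step = 0; step < steps; step++) {
        // Copy data to CUDA
        cudaMemcpy(deviceXpos, hostXpos, bytes, cudaMemcpyHostToDevice);
        // Launch kernel
        // calculateForcesCuda<<<gridSize, blockSize>>>(...);
        // Copy data to host
        cudaMemcpy(hostXpos, deviceXpos, bytes, cudaMemcpyDeviceToHost);
        // Operations using data (drawing etc.)
    }

    cudaFree(deviceXpos);
    delete[] hostXpos;
}
```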
I noticed that in the kernel, writing data to global device memory accounts for almost 90% of the execution time. Here is the kernel code:
__global__ void calculateForcesCuda(float *deviceXpos, float *deviceYpos, float *deviceZpos,
                                    float *deviceXforces, float *deviceYforces, float *deviceZforces,
                                    float *deviceMasses, int particlesNumber) {
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    if (tid < particlesNumber) {  // was "tid <= particlesNumber", which accesses one element past the end
        float particleXpos = deviceXpos[tid];
        float particleYpos = deviceYpos[tid];
        float particleZpos = deviceZpos[tid];
        float xForce = 0.0f;
        float yForce = 0.0f;
        float zForce = 0.0f;
        for (int index = 0; index < particlesNumber; index++) {
            if (tid != index) {
                float otherXpos = deviceXpos[index];
                float otherYpos = deviceYpos[index];
                float otherZpos = deviceZpos[index];
                float mass = deviceMasses[index];
                float distx = particleXpos - otherXpos;
                float disty = particleYpos - otherYpos;
                float distz = particleZpos - otherZpos;
                // sqrtf instead of sqrt avoids a double-precision detour on float data;
                // the 0.01f term softens the distance to avoid division by zero
                float distance = sqrtf((distx*distx + disty*disty + distz*distz) + 0.01f);
                xForce += 10.0f * mass / distance * (otherXpos - particleXpos);
                yForce += 10.0f * mass / distance * (otherYpos - particleYpos);
                zForce += 10.0f * mass / distance * (otherZpos - particleZpos);
            }
        }
        deviceXforces[tid] += xForce;
        deviceYforces[tid] += yForce;
        deviceZforces[tid] += zForce;
    }
}
The device is a GTX 970. Execution time is about 8.0 seconds, but after adding the flags -gencode arch=compute_52,code=sm_52 it improves to about 6.7 seconds. After commenting out the writes to global device memory:
deviceXforces[tid] += xForce;
deviceYforces[tid] += yForce;
deviceZforces[tid] += zForce;
In that case, total execution time drops to about 0.92 seconds, which suggests that the writes to global device memory take about 86% of the execution time. Is there a way to improve the performance of these writes?
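As a side note on measurement, the kernel's own time can be isolated with CUDA events instead of timing the whole copy/launch/copy loop on the host. A minimal sketch, assuming the grid/block configuration and kernel arguments from the code above:

```cuda
#include <cuda_runtime.h>

// Times a single kernel launch in milliseconds using CUDA events.
// gridSize/blockSize and the kernel arguments are assumptions.
float timeKernelMs(/* device pointers and particlesNumber as above */) {
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    // calculateForcesCuda<<<gridSize, blockSize>>>(...);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);  // wait until the kernel has finished

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    return ms;
}
```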
rnorm3d() function. - njuffa
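njuffa's comment presumably refers to CUDA's reciprocal-norm math functions; for float data the single-precision variant is rnorm3df(), which computes 1/sqrtf(x*x + y*y + z*z) in one call. A sketch of the inner-loop body rewritten that way (note that rnorm3df() has no room for the +0.01f softening term inside the square root, so the result differs slightly from the original formula):

```cuda
// Inner-loop body using rnorm3df() instead of sqrtf() plus three divisions.
// Caveat: this drops the +0.01f softening term, so the math is not identical.
float distx = particleXpos - otherXpos;
float disty = particleYpos - otherYpos;
float distz = particleZpos - otherZpos;
float invDistance = rnorm3df(distx, disty, distz);  // 1 / sqrt(dx^2 + dy^2 + dz^2)
float s = 10.0f * mass * invDistance;               // shared factor of all three terms
xForce += s * (otherXpos - particleXpos);
yForce += s * (otherYpos - particleYpos);
zForce += s * (otherZpos - particleZpos);
```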