I have been learning CUDA for a while now, and I have run into the following problem:
Here are the relevant steps.

Copy to GPU:
int *B;
// ...
int *dev_B;
// initialize B = 0
cudaMalloc((void**)&dev_B, Nel*Nface*sizeof(int));
cudaMemcpy(dev_B, B, Nel*Nface*sizeof(int), cudaMemcpyHostToDevice);
//...
// Execute on the GPU the following kernel, which is supposed to fill
// the dev_B matrix with integers
findNeiborElem <<< Nblocks, Nthreads >>>(dev_B, dev_MSH, dev_Nel, dev_Npel, dev_Nface, dev_FC);
Copy back to CPU:
cudaMemcpy(B, dev_B, Nel*Nface*sizeof(int), cudaMemcpyDeviceToHost);
Copying array B to dev_B takes only a fraction of a second. However, copying dev_B back to B takes forever.
The findNeiborElem function involves a loop for each thread; it looks like this:
__global__ void findNeiborElem(int *dev_B, int *dev_MSH, int *dev_Nel, int *dev_Npel, int *dev_Nface, int *dev_FC)
{
    int tid = threadIdx.x + blockIdx.x * blockDim.x;
    while (tid < dev_Nel[0]) {
        for (int j = 1; j <= dev_Nel[0]; j++) {
            // do some calculations
            dev_B[ind(tid, 1, dev_Nel[0])] = j; // j in most cases does not go all the way to Nel
            break;
        }
        tid += blockDim.x * gridDim.x;
    }
}
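Since a kernel launch returns to the host immediately, any launch or execution error only surfaces at the next synchronizing call. A minimal sketch of checking the launch explicitly (the error-handling scaffold is my addition, not part of the original program; it assumes `<cstdio>` is included):

```cuda
// Check the launch itself, then force completion so that the kernel's
// runtime (and any execution error) is not absorbed by a later cudaMemcpy.
findNeiborElem<<<Nblocks, Nthreads>>>(dev_B, dev_MSH, dev_Nel, dev_Npel, dev_Nface, dev_FC);

cudaError_t err = cudaGetLastError();   // launch-configuration errors
if (err != cudaSuccess)
    fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();          // waits for the kernel to finish
if (err != cudaSuccess)
    fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
```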
What is very strange is that the time needed to copy dev_B back to B is proportional to the number of iterations of the j index.
For example, with Nel = 5 the copy takes about 5 seconds.
When I increase Nel to 20, it takes about 20 seconds.
I would expect the copy time to be independent of the number of inner iterations, and to depend only on the size of the matrix dev_B being transferred.
I would also expect copying the same matrix to and from the CPU to take about the same amount of time.
Do you have any idea what is going wrong?
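One way to see whether the kernel, rather than the copy, accounts for the time is to time the two separately with CUDA events. Below is a minimal self-contained sketch (sizes are assumed example values, and the kernel launch is stubbed out as a comment since its other arguments are not shown here):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

int main() {
    const int Nel = 20, Nface = 4;   // assumed example sizes
    int *B = (int*)calloc(Nel * Nface, sizeof(int));
    int *dev_B;
    cudaMalloc((void**)&dev_B, Nel * Nface * sizeof(int));
    cudaMemcpy(dev_B, B, Nel * Nface * sizeof(int), cudaMemcpyHostToDevice);

    cudaEvent_t start, afterKernel, afterCopy;
    cudaEventCreate(&start);
    cudaEventCreate(&afterKernel);
    cudaEventCreate(&afterCopy);

    cudaEventRecord(start);
    // findNeiborElem<<<Nblocks, Nthreads>>>(dev_B, ...);  // kernel under test
    cudaEventRecord(afterKernel);
    cudaMemcpy(B, dev_B, Nel * Nface * sizeof(int), cudaMemcpyDeviceToHost);
    cudaEventRecord(afterCopy);
    cudaEventSynchronize(afterCopy);

    // If "kernel" dominates, the copy itself is fast and the blocking
    // cudaMemcpy was merely waiting on the still-running kernel.
    float kernelMs, copyMs;
    cudaEventElapsedTime(&kernelMs, start, afterKernel);
    cudaEventElapsedTime(&copyMs, afterKernel, afterCopy);
    printf("kernel: %.3f ms, copy: %.3f ms\n", kernelMs, copyMs);

    cudaFree(dev_B);
    free(B);
    return 0;
}
```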