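Pinned memory is supposed to increase transfer rates from host to device (API reference). However, I found that I do not need to call cuMemcpyHtoD for the kernel to access the values, or cuMemcpyDtoH for the host to read the values back. I didn't think this would work, but it does: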

    #include <cstdio>
    #include <cuda_runtime.h>

    __global__ void testPinnedMemory(double * mem)
    {
        double currentValue = mem[threadIdx.x];
        printf("Thread id: %d, memory content: %f\n", threadIdx.x, currentValue);
        mem[threadIdx.x] = currentValue+10;
    }

    void test()
    {
        const size_t THREADS = 8;
        double * pinnedHostPtr;
        //the size argument is in bytes, so scale the element count by sizeof(double)
        cudaHostAlloc((void **)&pinnedHostPtr, THREADS * sizeof(double), cudaHostAllocDefault);
        //set memory values
        for (size_t i = 0; i < THREADS; ++i)
            pinnedHostPtr[i] = i;
        //call kernel, passing the pinned host pointer directly (no explicit copy)
        dim3 threadsPerBlock(THREADS);
        dim3 numBlocks(1);
        testPinnedMemory<<< numBlocks, threadsPerBlock>>>(pinnedHostPtr);
        //read output on the host (no explicit copy back, no explicit synchronization)
        printf("Data after kernel execution: ");
        for (size_t i = 0; i < THREADS; ++i)
            printf("%f ", pinnedHostPtr[i]);
        printf("\n");
        //release the pinned allocation
        cudaFreeHost(pinnedHostPtr);
    }

Output:

    Data after kernel execution: 10.000000 11.000000 12.000000 13.000000 14.000000 15.000000 16.000000 17.000000
    Thread id: 0, memory content: 0.000000
    Thread id: 1, memory content: 1.000000
    Thread id: 2, memory content: 2.000000
    Thread id: 3, memory content: 3.000000
    Thread id: 4, memory content: 4.000000
    Thread id: 5, memory content: 5.000000
    Thread id: 6, memory content: 6.000000
    Thread id: 7, memory content: 7.000000

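For contrast, this is roughly the explicit-copy path I expected to need. It is only a minimal sketch: it uses the runtime API (cudaMalloc and cudaMemcpy) rather than the driver API calls mentioned at the top, and the names addTenExplicit and runExplicitCopyDemo are made up for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: adds 10 to each element of a device buffer.
__global__ void addTenExplicit(double *mem)
{
    mem[threadIdx.x] += 10.0;
}

// Returns 0 if every step succeeds and the results match, 1 otherwise.
int runExplicitCopyDemo()
{
    const unsigned int THREADS = 8;
    const size_t bytes = THREADS * sizeof(double);
    double *pinned = NULL;  // pinned host staging buffer
    double *dev = NULL;     // separate device buffer

    if (cudaHostAlloc((void **)&pinned, bytes, cudaHostAllocDefault) != cudaSuccess)
        return 1;
    if (cudaMalloc((void **)&dev, bytes) != cudaSuccess) {
        cudaFreeHost(pinned);
        return 1;
    }

    for (unsigned int i = 0; i < THREADS; ++i)
        pinned[i] = i;

    // Explicit host-to-device copy from the pinned buffer
    cudaError_t err = cudaMemcpy(dev, pinned, bytes, cudaMemcpyHostToDevice);
    if (err == cudaSuccess) {
        addTenExplicit<<<1, THREADS>>>(dev);
        // Blocking device-to-host copy; it runs after the kernel in the default stream
        err = cudaMemcpy(pinned, dev, bytes, cudaMemcpyDeviceToHost);
    }

    int ok = (err == cudaSuccess);
    for (unsigned int i = 0; ok && i < THREADS; ++i)
        if (pinned[i] != i + 10.0)
            ok = 0;

    cudaFree(dev);
    cudaFreeHost(pinned);
    return ok ? 0 : 1;
}
```

Calling runExplicitCopyDemo() from main and checking for a zero return value is enough to exercise it.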
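My questions are:
- Is pinned memory zero-copy? I thought only mapped pinned memory was zero-copy.
- If it is zero-copy, why is there an explicit way to map it to the device (cudaHostAlloc with the cudaHostAllocMapped option)?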
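
To make the second question concrete, here is a minimal sketch of the explicitly mapped path: allocate with cudaHostAllocMapped, then ask for the device-side alias with cudaHostGetDevicePointer. The names addTen and runMappedDemo are hypothetical, and the cudaSetDeviceFlags call is treated as best effort, on the assumption that some setups need it before the context is created while others already enable mapping implicitly.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel: adds 10 to each element through the device alias.
__global__ void addTen(double *mem)
{
    mem[threadIdx.x] += 10.0;
}

// Returns 0 if every step succeeds and the results match, 1 otherwise.
int runMappedDemo()
{
    const unsigned int THREADS = 8;
    double *hostPtr = NULL;  // pinned, mapped host allocation
    double *devPtr = NULL;   // device-side alias of the same memory

    // Best effort: request host-memory mapping, then clear any error
    // returned because a context is already active.
    cudaSetDeviceFlags(cudaDeviceMapHost);
    cudaGetLastError();

    if (cudaHostAlloc((void **)&hostPtr, THREADS * sizeof(double),
                      cudaHostAllocMapped) != cudaSuccess)
        return 1;
    if (cudaHostGetDevicePointer((void **)&devPtr, hostPtr, 0) != cudaSuccess) {
        cudaFreeHost(hostPtr);
        return 1;
    }

    for (unsigned int i = 0; i < THREADS; ++i)
        hostPtr[i] = i;

    addTen<<<1, THREADS>>>(devPtr);
    // Wait for the kernel before the host reads the shared buffer
    cudaError_t err = cudaDeviceSynchronize();

    int ok = (err == cudaSuccess);
    for (unsigned int i = 0; ok && i < THREADS; ++i)
        if (hostPtr[i] != i + 10.0)
            ok = 0;

    cudaFreeHost(hostPtr);
    return ok ? 0 : 1;
}
```

Unlike my test above, this version synchronizes before reading, so it does not rely on the kernel happening to finish before the host loop runs.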
I am using CUDA Toolkit 5.5 and a Quadro 4000 with the driver set to TCC mode, and I compile with the options sm_20,compute_20.
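
Since the behaviour may depend on the setup, one thing that can be checked is whether the device reports unified addressing and host-memory mapping. The CUDA runtime documentation describes memory from cudaHostAlloc as directly accessible from devices that support unified addressing, which might be what is happening here, though that is an assumption on my part. A minimal sketch, where reportHostMappingSupport is a hypothetical helper name:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Prints the two device properties relevant to accessing pinned host
// memory from a kernel. Returns 0 on success, 1 if the query fails.
int reportHostMappingSupport(int device)
{
    cudaDeviceProp prop;
    if (cudaGetDeviceProperties(&prop, device) != cudaSuccess)
        return 1;

    printf("Device %d: %s\n", device, prop.name);
    // Nonzero if the device shares a unified address space with the host
    printf("unifiedAddressing: %d\n", prop.unifiedAddressing);
    // Nonzero if the device can map host memory into its address space
    printf("canMapHostMemory:  %d\n", prop.canMapHostMemory);
    return 0;
}
```

Calling reportHostMappingSupport(0) prints the values for the first device; I have not included the output because it depends on the GPU, driver mode, and platform.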