我的新解决方案是避免与设备同步。我已经研究了一些检索时间戳的方法,结果看起来不错,我相当确定比较是足够公平的。我比较了我的CUDA时间(事件记录 vs. QPC),差异很小,似乎是一个恒定的开销。
CUDA Event Host QPC
4,6 30,0
4,8 30,0
5,0 31,0
5,2 32,0
5,6 34,0
6,1 34,0
6,9 31,0
8,3 47,0
9,2 34,0
12,0 39,0
16,7 46,0
20,5 55,0
32,1 69,0
48,5 111,0
86,0 134,0
182,4 237,0
419,0 473,0
如果我的问题引起了某个人想要找到如何进行gpgpu基准测试的希望,我将留下一些代码,展示我当前的基准测试策略。
代码示例,CUDA。
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
float milliseconds = 0;
cudaEventRecord(start);
...
...
cudaEventRecord(stop);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&milliseconds, start, stop);
OpenCL
cl_event start_event, end_event;
cl_ulong start = 0, end = 0;
clEnqueueNDRangeKernel(..., &start_event);
...
...
clEnqueueNDRangeKernel(..., &end_event);
clWaitForEvents(1, &end_event);
clGetEventProfilingInfo(start_event, CL_PROFILING_COMMAND_START, sizeof(cl_ulong), &start, NULL);
clGetEventProfilingInfo(end_event, CL_PROFILING_COMMAND_END, sizeof(cl_ulong), &end, NULL);
timeInMS = (double)(end - start)*(double)(1e-06);
DirectCompute
这里我遵循了Adam Miles的建议,并研究了该源代码。看起来会是这样的:
ID3D11Device* device = nullptr;
...
...
ID3D11QueryPtr disjoint_query;
ID3D11QueryPtr q_start;
ID3D11QueryPtr q_end;
...
if (disjoint_query == NULL)
{
D3D11_QUERY_DESC desc;
desc.Query = D3D11_QUERY_TIMESTAMP_DISJOINT;
desc.MiscFlags = 0;
device->CreateQuery(&desc, &disjoint_query);
desc.Query = D3D11_QUERY_TIMESTAMP;
device->CreateQuery(&desc, &q_start);
device->CreateQuery(&desc, &q_end);
}
context->Begin(disjoint_query);
context->End(q_start);
...
...
context->End(q_end);
context->End(disjoint_query);
UINT64 start, end;
D3D11_QUERY_DATA_TIMESTAMP_DISJOINT q_freq;
while (S_OK != context->GetData(q_start, &start, sizeof(UINT64), 0)){};
while (S_OK != context->GetData(q_end, &end, sizeof(UINT64), 0)){};
while (S_OK != context->GetData(disjoint_query, &q_freq, sizeof(D3D11_QUERY_DATA_TIMESTAMP_DISJOINT), 0)){};
timeInMS = (((double)(end - start)) / ((double)q_freq.Frequency)) * 1000.0;
C/C++/OpenMP
static LARGE_INTEGER StartingTime, EndingTime, ElapsedMicroseconds, Frequency;
static void __inline startTimer()
{
QueryPerformanceFrequency(&Frequency);
QueryPerformanceCounter(&StartingTime);
}
static double __inline stopTimer()
{
QueryPerformanceCounter(&EndingTime);
ElapsedMicroseconds.QuadPart = EndingTime.QuadPart - StartingTime.QuadPart;
ElapsedMicroseconds.QuadPart *= 1000000;
ElapsedMicroseconds.QuadPart /= Frequency.QuadPart;
return (double)ElapsedMicroseconds.QuadPart;
}
我的代码示例是摘自上下文,我尝试进行了一些清理,但可能存在错误。