The key to exploiting the GPU's parallelism in this case is to let it manage the outer loop for us. Rather than invoking the kernel once for the entire data vector, we invoke it once per element of the data vector. The kernel function simplifies to:
kernel void convolve(const device float *dataVector [[ buffer(0) ]],
                     constant int &dataSize [[ buffer(1) ]],
                     constant float *filterVector [[ buffer(2) ]],
                     constant int &filterSize [[ buffer(3) ]],
                     device float *outVector [[ buffer(4) ]],
                     uint id [[ thread_position_in_grid ]])
{
    // Each thread computes a single output element: the dot product of the
    // filter with the window of data starting at its grid position. Because
    // the buffers are padded (see below), no bounds check is needed even for
    // threads that overrun the real data.
    float sum = 0.0;
    for (int i = 0; i < filterSize; ++i) {
        sum += dataVector[id + i] * filterVector[i];
    }
    outVector[id] = sum;
}
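The dispatch code below references a computePipeline and a commandBuffer. As a minimal sketch (assuming the kernel above is compiled into the app's default library; the variable names are mine), they might be created like this:

import Metal

// Grab the default GPU and compile a pipeline state for the "convolve" kernel.
let device = MTLCreateSystemDefaultDevice()!
let library = device.makeDefaultLibrary()!
let convolveFunction = library.makeFunction(name: "convolve")!
let computePipeline = try device.makeComputePipelineState(function: convolveFunction)

// A command queue and a command buffer to encode the dispatch into.
let commandQueue = device.makeCommandQueue()!
let commandBuffer = commandQueue.makeCommandBuffer()!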
To execute this work, we choose a threadgroup size based on the thread execution width recommended by the compute pipeline state. The one tricky thing here is ensuring there is enough padding in the input and output buffers so that we can slightly overrun the actual size of the data. This does cost us a small amount of wasted memory and computation, but it spares us the complexity of doing a separate dispatch just to compute the convolution for the elements at the end of the buffer.
// We should ensure here that the data buffer and output buffer each have a
// size that is a multiple of the compute pipeline's threadExecutionWidth, by
// padding the amount we allocate for each of them. After execution, we ignore
// the extraneous elements in the output buffer beyond the first
// (dataCount - filterCount + 1).
let iterationCount = dataCount - filterCount + 1
let threadsPerThreadgroup = MTLSize(width: min(iterationCount, computePipeline.threadExecutionWidth), height: 1, depth: 1)
let threadgroups = (iterationCount + threadsPerThreadgroup.width - 1) / threadsPerThreadgroup.width
let threadgroupsPerGrid = MTLSize(width: threadgroups, height: 1, depth: 1)

// The kernel declares the sizes as 32-bit ints, so pass Int32 copies rather
// than Swift's 8-byte Int.
var dataSize = Int32(dataCount)
var filterSize = Int32(filterCount)

let commandEncoder = commandBuffer.makeComputeCommandEncoder()!
commandEncoder.setComputePipelineState(computePipeline)
commandEncoder.setBuffer(dataBuffer, offset: 0, index: 0)
commandEncoder.setBytes(&dataSize, length: MemoryLayout<Int32>.stride, index: 1)
commandEncoder.setBuffer(filterBuffer, offset: 0, index: 2)
commandEncoder.setBytes(&filterSize, length: MemoryLayout<Int32>.stride, index: 3)
commandEncoder.setBuffer(outBuffer, offset: 0, index: 4)
commandEncoder.dispatchThreadgroups(threadgroupsPerGrid, threadsPerThreadgroup: threadsPerThreadgroup)
commandEncoder.endEncoding()
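The comment at the top of that listing assumes dataBuffer and outBuffer were allocated with padding. One way to do that, as a sketch (assuming data: [Float] holds the input samples and device is the MTLDevice; padding by a full extra threadExecutionWidth is slightly generous, but keeps the arithmetic simple and guarantees the overrunning threads stay in bounds):

// Round the allocation up past dataCount by at least one full execution width,
// so threads past the real data read and write only into zero padding.
let w = computePipeline.threadExecutionWidth
let paddedCount = ((dataCount / w) + 2) * w
let paddedData = data + [Float](repeating: 0, count: paddedCount - dataCount)
let dataBuffer = device.makeBuffer(bytes: paddedData,
                                   length: paddedCount * MemoryLayout<Float>.stride,
                                   options: [])!
let outBuffer = device.makeBuffer(length: paddedCount * MemoryLayout<Float>.stride,
                                  options: [])!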
In my experiments, this parallelized approach runs 400-1000x faster than the serial version in the question. I'd be curious to hear how it compares against your CPU implementation.
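For reference, here is a straightforward serial Swift version to time against; the function name and signature are mine, not taken from the question:

// Same computation as the kernel, with the outer loop run on the CPU.
func convolveSerial(data: [Float], filter: [Float]) -> [Float] {
    let outCount = data.count - filter.count + 1
    var out = [Float](repeating: 0, count: outCount)
    for i in 0..<outCount {
        var sum: Float = 0
        for j in 0..<filter.count {
            sum += data[i + j] * filter[j]
        }
        out[i] = sum
    }
    return out
}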