金属计算管线速度极慢

4
我看到使用Metal计算管道来提高应用程序性能的机会。然而,我的初步测试显示计算管道非常缓慢(至少在旧设备上)。
因此,我进行了一个样例项目,比较计算管道和渲染管道的性能。该程序将一个2048 x 2048的源纹理转换为灰度图像,并输出到目标纹理中。
在iPhone 5S上,片段着色器需要3毫秒完成转换。然而,计算内核需要177毫秒完成相同的操作。这是59倍的时间差!

enter image description here

你在老设备上使用计算管线的经验如何?它不会非常缓慢吧?
以下是我的代码片段和计算函数:
// Grayscale Fragment Function
fragment half4 grayscaleFragment(RasterizerData in [[stage_in]],
                                 texture2d<half> inTexture [[texture(0)]])
{
    constexpr sampler textureSampler;

    half4 inColor  = inTexture.sample(textureSampler, in.textureCoordinate);
    half  gray     = dot(inColor.rgb, kRec709Luma);
    return half4(gray, gray, gray, 1.0);
}


// Grayscale Kernel Function
kernel void grayscaleKernel(uint2 gid [[thread_position_in_grid]],
                            texture2d<half, access::read> inTexture [[texture(0)]],
                            texture2d<half, access::write> outTexture [[texture(1)]])
{
    half4 inColor  = inTexture.read(gid);
    half  gray     = dot(inColor.rgb, kRec709Luma);
    outTexture.write(half4(gray, gray, gray, 1.0), gid);
}

计算和渲染方法

- (void)compute {

    id<MTLCommandBuffer> commandBuffer = [_commandQueue commandBuffer];

    // Compute encoder
    id<MTLComputeCommandEncoder> computeEncoder = [commandBuffer computeCommandEncoder];
    [computeEncoder setComputePipelineState:_computePipelineState];
    [computeEncoder setTexture:_srcTexture atIndex:0];
    [computeEncoder setTexture:_dstTexture atIndex:1];
    [computeEncoder dispatchThreadgroups:_threadgroupCount threadsPerThreadgroup:_threadgroupSize];
    [computeEncoder endEncoding];

    [commandBuffer commit];

    [commandBuffer waitUntilCompleted];
}


- (void)render {

    id<MTLCommandBuffer> commandBuffer = [_commandQueue commandBuffer];

    // Render pass descriptor
    MTLRenderPassDescriptor *renderPassDescriptor = [MTLRenderPassDescriptor renderPassDescriptor];
    renderPassDescriptor.colorAttachments[0].loadAction = MTLLoadActionDontCare;
    renderPassDescriptor.colorAttachments[0].texture = _dstTexture;
    renderPassDescriptor.colorAttachments[0].storeAction = MTLStoreActionStore;

    // Render encoder
    id<MTLRenderCommandEncoder> renderEncoder = [commandBuffer renderCommandEncoderWithDescriptor:renderPassDescriptor];
    [renderEncoder setRenderPipelineState:_renderPipelineState];
    [renderEncoder setFragmentTexture:_srcTexture atIndex:0];
    [renderEncoder drawPrimitives:MTLPrimitiveTypeTriangleStrip vertexStart:0 vertexCount:4];
    [renderEncoder endEncoding];

    [commandBuffer commit];

    [commandBuffer waitUntilCompleted];
}

和 Metal 设置:

- (void)setupMetal
{
    // Get metal device
    _device = MTLCreateSystemDefaultDevice();

    // Create the command queue
    _commandQueue = [_device newCommandQueue];

    id<MTLLibrary> defaultLibrary = [_device newDefaultLibrary];

    // Create compute pipeline state
    _computePipelineState = [_device newComputePipelineStateWithFunction:[defaultLibrary newFunctionWithName:@"grayscaleKernel"] error:nil];

    // Create render pipeline state
    MTLRenderPipelineDescriptor *pipelineStateDescriptor = [[MTLRenderPipelineDescriptor alloc] init];
    pipelineStateDescriptor.vertexFunction = [defaultLibrary newFunctionWithName:@"vertexShader"];
    pipelineStateDescriptor.fragmentFunction = [defaultLibrary newFunctionWithName:@"grayscaleFragment"];
    pipelineStateDescriptor.colorAttachments[0].pixelFormat = MTLPixelFormatBGRA8Unorm;
    _renderPipelineState = [_device newRenderPipelineStateWithDescriptor:pipelineStateDescriptor error:nil];

    // Create source and destination texture descriptor
    // Since the compute kernel function doesn't check if pixels are within the bounds of the destination texture, make sure texture width
    // and height are multiples of the pipeline threadExecutionWidth and (threadExecutionWidth / maxTotalThreadsPerThreadgroup) respectivly.
    MTLTextureDescriptor *textureDescriptor = [MTLTextureDescriptor texture2DDescriptorWithPixelFormat:MTLPixelFormatBGRA8Unorm
                                                                                                 width:2048
                                                                                                height:2048
                                                                                             mipmapped:NO];
    // Create source texture
    textureDescriptor.usage = MTLTextureUsageShaderRead;
    _srcTexture = [_device newTextureWithDescriptor:textureDescriptor];

    // Create description texture
    textureDescriptor.usage = MTLTextureUsageShaderWrite | MTLTextureUsageRenderTarget;
    _dstTexture = [_device newTextureWithDescriptor:textureDescriptor];

    // Set the compute kernel's threadgroup size
    NSUInteger threadWidth = _computePipelineState.threadExecutionWidth;
    NSUInteger threadMax = _computePipelineState.maxTotalThreadsPerThreadgroup;
    _threadgroupSize = MTLSizeMake(threadWidth, threadMax / threadWidth, 1);

     // Set the compute kernel's threadgroup count
    _threadgroupCount.width  = (_srcTexture.width  + _threadgroupSize.width -  1) / _threadgroupSize.width;
    _threadgroupCount.height = (_srcTexture.height + _threadgroupSize.height - 1) / _threadgroupSize.height;
    _threadgroupCount.depth = 1;
}

只是提供信息,当进行渲染时,您通常不会像那样调用waitUntilCompleted,仅以异步方式进行渲染,您仍然可以在GPU捕获界面中检查执行时间。 - MoDJ
1个回答

4
金属计算管道在A7级别的CPU / GPU设备上无法使用。相同的计算管道在A8和更新的设备上具有很好的性能。您处理此问题的选项是为A7设备创建片段着色器实现,并针对所有更新的设备使用计算逻辑,或者将计算导出到该设备类别中的至少2个CPU上。您也可以仅对所有设备使用所有片段着色器,但是计算内核可以在复杂代码的情况下获得更好的性能,因此这是需要考虑的一些内容。

1
感谢确认!在最新的设备上,两个管线执行的时间相同,但即使在iPhone 7 Plus上,片段仍然比计算管线快两倍。 :-( 我很想知道哪些情况下计算管线更快。 - Yoshi
1
计算内核的速度取决于执行的操作和顺序,因此可以显著提高性能。以我在Metal中实现的Rice熵解码器为例,这个计算内核的运行速度比之前基于片段着色器的实现快了约2倍,因为可以在像素边界上缓存中间结果。请参考我的GitHub链接:https://github.com/mdejong/MetalRice - MoDJ

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接