How can I optimize copying chunks of an array in C#?

16

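I am writing a real-time video imaging application and need to speed up this method. It currently takes about 10 ms to execute, and I would like to get it down to 2-3 ms.

I have tried Array.Copy and Buffer.BlockCopy, and both take about 30 ms, which is 3x slower than copying manually.

One idea is to copy the 4 bytes as an integer and then paste them as an integer, reducing 4 lines of code to 1. However, I am not sure how to do that.

Another idea is to use pointers and unsafe code, but I am not sure how to do that either.

Any help is greatly appreciated. Thank you!

EDIT: The array sizes are: inputBuffer [327680], lookupTable [16384], outputBuffer [1310720]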

public byte[] ApplyLookupTableToBuffer(byte[] lookupTable, ushort[] inputBuffer)
{
    System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
    sw.Start();

    // Precalculate and initialize the variables
    int lookupTableLength = lookupTable.Length;
    int bufferLength = inputBuffer.Length;
    byte[] outputBuffer = new byte[bufferLength * 4];
    int outIndex = 0;
    int curPixelValue = 0;

    // For each pixel in the input buffer...
    for (int curPixel = 0; curPixel < bufferLength; curPixel++)
    {
        outIndex = curPixel * 4;                    // Calculate the corresponding index in the output buffer
        curPixelValue = inputBuffer[curPixel] * 4;  // Retrieve the pixel value and multiply by 4 since the lookup table has 4 values (blue/green/red/alpha) for each pixel value

        // If the multiplied pixel value falls within the lookup table...
        if ((curPixelValue + 3) < lookupTableLength)
        {
            // Copy the lookup table value associated with the value of the current input buffer location to the output buffer
            outputBuffer[outIndex + 0] = lookupTable[curPixelValue + 0];
            outputBuffer[outIndex + 1] = lookupTable[curPixelValue + 1];
            outputBuffer[outIndex + 2] = lookupTable[curPixelValue + 2];
            outputBuffer[outIndex + 3] = lookupTable[curPixelValue + 3];

            //System.Buffer.BlockCopy(lookupTable, curPixelValue, outputBuffer, outIndex, 4);   // Takes 2-10x longer than just copying the values manually
            //Array.Copy(lookupTable, curPixelValue, outputBuffer, outIndex, 4);                // Takes 2-10x longer than just copying the values manually
        }
    }

    Debug.WriteLine("ApplyLookupTableToBuffer(ms): " + sw.Elapsed.TotalMilliseconds.ToString("N2"));
    return outputBuffer;
}

EDIT: I have updated the method, keeping the same variable names, so that others can see how the code translates based on HABJAN's solution below.

    public byte[] ApplyLookupTableToBufferV2(byte[] lookupTable, ushort[] inputBuffer)
    {
        System.Diagnostics.Stopwatch sw = new System.Diagnostics.Stopwatch();
        sw.Start();

        // Precalculate and initialize the variables
        int lookupTableLength = lookupTable.Length;
        int bufferLength = inputBuffer.Length;
        byte[] outputBuffer = new byte[bufferLength * 4];
        //int outIndex = 0;
        int curPixelValue = 0;

        unsafe
        {
            fixed (byte* pointerToOutputBuffer = &outputBuffer[0])
            fixed (byte* pointerToLookupTable = &lookupTable[0])
            {
                // Cast to integer pointers since groups of 4 bytes get copied at once
                uint* lookupTablePointer = (uint*)pointerToLookupTable;
                uint* outputBufferPointer = (uint*)pointerToOutputBuffer;

                // For each pixel in the input buffer...
                for (int curPixel = 0; curPixel < bufferLength; curPixel++)
                {
                    // No need to multiply by 4 on the following 2 lines since the pointers are for integers, not bytes
                    // outIndex = curPixel;  // This line is commented since we can use curPixel instead of outIndex
                    curPixelValue = inputBuffer[curPixel];  // Retrieve the pixel value 

                    if (((curPixelValue * 4) + 3) < lookupTableLength)  // curPixelValue now indexes 4-byte entries, so scale it back to bytes for the bounds check
                    {
                        outputBufferPointer[curPixel] = lookupTablePointer[curPixelValue];
                    }
                }
            }
        }

        Debug.WriteLine("2 ApplyLookupTableToBuffer(ms): " + sw.Elapsed.TotalMilliseconds.ToString("N2"));
        return outputBuffer;
    }

3
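A real-time video imaging application in pure C# doing byte[] operations? No way... sooner or later you will hit a performance bottleneck, so I strongly recommend learning Interop and/or C++/CLI. - Konrad Kokosa
Yes, it may need something more "low level". Before going there, you could try rewriting the code with pointer arithmetic in an unsafe block and measuring again, but there are no easy performance guarantees here. - Patryk Ćwiek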
1
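In my experience, converting code to unsafe buys about a 10% performance improvement. Your mileage may vary, of course. - Robert Harvey
Please provide information about your input arrays (sizes, etc.). - ken2k
@PatrykĆwiek: Could you provide an example of pointer arithmetic? I would like to try it but am not sure where to start. Thanks. - nb1forxp
@ken2k: I have updated the question with the array sizes. Thanks! - nb1forxp
1 Answer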

14

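I ran some tests and got the best speed by converting my code to unsafe and using the RtlMoveMemory API. I found that both Buffer.BlockCopy and Array.Copy were much slower than calling RtlMoveMemory directly.

So, in the end, you would have something like this: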

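// Pin both arrays so the garbage collector cannot move them during the copy
fixed (byte* ptrOutput = &outputBuffer[0])
fixed (byte* ptrInput = &lookupTable[0])
{
    MoveMemory(ptrOutput, ptrInput, 4);
}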

[DllImport("Kernel32.dll", EntryPoint = "RtlMoveMemory", SetLastError = false)]
private static unsafe extern void MoveMemory(void* dest, void* src, int size);

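EDIT:

OK, now that I understand your logic and have run some tests, I managed to speed up your method by roughly 50%. Since you need to copy small blocks of data (always 4 bytes), you were right: RtlMoveMemory does not help here, and it is better to copy the data as integers. Here is the final solution I came up with: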

public static byte[] ApplyLookupTableToBufferV2(byte[] lookupTable, ushort[] inputBuffer)
{
    int lookupTableLength = lookupTable.Length;
    int bufferLength = inputBuffer.Length;
    byte[] outputBuffer = new byte[bufferLength * 4];
    int outIndex = 0, curPixelValue = 0;

    unsafe
    {
        fixed (byte* ptrOutput = &outputBuffer[0])
        fixed (byte* ptrLookup = &lookupTable[0])
        {
            uint* lkp = (uint*)ptrLookup;
            uint* opt = (uint*)ptrOutput;

            for (int index = 0; index < bufferLength; index++)
            {
                outIndex = index;
                curPixelValue = inputBuffer[index];

                if (((curPixelValue * 4) + 3) < lookupTableLength)  // curPixelValue indexes 4-byte entries, so scale it back to bytes for the bounds check
                {
                    opt[outIndex] = lkp[curPixelValue];
                }
            }
        }
    }

    return outputBuffer;
}
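
For comparison, the same "copy 4 bytes as one integer" idea can be sketched without unsafe code by reinterpreting the byte arrays as spans of uint. This is only a sketch, not part of the measured solution: it assumes a runtime where Span<T> and MemoryMarshal.Cast are available (.NET Core 2.1 or later, or the System.Memory package), the class and method names are hypothetical, and its performance relative to the pointer version has not been measured here. The bounds check is expressed in 4-byte entries rather than bytes.

```csharp
using System;
using System.Runtime.InteropServices;

public static class LookupTableSketch
{
    // Hypothetical name; a span-based take on the same lookup, without unsafe code.
    public static byte[] ApplyLookupTableWithSpans(byte[] lookupTable, ushort[] inputBuffer)
    {
        byte[] outputBuffer = new byte[inputBuffer.Length * 4];

        // Reinterpret both byte arrays as sequences of 32-bit pixels.
        // Cast drops any trailing bytes that do not form a whole uint.
        ReadOnlySpan<uint> lookup = MemoryMarshal.Cast<byte, uint>(lookupTable.AsSpan());
        Span<uint> output = MemoryMarshal.Cast<byte, uint>(outputBuffer.AsSpan());

        for (int i = 0; i < inputBuffer.Length; i++)
        {
            int pixelValue = inputBuffer[i];

            // lookup.Length counts whole 4-byte entries, so this check is in entry units.
            if (pixelValue < lookup.Length)
            {
                output[i] = lookup[pixelValue];
            }
        }

        return outputBuffer;
    }
}
```

Pixels whose value falls outside the table are left as zero bytes, matching the behavior of the original loop.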

I renamed your method to ApplyLookupTableToBufferV1.

Here are my test results:

int tc1 = Environment.TickCount;

for (int i = 0; i < 200; i++)
{
    byte[] a = ApplyLookupTableToBufferV1(lt, ib);
}

tc1 = Environment.TickCount - tc1;

Console.WriteLine("V1: " + tc1.ToString() + "ms");

Result - V1: 998 ms

int tc2 = Environment.TickCount;

for (int i = 0; i < 200; i++)
{
    byte[] a = ApplyLookupTableToBufferV2(lt, ib);
}

tc2 = Environment.TickCount - tc2;

Console.WriteLine("V2: " + tc2.ToString() + "ms");

Result - V2: 473 ms
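
Environment.TickCount is limited to the resolution of the system timer (typically 10 to 16 ms), which is acceptable for totals of several hundred milliseconds like these but too coarse for timing a single call. A minimal sketch of an alternative harness based on Stopwatch follows; the helper name and the warm-up call are assumptions added for illustration, not part of the tests above.

```csharp
using System;
using System.Diagnostics;

public static class TimingSketch
{
    // Hypothetical helper: returns the average milliseconds per call over `iterations` runs.
    public static double AverageMilliseconds(Action action, int iterations)
    {
        // Warm-up call so JIT compilation is not included in the measurement.
        action();

        Stopwatch sw = Stopwatch.StartNew();
        for (int i = 0; i < iterations; i++)
        {
            action();
        }
        sw.Stop();

        return sw.Elapsed.TotalMilliseconds / iterations;
    }
}
```

It could be called as TimingSketch.AverageMilliseconds(() => ApplyLookupTableToBufferV2(lt, ib), 200), with lt and ib being the same lookup table and input buffer used in the loops above.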


Thanks for the suggestion. I implemented it like this: fixed (byte* ptrOutput = &outputBuffer[outIndex]) { fixed (byte* ptrInput = &lookupTable[curPixelValue]) { MoveMemory(ptrOutput, ptrInput, 4); } } Unfortunately, this approach is much slower; it takes 40-100 ms to execute, averaging about 60 ms. - nb1forxp
2
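Awesome!! This is exactly what I was hoping for. Thank you very much. The time dropped from about 10 ms to an average of 4.3 ms. If I remove the lookupTableLength check, it drops to an average of 3.3 ms. - nb1forxp
Glad I could help. - HABJAN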
