Fast method to copy memory with translation - ARGB to BGR

Question

66

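Overview

I have an image buffer that I need to convert to another format. The source image buffer has four channels, 8 bits per channel: Alpha, Red, Green, and Blue. The destination buffer has three channels, 8 bits per channel: Blue, Green, and Red.

So the brute-force method is: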

// Assume a 32 x 32 pixel image
#define IMAGESIZE (32*32)

typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;

ARGB orig[IMAGESIZE];
BGR  dest[IMAGESIZE];

for(x = 0; x < IMAGESIZE; x++)
{
     dest[x].Red = orig[x].Red;
     dest[x].Green = orig[x].Green;
     dest[x].Blue = orig[x].Blue;
}

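However, I need more speed than a loop and three byte copies provide. I'm hoping there are a few tricks I can use to reduce the number of memory reads and writes, given that I'm running on a 32-bit machine.

Additional info

Every image is a multiple of at least 4 pixels. So we could address 16 ARGB bytes and move them into 12 RGB bytes per loop. Perhaps this fact can be used to speed things up, especially as it falls nicely on 32-bit boundaries.

I have access to OpenCL. While that requires moving the entire buffer into GPU memory and then moving the result back out, OpenCL can work on many portions of the image simultaneously, and large memory block moves are actually quite efficient, so this may be worth exploring.

While I've given the example of small buffers above, I am really moving HD video (1920x1080) and sometimes larger, mostly smaller, buffers around, so while a 32x32 case may be trivial, copying 8.3MB of image data byte by byte is really, really bad.

I'm running on Intel processors (Core 2 and above), so there are streaming and data-processing instructions that I know exist but don't know about. Pointers on where to look for specialized data-handling instructions would be helpful.

This is going into an OS X application, and I'm using XCode 4. If assembly is painless and the obvious way to go, I'm fine going down that path, but not having done it on this setup before makes me wary of sinking too much time into it.

Pseudo-code is fine. I'm not looking for a complete solution, just the algorithm and an explanation of any trickery that might not be immediately clear.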

- Adam Davis

3

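Does the compiler align BGR to a DWORD? - marinara

1

@marinara No, it is byte-aligned. - Adam Davis

1

Using a GPU for this makes no sense unless the data already arrives in the system from the GPU. You should be able to saturate the memory bus with the CPU. - Stephan Eggermont

I haven't played with anyone's code, but as far as I can see nobody has mentioned the possibility of the equivalent of

for(x = 0; x < IMAGESIZE; x++) { dest[x].Red = orig[x].Red; }
for(x = 0; x < IMAGESIZE; x++) { dest[x].Green = orig[x].Green; }
for(x = 0; x < IMAGESIZE; x++) { dest[x].Blue = orig[x].Blue; }

Do simple loops beat bit manipulation in this case? - Mark Hurd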

11 Answers

25

The obvious approach, using pshufb.

#include <assert.h>
#include <inttypes.h>
#include <tmmintrin.h>

// needs:
// orig is 16-byte aligned
// imagesize is a multiple of 4
// dest has 4 trailing scratch bytes
void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
    assert((uintptr_t)orig % 16 == 0);
    assert(imagesize % 4 == 0);
    __m128i mask = _mm_set_epi8(-128, -128, -128, -128, 13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3);
    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 16, dest += 12) {
        _mm_storeu_si128((__m128i *)dest, _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), mask));
    }
}

- just a poseur
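To make the magic numbers in the shuffle mask easier to follow, here is a minimal scalar sketch of what pshufb does with that mask on one 16-byte block. It is not the SSSE3 code itself: `pshufb_model`, `argb_to_bgr_ctl`, and `convert4_model` are hypothetical names introduced only for illustration, and the table lists the mask lowest byte first, which is the reverse of the `_mm_set_epi8` argument order.

```c
#include <assert.h>
#include <stdint.h>

/* Scalar model of PSHUFB on one 16-byte block: a control byte with its
   top bit set produces 0; otherwise its low 4 bits pick a source byte. */
static void pshufb_model(const uint8_t src[16], const uint8_t ctl[16],
                         uint8_t out[16]) {
    assert(src != out);  /* shuffling in place would read overwritten bytes */
    for (int i = 0; i < 16; i++)
        out[i] = (ctl[i] & 0x80) ? 0 : src[ctl[i] & 0x0F];
}

/* The mask from the answer, lowest byte first. Each ARGB pixel occupies
   offsets 4p+0..4p+3 (A, R, G, B), so Blue is at 4p+3, Green at 4p+2,
   Red at 4p+1. 0x80 is the same byte as -128. */
static const uint8_t argb_to_bgr_ctl[16] = {
    3, 2, 1,                 /* pixel 0: Blue, Green, Red */
    7, 6, 5,                 /* pixel 1 */
    11, 10, 9,               /* pixel 2 */
    15, 14, 13,              /* pixel 3 */
    0x80, 0x80, 0x80, 0x80   /* four zeroed scratch bytes */
};

/* Converts exactly four ARGB pixels (16 bytes) into 12 BGR bytes followed
   by 4 zero bytes, mirroring one iteration of the SSSE3 loop. */
static void convert4_model(const uint8_t argb[16], uint8_t bgr_plus_pad[16]) {
    pshufb_model(argb, argb_to_bgr_ctl, bgr_plus_pad);
}
```

The four trailing zero bytes are why the SSSE3 version asks for 4 scratch bytes after the destination: every 16-byte store carries 12 useful bytes, and the next store overwrites the padding.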

1

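+1 This is almost certainly optimal. But it might be possible to get the compiler to generate the same or similar code without using non-portable intrinsics... - R.. GitHub STOP HELPING ICE

1

I would appreciate an explanation of the magic numbers in _mm_set_epi8. - deceleratedcaviar

1

@Daniel, take a look at my answer. - MSN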

16

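Combining just a poseur's and Jitamaro's answers: if you assume that the inputs and outputs are 16-byte aligned and you process pixels 4 at a time, you can use a combination of shuffles, masks, ANDs, and ORs to store using aligned stores. The main idea is to generate four intermediate data sets, then OR them together with masks to select the relevant pixel values and write out 3 16-byte sets of pixel data. Note that I did not compile or try to run this code.

EDIT2: More detail on the underlying code structure:

With SSE2, you get better performance with 16-byte aligned reads and writes of 16 bytes. Since your 3-byte pixels only line up with a 16-byte boundary once every 16 pixels, we batch up 16 pixels at a time, using a combination of shuffles, masks, and ORs on 16 input pixels.

From LSB to MSB, and ignoring the specific components, the inputs look like this: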

s[0]: 0000 0000 0000 0000
s[1]: 1111 1111 1111 1111
s[2]: 2222 2222 2222 2222
s[3]: 3333 3333 3333 3333

and the outputs like this:

d[0]: 000 000 000 000 111 1
d[1]:  11 111 111 222 222 22
d[2]:   2 222 333 333 333 333

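So to generate those outputs, you need to do the following (I will specify the actual transformations later):

d[0]= combine_0(f_0_low(s[0]), f_0_high(s[1]))
d[1]= combine_1(f_1_low(s[1]), f_1_high(s[2]))
d[2]= combine_2(f_2_low(s[2]), f_2_high(s[3]))

Now, what should combine_<x> look like? If we assume that d is merely s compacted together, we can concatenate two s's with a mask and an OR:

combine_x(left, right)= (left & mask(x)) | (right & ~mask(x))

where 1 means select the left pixel and 0 means select the right pixel:

mask(0)= 111 111 111 111 000 0
mask(1)=  11 111 111 000 000 00
mask(2)=   1 111 000 000 000 000

But the actual transformations (f_<x>_low, f_<x>_high) are not that simple. Since we are reversing and removing bytes from the source pixels, the actual transformation for the first destination is: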

d[0]= 
    s[0][0].Blue s[0][0].Green s[0][0].Red 
    s[0][1].Blue s[0][1].Green s[0][1].Red 
    s[0][2].Blue s[0][2].Green s[0][2].Red 
    s[0][3].Blue s[0][3].Green s[0][3].Red
    s[1][0].Blue s[1][0].Green s[1][0].Red
    s[1][1].Blue

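If you translate the above into byte offsets from source to destination, you get:

d[0]=
    &s[0]+3 &s[0]+2 &s[0]+1
    &s[0]+7 &s[0]+6 &s[0]+5
    &s[0]+11 &s[0]+10 &s[0]+9
    &s[0]+15 &s[0]+14 &s[0]+13
    &s[1]+3 &s[1]+2 &s[1]+1
    &s[1]+7

(If you look at all the s[0] offsets, they match just a poseur's shuffle mask in reverse order.)

Now we can generate a shuffle mask that maps each source byte to a destination byte (X means we don't care what that value is):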

f_0_low=  3 2 1  7 6 5  11 10 9  15 14 13  X X X  X
f_0_high= X X X  X X X   X  X X   X  X  X  3 2 1  7

f_1_low=    6 5  11 10 9  15 14 13  X X X   X X X  X  X
f_1_high=   X X   X  X X   X  X  X  3 2 1   7 6 5  11 10

f_2_low=      9  15 14 13  X  X  X  X X X   X  X  X  X  X  X
f_2_high=     X   X  X  X  3  2  1  7 6 5   11 10 9  15 14 13

We can optimize this further by looking at the masks we use for each source pixel. If you look at the shuffle masks we use for s[1]:

f_0_high=  X  X  X  X  X  X  X  X  X  X  X  X  3  2  1  7
f_1_low=   6  5 11 10  9 15 14 13  X  X  X  X  X  X  X  X

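Since the two shuffle masks don't overlap, we can combine them and simply mask off the pixels that don't belong in combine_<x>, which we have already done! The following code performs all of these optimizations (and assumes that the source and destination addresses are 16-byte aligned). Also, the masks are written out in code in MSB->LSB order, in case you get confused about the ordering.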
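As a way to check the shuffle tables and the combine step without any SIMD hardware, the scheme can be modeled in plain C. This is only a sketch under stated assumptions: `shuf`, `keep_low`, `convert16_model`, and `check_convert16_model` are hypothetical names, the tables are written lowest byte first, and `keep_low` stands in for mask(0), mask(1), and mask(2) by counting how many low bytes come from the left operand.

```c
#include <assert.h>
#include <stdint.h>

#define Z 0x80  /* control byte with the top bit set: output 0 */

/* Merged shuffle controls for s[0]..s[3], lowest byte first. */
static const uint8_t shuf[4][16] = {
    /* s[0]: f_0_low in bytes 0-11 */
    { 3, 2, 1, 7, 6, 5, 11, 10, 9, 15, 14, 13, Z, Z, Z, Z },
    /* s[1]: f_1_low in bytes 0-7, f_0_high in bytes 12-15 */
    { 6, 5, 11, 10, 9, 15, 14, 13, Z, Z, Z, Z, 3, 2, 1, 7 },
    /* s[2]: f_2_low in bytes 0-3, f_1_high in bytes 8-15 */
    { 9, 15, 14, 13, Z, Z, Z, Z, 3, 2, 1, 7, 6, 5, 11, 10 },
    /* s[3]: f_2_high in bytes 4-15 */
    { Z, Z, Z, Z, 3, 2, 1, 7, 6, 5, 11, 10, 9, 15, 14, 13 },
};

/* Low bytes taken from the left operand of combine_0, _1, _2. */
static const int keep_low[3] = { 12, 8, 4 };

/* 16 ARGB pixels (64 bytes) in, 16 BGR pixels (48 bytes) out. */
static void convert16_model(const uint8_t src[64], uint8_t dst[48]) {
    uint8_t t[4][16];
    for (int k = 0; k < 4; k++)          /* the four shuffles */
        for (int i = 0; i < 16; i++) {
            uint8_t c = shuf[k][i];
            t[k][i] = (c & 0x80) ? 0 : src[16 * k + (c & 0x0F)];
        }
    for (int k = 0; k < 3; k++)          /* the three combines */
        for (int i = 0; i < 16; i++)
            dst[16 * k + i] = (i < keep_low[k]) ? t[k][i] : t[k + 1][i];
}

/* Self-check against the straightforward per-pixel conversion. */
static void check_convert16_model(void) {
    uint8_t src[64], dst[48];
    for (int i = 0; i < 64; i++) src[i] = (uint8_t)(i * 7 + 1);
    convert16_model(src, dst);
    for (int p = 0; p < 16; p++) {
        assert(dst[3 * p + 0] == src[4 * p + 3]);  /* Blue  */
        assert(dst[3 * p + 1] == src[4 * p + 2]);  /* Green */
        assert(dst[3 * p + 2] == src[4 * p + 1]);  /* Red   */
    }
}
```

Calling `check_convert16_model()` exercises every output byte, which is a cheap way to validate a set of shuffle tables before moving them into intrinsics.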

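EDIT: changed the stores to _mm_stream_si128, since you are likely doing a lot of writes and we don't necessarily want to flush the cache. The destination should be aligned anyway, so you get the performance for free!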

#include <assert.h>
#include <inttypes.h>
#include <tmmintrin.h>

// needs:
// orig and dest are 16-byte aligned
// imagesize is a multiple of 16
void convert(uint8_t *orig, size_t imagesize, uint8_t *dest) {
    assert((uintptr_t)orig % 16 == 0);
    assert((uintptr_t)dest % 16 == 0); // _mm_stream_si128 requires an aligned address
    assert(imagesize % 16 == 0);

    __m128i shuf0 = _mm_set_epi8(
        -128, -128, -128, -128, // top 4 bytes are not used
        13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3); // bottom 12 go to the first pixel

    __m128i shuf1 = _mm_set_epi8(
        7, 1, 2, 3, // top 4 bytes go to the first pixel
    -128, -128, -128, -128, // unused
        13, 14, 15, 9, 10, 11, 5, 6); // bottom 8 go to second pixel

    __m128i shuf2 = _mm_set_epi8(
        10, 11, 5, 6, 7, 1, 2, 3, // top 8 go to second pixel
    -128, -128, -128, -128, // unused
        13, 14, 15, 9); // bottom 4 go to third pixel

    __m128i shuf3 = _mm_set_epi8(
        13, 14, 15, 9, 10, 11, 5, 6, 7, 1, 2, 3, // top 12 go to third pixel
        -128, -128, -128, -128); // unused

    __m128i mask0 = _mm_set_epi32(0, -1, -1, -1);
    __m128i mask1 = _mm_set_epi32(0,  0, -1, -1);
    __m128i mask2 = _mm_set_epi32(0,  0,  0, -1);

    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 64, dest += 48) {
        __m128i a= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig), shuf0);
        __m128i b= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 1), shuf1);
        __m128i c= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 2), shuf2);
        __m128i d= _mm_shuffle_epi8(_mm_load_si128((__m128i *)orig + 3), shuf3);

        // _mm_andnot_si128(m, x) computes (~m) & x, so the mask is the first argument.
        _mm_stream_si128((__m128i *)dest, _mm_or_si128(_mm_and_si128(a, mask0), _mm_andnot_si128(mask0, b)));
        _mm_stream_si128((__m128i *)dest + 1, _mm_or_si128(_mm_and_si128(b, mask1), _mm_andnot_si128(mask1, c)));
        _mm_stream_si128((__m128i *)dest + 2, _mm_or_si128(_mm_and_si128(c, mask2), _mm_andnot_si128(mask2, d)));
    }
}

- MSN

Can you provide the shuffle code for BGRA to RGB? I can't figure out how all of this works. - Geoffrey

11

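I'm a little late to the party, and it seems the community has already settled on poseur's pshufb answer, but handing out 2000 reputation is so generous that I have to give it a try.

Here is my version without platform-specific intrinsics or machine-specific assembly. I have included some cross-platform timing code showing a 4x speedup if you do both the bit twiddling like me and enable compiler optimization (register optimization, loop unrolling):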

#include "stdlib.h"
#include "stdio.h"
#include "time.h"

#define UInt8 unsigned char

#define IMAGESIZE (1920*1080) 
int main() {
    time_t  t0, t1;
    int frames;
    int frame; 
    typedef struct{ UInt8 Alpha; UInt8 Red; UInt8 Green; UInt8 Blue; } ARGB;
    typedef struct{ UInt8 Blue; UInt8 Green; UInt8 Red; } BGR;

    ARGB* orig = malloc(IMAGESIZE*sizeof(ARGB));
    if(!orig) {printf("nomem1");}
    BGR* dest = malloc(IMAGESIZE*sizeof(BGR));
    if(!dest) {printf("nomem2");}

    printf("to start original hit a key\n");
    getch();
    t0 = time(0);
    frames = 1200;
    for(frame = 0; frame<frames; frame++) {
        int x; for(x = 0; x < IMAGESIZE; x++) {
            dest[x].Red = orig[x].Red;
            dest[x].Green = orig[x].Green;
            dest[x].Blue = orig[x].Blue;
        }
    }
    t1 = time(0);
    printf("finished original of %u frames in %u seconds\n", frames, t1-t0);

    // on my core 2 subnotebook the original took 16 sec 
    // (8 sec with compiler optimization -O3) so at 60 FPS 
    // (instead of the 1200) this would be faster than realtime 
    // (if you disregard any other rendering you have to do). 
    // However if you either want to do other/more processing 
    // OR want faster than realtime processing for e.g. a video-conversion 
    // program then this would have to be a lot faster still.

    printf("to start alternative hit a key\n");
    getch();
    t0 = time(0);
    frames = 1200;
    unsigned int* reader = (void*)orig; // must be initialized before end is computed
    unsigned int* end = reader+IMAGESIZE;
    unsigned int cur; // your question guarantees 32 bit cpu
    unsigned int next;
    unsigned int temp;
    unsigned int* writer;
    for(frame = 0; frame<frames; frame++) {
        reader = (void*)orig;
        writer = (void*)dest;
        next = *reader;
        reader++;
        while(reader<end) {
            cur = next;
            next = *reader;         
            // in the following the numbers are of course the bitmasks for 
            // 0-7 bits, 8-15 bits and 16-23 bits out of the 32
            temp = (cur&255)<<24 | (cur&65280)<<16|(cur&16711680)<<8|(next&255); 
            *writer = temp;
            reader++;
            writer++;
            cur = next;
            next = *reader;
            temp = (cur&65280)<<24|(cur&16711680)<<16|(next&255)<<8|(next&65280);
            *writer = temp;
            reader++;
            writer++;
            cur = next;
            next = *reader;
            temp = (cur&16711680)<<24|(next&255)<<16|(next&65280)<<8|(next&16711680);
            *writer = temp;
            reader++;
            writer++;
        }
    }
    t1 = time(0);
    printf("finished alternative of %u frames in %u seconds\n", frames, t1-t0);

    // on my core 2 subnotebook this alternative took 10 sec 
    // (4 sec with compiler optimization -O3)

}

Here are the results on my Core 2 subnotebook:

F:\>gcc b.c -o b.exe

F:\>b
to start original hit a key
finished original of 1200 frames in 16 seconds
to start alternative hit a key
finished alternative of 1200 frames in 10 seconds

F:\>gcc b.c -O3 -o b.exe

F:\>b
to start original hit a key
finished original of 1200 frames in 8 seconds
to start alternative hit a key
finished alternative of 1200 frames in 4 seconds

- Bernd Elkemann

1

By the way, these 1200 frames are of course 1920*1080-pixel images. - Bernd Elkemann

7

You want to use Duff's device: http://en.wikipedia.org/wiki/Duff%27s_device. It also works in JavaScript. This post, however, is a bit funny to read: http://lkml.indiana.edu/hypermail/linux/kernel/0008.2/0171.html. Imagine a Duff's device 512 Kbytes in size.

- Micromega

2

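Duff's device is just a weird, C-specific way of unrolling a loop. Getting really good performance takes more effort. - user149341

My C and assembly are a bit rusty, but unrolling a loop is better than doing nothing when you have to move everything with the CPU. - Micromega

1

@R: I'm not an experienced C programmer. I design web applications. Can you explain? What is so funny about it? - Micromega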

6

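In combination with one of the fast conversion functions here, and given access to Core 2s, it might be wise to split the translation into threads, each working on a quarter of the data, as in this pseudocode: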

void bulk_bgrFromArgb(byte[] dest, byte[] src, int n)
{
       thread threads[] = {
           create_thread(bgrFromArgb, dest, src, n/4),
           create_thread(bgrFromArgb, dest+n/4, src+n/4, n/4),
           create_thread(bgrFromArgb, dest+n/2, src+n/2, n/4),
           create_thread(bgrFromArgb, dest+3*n/4, src+3*n/4, n/4),
       }
       join_threads(threads);
}

- Dave
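The pseudocode above could look roughly like this with POSIX threads. This is a sketch under stated assumptions, not the answer's own code: `slice_t`, `bgr_from_argb`, and `bulk_bgr_from_argb` are hypothetical names, sizes are counted in pixels, and the per-slice converter is a plain byte loop standing in for whichever fast routine is chosen. Note that the source and destination pointers advance by different byte counts (4 and 3 bytes per pixel).

```c
#include <assert.h>
#include <pthread.h>
#include <stddef.h>
#include <stdint.h>

#define NTHREADS 4

typedef struct {
    const uint8_t *src;  /* ARGB input, 4 bytes per pixel */
    uint8_t *dest;       /* BGR output, 3 bytes per pixel */
    size_t pixels;       /* number of pixels in this slice */
} slice_t;

/* Placeholder per-slice converter; any faster routine could replace it. */
static void *bgr_from_argb(void *arg) {
    const slice_t *s = arg;
    for (size_t i = 0; i < s->pixels; i++) {
        s->dest[3 * i + 0] = s->src[4 * i + 3];  /* Blue  */
        s->dest[3 * i + 1] = s->src[4 * i + 2];  /* Green */
        s->dest[3 * i + 2] = s->src[4 * i + 1];  /* Red   */
    }
    return NULL;
}

/* Splits the image into NTHREADS contiguous slices, one thread each.
   Returns 0 on success or the pthread_create error code. */
static int bulk_bgr_from_argb(uint8_t *dest, const uint8_t *src, size_t pixels) {
    pthread_t tid[NTHREADS];
    slice_t slice[NTHREADS];
    size_t per = pixels / NTHREADS;
    int started = 0, err = 0;

    assert(dest != NULL && src != NULL);
    for (int t = 0; t < NTHREADS; t++) {
        size_t first = per * (size_t)t;
        slice[t].src = src + 4 * first;
        slice[t].dest = dest + 3 * first;
        /* The last slice also takes any remainder pixels. */
        slice[t].pixels = (t == NTHREADS - 1) ? pixels - first : per;
        err = pthread_create(&tid[t], NULL, bgr_from_argb, &slice[t]);
        if (err != 0) break;
        started++;
    }
    for (int t = 0; t < started; t++)
        pthread_join(tid[t], NULL);
    return err;
}
```

Creating threads per call is only reasonable for large frames; for a video pipeline a persistent worker pool would likely be a better fit, and whether extra threads help at all depends on whether one core can already saturate memory bandwidth.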

1

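Really? I would have thought the bottleneck is memory access rather than CPU processing, so using extra cores would not bring any benefit? - Thomas Padron-McCarthy

3

Each extra core comes with its own L1 cache, so although memory is the bottleneck, using more cores may give you some extra cache to help alleviate it. - Dave

That only works if your threads run on the cores where their parts of the array are already hot in cache. For example, if you have worker threads that just wrote different chunks of src, or that will later do more work on the part of dest they just wrote, and the right thread is matched with the right chunk and is hopefully still running on the same CPU core. Otherwise this only helps if a single CPU core cannot saturate memory bandwidth (which is the case on a big Xeon, but not on a typical modern quad-core desktop). - Peter Cordes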

5

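This assembly function should do the job, but I don't know whether you want to keep the old data; this function overwrites it.

The code is for MinGW GCC with Intel-syntax assembly; you will have to modify it to suit your compiler/assembler.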

extern "C" {
    int convertARGBtoBGR(uint buffer, uint size);
    __asm(
        ".globl _convertARGBtoBGR\n"
        "_convertARGBtoBGR:\n"
        "  push ebp\n"
        "  mov ebp, esp\n"
        "  sub esp, 4\n"
        "  mov esi, [ebp + 8]\n"
        "  mov edi, esi\n"
        "  mov ecx, [ebp + 12]\n"
        "  cld\n"
        "  convertARGBtoBGR_loop:\n"
        "    lodsd          ; load value from [esi] (4byte) to eax, increment esi by 4\n"
        "    bswap eax ; swap eax ( A R G B ) to ( B G R A )\n"
        "    stosd          ; store 4 bytes to [edi], increment  edi by 4\n"
        "    sub edi, 1; move edi 1 back down, next time we will write over A byte\n"
        "    loop convertARGBtoBGR_loop\n"
        "  leave\n"
        "  ret\n"
    );
}

You should call it like so:

convertARGBtoBGR( &buffer, IMAGESIZE );

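This function accesses memory only twice per pixel/packet (1 read, 1 write), compared with your brute-force method, which needs at least 3 reads and 3 writes (assuming it was compiled to registers). The method is the same, but the implementation makes it more efficient.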

- Sebi

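Even if you didn't need another add instruction to correct the pointer, stosd is slower than mov + add (https://uops.info/ and https://agner.org/optimize/). On Intel CPUs the `loop` instruction is very slow (https://dev59.com/XFsV5IYBdhLWcg3w0xsK), with a throughput of one per 5 cycles, which is the main bottleneck of this loop. This also violates the calling convention by clobbering the caller's ESI and EDI registers; use EDX in place of one of them. Also, declare the argument as `char *buffer` like a normal person, not `uint`. - Peter Cordes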

4

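You can do it in chunks of 4 pixels, moving 32 bits at a time through unsigned long pointers. From 4 32-bit pixels you can build, with shifts and bitwise AND/OR, 3 words that hold 4 24-bit pixels, like this: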

//col0 col1 col2 col3
//ARGB ARGB ARGB ARGB 32bits reading (4 pixels)
//BGRB GRBG RBGR  32 bits writing (4 pixels)

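Shift operations complete in a single instruction cycle on essentially all modern 32/64-bit processors (thanks to barrel shifters), so this is the fastest way to construct those 3 words for writing; bitwise AND and OR are also very fast.

Like this: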

// assuming we have 4 pixels ARGB1 ... ARGB4, each held as a 32-bit value with A in the
// most significant byte, and 3 32-bit words W1, W2 and W3 to write;
// *dest is an unsigned long pointer to the destination
W1 = ((ARGB1 & 0x000000ff) << 24) | ((ARGB1 & 0x0000ff00) << 8) | ((ARGB1 & 0x00ff0000) >> 8) | (ARGB2 & 0x000000ff);
*dest++ = W1;

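and so on, with the next pixels in a loop.

You will need some adjustment for images whose size is not a multiple of 4 pixels, but I bet this is the fastest approach that doesn't use assembly.

And by the way, forget about using structs and indexed access; those are the slower ways of moving data. Just look at the disassembly listing of a compiled C++ program and you will agree with me.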

- ruhalde
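For concreteness, here is one way the shift-and-mask idea from the answer above could be written out for all three output words. It is a sketch that assumes a little-endian CPU (as on the asker's Intel machines), so an ARGB pixel loaded from memory as a 32-bit value has A in the lowest byte and B in the highest, which is the opposite of the value notation used in the snippet above. The function name `argb_to_bgr_words` is hypothetical, and `memcpy` is used for the word loads and stores to avoid alignment and aliasing problems.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Converts `pixels` ARGB pixels to BGR, four pixels (three output words)
   per iteration. Assumes a little-endian CPU and pixels % 4 == 0. */
static void argb_to_bgr_words(const uint8_t *src, uint8_t *dest, size_t pixels) {
    assert(pixels % 4 == 0);
    for (size_t i = 0; i < pixels; i += 4, src += 16, dest += 12) {
        uint32_t p[4], w[3];
        memcpy(p, src, sizeof p);  /* p[k] = A | R<<8 | G<<16 | B<<24 */

        /* w[0] holds the memory bytes B0 G0 R0 B1 */
        w[0] = (p[0] >> 24) | ((p[0] >> 8) & 0x0000FF00u)
             | ((p[0] << 8) & 0x00FF0000u) | (p[1] & 0xFF000000u);
        /* w[1] holds the memory bytes G1 R1 B2 G2 */
        w[1] = ((p[1] >> 16) & 0x000000FFu) | (p[1] & 0x0000FF00u)
             | ((p[2] >> 8) & 0x00FF0000u) | ((p[2] << 8) & 0xFF000000u);
        /* w[2] holds the memory bytes R2 B3 G3 R3 */
        w[2] = ((p[2] >> 8) & 0x000000FFu) | ((p[3] >> 16) & 0x0000FF00u)
             | (p[3] & 0x00FF0000u) | ((p[3] << 16) & 0xFF000000u);

        memcpy(dest, w, sizeof w);
    }
}
```

Each iteration reads four words and writes three, and unlike the 4-byte-store tricks elsewhere in this thread it never writes past the end of the destination.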

3

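Although you can use some tricks based on CPU usage,

this kind of operation can be done fastest with a GPU.

It seems you are using C/C++, so on a Windows platform your alternatives for GPU programming may be:

DirectCompute (DirectX 11), see this video
Microsoft Research Project Accelerator, check this link
Cuda
"Google" GPU programming...

In short, use the GPU for this kind of array operation to make the calculations faster. GPUs are designed for it.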

- Novalis

2

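Don't forget that accessing GPU video memory over the CPU bus is slower than moving data within CPU-mapped memory. To be faster than the main processor, all conversions would have to take place in video RAM. - ruhalde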

3

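I haven't seen anyone show an example of how to do this on the GPU.

A while ago I wrote something similar to your problem. I received data from a video4linux2 camera in YUV format and wanted to draw it as gray levels on the screen (just the Y component). I also wanted to draw areas that are too dark in blue and oversaturated regions in red.

I started out with the smooth_opengl3.c example from the freeglut distribution.

The data is copied as YUV into the texture, and then the following GLSL shader programs are applied. I'm sure GLSL code runs on all Macs nowadays, and it will be significantly faster than all the CPU approaches.

Note that I have no experience with getting the data back. In theory glReadPixels should read the data back, but I never measured its performance.

OpenCL might be the easier approach, but I will only start developing for it once I have a notebook that supports it.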

(defparameter *vertex-shader*
"void main(){
    gl_Position    = gl_ModelViewProjectionMatrix * gl_Vertex;
    gl_FrontColor  = gl_Color;
    gl_TexCoord[0] = gl_MultiTexCoord0;
}
")

(progn
 (defparameter *fragment-shader*
   "uniform sampler2D textureImage;
void main()
{
  vec4 q=texture2D( textureImage, gl_TexCoord[0].st);
  float v=q.z;
  if(int(gl_FragCoord.x)%2 == 0)
     v=q.x; 
  float x=0; // 1./255.;
  v-=.278431;
  v*=1.7;
  if(v>=(1.0-x))
    gl_FragColor = vec4(255,0,0,255);
  else if (v<=x)
    gl_FragColor = vec4(0,0,255,255);
  else
    gl_FragColor = vec4(v,v,v,255); 
}
")

- whoplisp

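"It will be significantly faster than all the CPU approaches": not for small buffers, especially ones that are already hot in a CPU core's L2 cache. But for many large buffers this can be a good option, especially if you keep the CPU busy with other work at the same time, and provided you can get the data back to the CPU efficiently; GPU -> CPU transfers are often not very fast. - Peter Cordes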


- ughoavgfhw · Accepted Answer

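I wrote 4 different versions which work by swapping bytes. I compiled them using gcc 4.2.1 with -O3 -mssse3, ran them 10 times over 32MB of random data, and took the averages.

Editor's note: the original inline asm used unsafe constraints, e.g. modifying input-only operands, and not telling the compiler about the side effect on the memory pointed to by the pointer inputs. Apparently this worked fine for the benchmark. I fixed the constraints so that they are safe for all callers. This should not affect the benchmark numbers; it only makes sure the surrounding code is safe for all callers. Modern CPUs with higher memory bandwidth should see a bigger speedup for SIMD over 4-bytes-at-a-time scalar code, but the biggest benefit comes when the data is hot in cache, i.e. when working in smaller blocks or on smaller total sizes.

In 2020, your best bet is the portable _mm_loadu_si128 intrinsics version, which will compile to an equivalent asm loop: https://gcc.gnu.org/wiki/DontUseInlineAsm.

Also note that all of these overwrite 1 (scalar) or 4 (SIMD) bytes past the end of the output, so handle the last 3 bytes separately if that is a problem.

--- @PeterCordes

The first version uses a C loop to convert each pixel separately, using the OSSwapInt32 function (which compiles to a bswap instruction with -O3).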

void swap1(ARGB *orig, BGR *dest, unsigned imageSize) {
    unsigned x;
    for(x = 0; x < imageSize; x++) {
        *((uint32_t*)(((uint8_t*)dest)+x*3)) = OSSwapInt32(((uint32_t*)orig)[x]);
        // warning: strict-aliasing UB.  Use memcpy for unaligned loads/stores
    }
}
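The warning comment in swap1 suggests using memcpy for the unaligned loads and stores. A minimal sketch of that variant follows, assuming nothing beyond standard C: `bswap32` is a hand-written stand-in for the Apple-specific OSSwapInt32, and `swap1_memcpy` is a hypothetical name. It also stores only 3 bytes for the final pixel, so nothing is written past the end of the output.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Portable 32-bit byte reversal; optimizing compilers commonly recognize
   this pattern and emit a single bswap instruction. */
static uint32_t bswap32(uint32_t v) {
    return (v >> 24) | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u) | (v << 24);
}

/* Same idea as swap1: reverse each 4-byte ARGB pixel to B G R A and let
   the next pixel's store overwrite the stray A byte. */
static void swap1_memcpy(const uint8_t *orig, uint8_t *dest, size_t pixels) {
    assert(orig != NULL && dest != NULL);
    for (size_t x = 0; x < pixels; x++) {
        uint32_t v;
        memcpy(&v, orig + 4 * x, 4);   /* memory bytes A R G B */
        v = bswap32(v);                /* memory bytes B G R A */
        /* Full 4-byte store, except for the last pixel, which writes
           only its 3 useful bytes. */
        memcpy(dest + 3 * x, &v, (x + 1 < pixels) ? 4 : 3);
    }
}
```

Because the load and the store go through the same byte reversal, this version does not depend on the CPU's endianness. In a tuned build the last pixel would more likely be peeled out of the loop so that the hot path keeps a constant-size store.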

The second version performs the same operation, but uses an inline assembly loop instead of a C loop.

void swap2(ARGB *orig, BGR *dest, unsigned imageSize) {
    asm volatile ( // has to be volatile because the output is a side effect on pointed-to memory
        "0:\n\t"                   // do {
        "movl   (%1),%%eax\n\t"
        "bswapl %%eax\n\t"
        "movl   %%eax,(%0)\n\t"    // copy a dword byte-reversed
        "add    $4,%1\n\t"         // orig += 4 bytes
        "add    $3,%0\n\t"         // dest += 3 bytes
        "dec    %2\n\t"
        "jnz    0b"                // }while(--imageSize)
        : "+r" (dest), "+r" (orig), "+r" (imageSize)
        : // no pure inputs; the asm modifies and dereferences the inputs to use them as read/write outputs.
        : "flags", "eax", "memory"
    );
}

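The third version is a modified version of just a poseur's answer. I converted the built-in functions to their GCC equivalents and used the lddqu built-in function so that the input argument doesn't need to be aligned. (Editor's note: only the P4 ever benefited from lddqu; movdqu works just as well, though using lddqu has no downside.)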

typedef char v16qi __attribute__ ((vector_size (16)));
void swap3(uint8_t *orig, uint8_t *dest, size_t imagesize) {
    v16qi mask = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF};
    uint8_t *end = orig + imagesize * 4;
    for (; orig != end; orig += 16, dest += 12) {
        __builtin_ia32_storedqu(dest,__builtin_ia32_pshufb128(__builtin_ia32_lddqu(orig),mask));
    }
}

Finally, the fourth version is the inline assembly equivalent of the third.

void swap2_2(uint8_t *orig, uint8_t *dest, size_t imagesize) {
    static const int8_t mask[16] = {3,2,1,7,6,5,11,10,9,15,14,13,0xFF,0xFF,0xFF,0XFF};
    asm volatile (
        "lddqu  %3,%%xmm1\n\t"
        "0:\n\t"
        "lddqu  (%1),%%xmm0\n\t"
        "pshufb %%xmm1,%%xmm0\n\t"
        "movdqu %%xmm0,(%0)\n\t"
        "add    $16,%1\n\t"
        "add    $12,%0\n\t"
        "sub    $4,%2\n\t"
        "jnz    0b"
        : "+r" (dest), "+r" (orig), "+r" (imagesize)
        : "m" (mask)  // whole array as a memory operand.  "x" would get the compiler to load it
        : "flags", "xmm0", "xmm1", "memory"
    );
}

These all compile fine with GCC 9.3, but clang 10 doesn't know __builtin_ia32_pshufb128; use _mm_shuffle_epi8 instead.
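A version written with the portable intrinsics named in the editor's note might look like the sketch below. Some assumptions to flag: `swap3_intrin` is a hypothetical name, the SIMD path is guarded by the `__SSSE3__` macro that GCC and clang define when building with -mssse3 (other compilers may need a different check), and the loop stops early so that the 4 scratch bytes of each 16-byte store always land inside the output, with a scalar loop finishing the remaining pixels.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#if defined(__SSSE3__)
#include <tmmintrin.h>  /* _mm_shuffle_epi8 */
#endif

/* ARGB -> BGR using unaligned loads and stores. Writes exactly
   3 * pixels bytes to dest. Without SSSE3 enabled at compile time,
   only the scalar loop runs. */
static void swap3_intrin(const uint8_t *orig, uint8_t *dest, size_t pixels) {
    size_t i = 0;
    assert(orig != NULL && dest != NULL);
#if defined(__SSSE3__)
    /* Same byte order as the v16qi mask: lowest byte first. A control
       byte of -1 has its top bit set, so pshufb writes 0 there. */
    const __m128i mask = _mm_setr_epi8(3, 2, 1, 7, 6, 5, 11, 10, 9,
                                       15, 14, 13, -1, -1, -1, -1);
    /* A store at dest + 3*i touches 16 bytes, so at least 6 pixels
       (18 bytes) of output must remain for it to stay in bounds. */
    for (; i + 6 <= pixels; i += 4) {
        __m128i v = _mm_loadu_si128((const __m128i *)(orig + 4 * i));
        _mm_storeu_si128((__m128i *)(dest + 3 * i),
                         _mm_shuffle_epi8(v, mask));
    }
#endif
    for (; i < pixels; i++) {  /* scalar tail, or the whole image */
        dest[3 * i + 0] = orig[4 * i + 3];  /* Blue  */
        dest[3 * i + 1] = orig[4 * i + 2];  /* Green */
        dest[3 * i + 2] = orig[4 * i + 1];  /* Red   */
    }
}
```

The scalar tail costs a handful of pixels per image and removes the need for scratch bytes after the destination buffer.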

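On my 2010 MacBook Pro, 2.4 GHz i5 (Westmere/Arrandale), 4GB RAM, these were the average times for each:

Version 1: 10.8630 milliseconds
Version 2: 11.3254 milliseconds
Version 3:  9.3163 milliseconds
Version 4:  9.3584 milliseconds

As you can see, the compiler is good enough at optimization that you don't need to write assembly. Also, the vector functions were only about 1.5 milliseconds faster on 32MB of data, so it won't cause much harm if you want to support the earliest Intel Macs, which didn't support SSSE3.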

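Edit: liori asked for standard deviation information. Unfortunately, I hadn't saved the data points, so I ran another test with 25 iterations.

              Average    | Standard Deviation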
Brute force: 18.01956 ms | 1.22980 ms (6.8%)
Version 1:   11.13120 ms | 0.81076 ms (7.3%)
Version 2:   11.27092 ms | 0.66209 ms (5.9%)
Version 3:    9.29184 ms | 0.27851 ms (3.0%)
Version 4:    9.40948 ms | 0.32702 ms (3.5%)

Also, here is the raw data from the new tests, in case anyone wants it. For each iteration, a 32MB data set was randomly generated and run through the functions. The runtime of each function in microseconds is listed below.

Brute force: 22173 18344 17458 17277 17508 19844 17093 17116 19758 17395 18393 17075 17499 19023 19875 17203 16996 17442 17458 17073 17043 18567 17285 17746 17845
Version 1: 10508 11042 13432 11892 12577 10587 11281 11912 12500 10601 10551 10444 11655 10421 11285 10554 10334 10452 10490 10554 10419 11458 11682 11048 10601
Version 2: 10623 12797 13173 11130 11218 11433 11621 10793 11026 10635 11042 11328 12782 10943 10693 10755 11547 11028 10972 10811 11152 11143 11240 10952 10936
Version 3: 9036 9619 9341 8970 9453 9758 9043 10114 9243 9027 9163 9176 9168 9122 9514 9049 9161 9086 9064 9604 9178 9233 9301 9717 9156
Version 4: 9339 10119 9846 9217 9526 9182 9145 10286 9051 9614 9249 9653 9799 9270 9173 9103 9132 9550 9147 9157 9199 9113 9699 9354 9314
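For anyone who wants to recompute the summary statistics from the raw data above, here is a small sketch using the Version 3 timings. The helper names `mean` and `variance` are hypothetical, and the population form of the variance (dividing by n rather than n - 1) is an assumption about how the table was produced.

```c
#include <assert.h>
#include <stddef.h>

/* Version 3 timings in microseconds, copied from the raw data above. */
static const double version3_us[25] = {
    9036, 9619, 9341, 8970, 9453, 9758, 9043, 10114, 9243, 9027,
    9163, 9176, 9168, 9122, 9514, 9049, 9161, 9086, 9064, 9604,
    9178, 9233, 9301, 9717, 9156
};

/* Arithmetic mean of n samples. */
static double mean(const double *x, size_t n) {
    double sum = 0.0;
    assert(n > 0);
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    return sum / (double)n;
}

/* Population variance (divides by n); its square root is the
   standard deviation. */
static double variance(const double *x, size_t n) {
    double m = mean(x, n), acc = 0.0;
    for (size_t i = 0; i < n; i++)
        acc += (x[i] - m) * (x[i] - m);
    return acc / (double)n;
}
```

`mean(version3_us, 25)` gives 9291.84 microseconds, which is the 9.29184 ms shown in the table. The variance comes out near 7.757e4 square microseconds, whose square root is about 278.5 microseconds, in line with the listed 0.27851 ms under the population-form assumption.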