Do modern C++ compilers auto-vectorize code for 24-bit image processing?

4

Would compilers such as GCC, Visual Studio C++, the Intel C++ compiler, Clang, etc. vectorize the following code?

std::vector<unsigned char> img( height * width * 3 );
unsigned char channelMultiplier[3];

// ... initialize img and channelMultiplier ...

for ( int y = 0; y < height; ++y )
    for ( int x = 0; x < width; ++x )
        for ( int b = 0; b < 3; ++b )
            img[ b+3*(x+width*y) ] = img[ b+3*(x+width*y) ] * 
                                     channelMultiplier[b] / 0x100;

Does the same hold for 32-bit image processing?

2
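That depends entirely on which compiler you use and which optimization options you choose. Also, you can ask the compiler to generate an assembly listing and check for yourself. - Greg Hewgill
Assume that you choose the strongest optimization options. - Ralph Tandetzky
It is rare for a compiler to produce well-optimized code without any hints from you. That is why packages such as Intel's performance-optimized libraries exist: they provide hand-tuned performance that the compiler cannot reach on its own. However, as Greg said above, for your specific code you need to look at the assembly and check which optimizations are actually present. - Roman R.
1 Answer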

8

I don't think your triple loop will be auto-vectorized. In my opinion the problems are:

  • Memory is accessed through a std::vector object. As far as I know, a compiler can only auto-vectorize std::vector code once operator[] has been inlined, and even then it is not clear to me that the loop will be auto-vectorized.
  • Your code suffers from possible memory aliasing: the compiler cannot tell whether the memory behind img is also reachable through another pointer, and this will most likely block vectorization. Basically you need to work on a plain array (double in the snippet below, unsigned char for your image) and tell the compiler that no other pointer refers to the same location. I think you can do that with __restrict, which tells the compiler that this pointer is the only one through which that memory is accessed, so there is no risk of side effects.
  • The memory is not aligned by default, and even if the compiler manages to auto-vectorize, vector operations on unaligned memory can be noticeably slower than on aligned memory. To get the most out of AVX the buffer should be aligned to a 32-byte boundary, and to a 16-byte boundary for SSE, so the simplest rule is to always align to 32 bytes. You can do this dynamically via:

    double* buffer = NULL;
    posix_memalign((void**) &buffer, 32, size*sizeof(double));
    ...
    free(buffer);
    

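In MSVC you can do this with __declspec(align(32)) double array[size] (where size is a compile-time constant), but check the documentation of the specific compiler you are using to make sure you use the correct alignment directive.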

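Another important thing: if you use the GNU compiler, use the flag -ftree-vectorizer-verbose=6 to check whether your loop is being auto-vectorized. If you use the Intel compiler, use -vec-report5. Note that there are several levels of verbosity and information output, i.e. the numbers 6 and 5, so check your compiler's documentation (newer GCC releases replaced -ftree-vectorizer-verbose with the -fopt-info family of flags). The higher the verbosity level, the more vectorization information you get for each loop in your code, but the slower the compiler will be when compiling in release mode.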
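As a rough way to try this out, the sketch below is a tiny kernel with example report invocations in the leading comment. The file name vec_report.cpp and the function halve are made up for illustration, and the flag spellings are the ones used by recent GCC, Clang and MSVC releases rather than the older flags named above, so treat them as a starting point and confirm them against your compiler's documentation.

```cpp
// vec_report.cpp (hypothetical file name): a tiny kernel for experimenting
// with vectorization reports. Example invocations:
//   g++     -O3 -c vec_report.cpp -fopt-info-vec-optimized -fopt-info-vec-missed
//   clang++ -O3 -c vec_report.cpp -Rpass=loop-vectorize -Rpass-missed=loop-vectorize
//   cl      /O2 /c vec_report.cpp /Qvec-report:2
#include <cstddef>

// Writes src[i] / 2 into dst[i]. __restrict tells the compiler that the two
// buffers do not overlap, which removes one common obstacle to vectorization.
void halve(unsigned char* __restrict dst,
           const unsigned char* __restrict src,
           std::size_t n)
{
    for (std::size_t i = 0; i < n; ++i)
        dst[i] = static_cast<unsigned char>(src[i] / 2);
}
```

The report should name the source line of the loop and say whether it was vectorized or why not; the exact wording differs between compilers and versions.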

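In general, I am always surprised by how hard it is to get the compiler to auto-vectorize; it is a common mistake to assume that a loop will be auto-vectorized just because it looks canonical.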

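Update: one more thing, make sure your img is actually page-aligned: posix_memalign((void**) &buffer, sysconf(_SC_PAGESIZE), size*sizeof(double)); (this implies AVX and SSE alignment). The problem is that with a big image this loop will most likely end up crossing pages during execution, which is also very expensive. I think this is what is known as a TLB miss.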
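A minimal sketch of the page-aligned allocation for an unsigned char image buffer, assuming a POSIX system. The helper name allocPageAligned is hypothetical. Whether page alignment measurably helps this loop is something to benchmark, not something the sketch demonstrates.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdlib.h>  // posix_memalign, free (POSIX)
#include <unistd.h>  // sysconf, _SC_PAGESIZE (POSIX)

// Hypothetical helper, not from the original post: returns a buffer of at
// least `bytes` bytes aligned to the system page size, or nullptr on failure.
// The caller releases it with free().
unsigned char* allocPageAligned(std::size_t bytes)
{
    const long page = sysconf(_SC_PAGESIZE);
    if (page <= 0)
        return nullptr;

    void* p = nullptr;
    // posix_memalign returns 0 on success and an error number otherwise.
    if (posix_memalign(&p, static_cast<std::size_t>(page), bytes) != 0)
        return nullptr;

    return static_cast<unsigned char*>(p);
}
```

For the image in the question this could be used as unsigned char* pixels = allocPageAligned(static_cast<std::size_t>(height) * width * 3); followed by free(pixels) once the buffer is no longer needed.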

