Helping GCC with auto-vectorization

Question


c++ gcc mingw sse vectorization

4

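I have a shader I need to optimise (it does a lot of vector operations), and I am experimenting with SSE instructions to understand the problem better.

I have some very simple sample code. With USE_SSE defined it uses explicit SSE intrinsics; without it, I am hoping GCC will do the work for me. Auto-vectorization feels a bit finicky, but I hope it will save me some effort.

Compiler and platform: gcc 4.7.1 (tdm64), target x86_64-w64-mingw32, Windows 7 on Ivy Bridge.

Here is the test code: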

/*
    Include all the SIMD intrinsics.
*/
#ifdef USE_SSE
#include <x86intrin.h>
#endif
#include <cstdio>

#if   defined(__GNUG__) || defined(__clang__) 
    /* GCC & CLANG */

    #define SSVEC_FINLINE __attribute__((always_inline))

#elif defined(_WIN32) && defined(_MSC_VER)
    /* MSVC. */

    #define SSVEC_FINLINE __forceinline

#else
#error Unsupported platform.
#endif


#ifdef USE_SSE

    typedef __m128 vec4f;

    inline void addvec4f(vec4f &a, vec4f const &b)
    {
        a = _mm_add_ps(a, b);
    }

#else

    typedef float vec4f[4];

    inline void addvec4f(vec4f &a, vec4f const &b)
    {
        a[0] = a[0] + b[0];
        a[1] = a[1] + b[1];
        a[2] = a[2] + b[2];
        a[3] = a[3] + b[3];
    }

#endif

int main(int argc, char *argv[])
{
    int const count = 1e7;

    #ifdef USE_SSE
    printf("Using SSE.\n");
    #else
    printf("Not using SSE.\n");
    #endif

    vec4f data = {1.0f, 1.0f, 1.0f, 1.0f};

    for (int i = 0; i < count; ++i)
    {
        vec4f val = {0.1f, 0.1f, 0.1f, 0.1f};
        addvec4f(data, val);
    }

    float result[4] = {0};
    #ifdef USE_SSE
    _mm_store_ps(result, data);
    #else
    result[0] = data[0];
    result[1] = data[1];
    result[2] = data[2];
    result[3] = data[3];
    #endif

    printf("Result: %f %f %f %f\n", result[0], result[1], result[2], result[3]);

    return 0;
}

It is compiled with:

g++ -O3 ssetest.cpp -o nossetest.exe
g++ -O3 -DUSE_SSE ssetest.cpp -o ssetest.exe

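Apart from the explicit SSE version being slightly faster, there is no difference in the output.

Here is the assembly for the loop, first the explicit SSE version: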

.L3:
subl    $1, %eax
addps   %xmm1, %xmm0
jne .L3

It inlined the function call. Good, it is basically just a straight _mm_add_ps.

The array version:

.L3:
subl    $1, %eax
addss   %xmm0, %xmm1
addss   %xmm0, %xmm2
addss   %xmm0, %xmm3
addss   %xmm0, %xmm4
jne .L3

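It does use SSE arithmetic, but one scalar addss per array member, which is not ideal.

My question is: how can I help GCC optimise the array version of vec4f better?

Any Linux-specific tips are welcome too, since that is where the real code will run.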

- Skurmedel

Note that float result[4] may not be 16-byte aligned on the stack. It happens to work in this case; otherwise _mm_store_ps would fault. - Brett Hale
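To make the alignment point above concrete, here is a minimal sketch of the two usual ways around it: give the destination array an explicit 16-byte alignment, or use the unaligned store. The helper names are hypothetical and not part of the question's code. `alignas` is C++11 and assumes a newer compiler than the gcc 4.7.1 used here; with older GCC, `__attribute__((aligned(16)))` on the array is the usual equivalent. A scalar fallback is included so the sketch also builds where SSE is unavailable.

```cpp
#include <cstdint>
#if defined(__SSE__)
#include <xmmintrin.h>
#endif

// Hypothetical helper: reports whether p sits on a 16-byte boundary.
inline bool is_aligned16(const void *p)
{
    return reinterpret_cast<std::uintptr_t>(p) % 16 == 0;
}

// Hypothetical helper: writes four copies of (x + y) to out.
// On the SSE path, out MUST be 16-byte aligned, because _mm_store_ps
// is the aligned store.
inline void store_sum_aligned(float *out, float x, float y)
{
#if defined(__SSE__)
    __m128 v = _mm_add_ps(_mm_set1_ps(x), _mm_set1_ps(y));
    _mm_store_ps(out, v);
#else
    for (int i = 0; i < 4; ++i)
        out[i] = x + y;
#endif
}

// Same computation, but out may have any alignment, because the SSE
// path uses the unaligned store _mm_storeu_ps.
inline void store_sum_unaligned(float *out, float x, float y)
{
#if defined(__SSE__)
    __m128 v = _mm_add_ps(_mm_set1_ps(x), _mm_set1_ps(y));
    _mm_storeu_ps(out, v);
#else
    for (int i = 0; i < 4; ++i)
        out[i] = x + y;
#endif
}
```

In the question's main, this would correspond to declaring `alignas(16) float result[4] = {0};` before the `_mm_store_ps` call, or to swapping that call for `_mm_storeu_ps`.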

2 Answers

4

Here are some tips, based on your code, for getting GCC's auto-vectorization to work:

Make the loop upper bound a constant. To vectorize, GCC needs to split the loop into groups of 4 iterations, so that each group fits in an SSE XMM register, which is 128 bits wide. A constant upper bound lets GCC confirm that the loop has enough iterations for vectorization to be profitable.
Remove the inline keyword. If the code is marked inline, GCC cannot know whether the start of the array is aligned without interprocedural analysis, which -O3 does not turn on.

So, to get your code vectorized, modify your addvec4f function as follows:
```
void addvec4f(vec4f &a, vec4f const &b)
{
    int i = 0;
    for(;i < 4; i++)
      a[i] = a[i]+b[i];
}
```

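By the way:

GCC also has a flag that helps you find out whether a loop was vectorized: -ftree-vectorizer-verbose=2. The higher the number, the more information is printed; currently it can be 0, 1, or 2. See the GCC documentation for this flag and for other related flags.
Pay attention to alignment. The array's address should be aligned, and the compiler cannot determine before run time whether an address is aligned. Usually, if the data is not aligned, you get a bus error.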
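As a rough illustration of the two notes above, the sketch below shows one way to pass alignment information to GCC explicitly instead of leaving it to guess. The function `add_floats` is a hypothetical helper, not code from the question. It assumes GCC or Clang for `__builtin_assume_aligned` and `__restrict`, and it assumes the caller really does pass 16-byte-aligned, non-overlapping arrays; breaking that promise is undefined behaviour. The compile commands in the comment are the report flag named above plus `-fopt-info-vec`, which replaces it in GCC 4.9 and later.

```cpp
#include <cstddef>

// Build with a vectorizer report, for example:
//   g++ -O3 -ftree-vectorizer-verbose=2 -c add_floats.cpp   (GCC 4.7)
//   g++ -O3 -fopt-info-vec -c add_floats.cpp                (GCC 4.9+)
// The file name add_floats.cpp is only an example.

// Hypothetical helper: adds b to a element-wise.
// Precondition (assumed, not checked): a and b are 16-byte aligned
// and do not overlap.
void add_floats(float *__restrict a, const float *__restrict b, std::size_t n)
{
#if defined(__GNUC__)
    // Tell the compiler about the alignment promise.
    float *pa = static_cast<float *>(__builtin_assume_aligned(a, 16));
    const float *pb =
        static_cast<const float *>(__builtin_assume_aligned(b, 16));
#else
    float *pa = a;
    const float *pb = b;
#endif
    for (std::size_t i = 0; i < n; ++i)
        pa[i] += pb[i];
}
```

Whether the loop actually gets vectorized still depends on the compiler version and flags, so the report output is the thing to check.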

- Kun Ling

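Thanks, I will try your suggestions. - Skurmedel

In general, knowing more about what is happening at the call site helps rather than hurts. Your argument for removing inline makes no sense. alignof(vec4f) is 16, so the compiler can always assume that a vec4f& is aligned; if it were not, that would be undefined behaviour. Also, this function is a wrapper for what should be a single instruction; it would be a complete disaster if it did not actually inline, especially with reference parameters. (Although, since x86-64 System V has no call-preserved XMM registers, it would be only slightly better with by-value vector arguments and return values.) - undefined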
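One caveat when reading the comment above: the 16-byte figure applies when vec4f is the __m128 typedef. For the question's plain `float[4]` typedef, the array only has the alignment of `float`. The sketch below shows one possible way to give the array version the same type-level guarantee. `aligned_vec4f` is a hypothetical wrapper, not something proposed in the thread, and `alignas` assumes C++11.

```cpp
#include <cstddef>

// The question's non-SSE typedef: alignment is that of a single float.
typedef float plain_vec4f[4];

// Hypothetical wrapper: the same four floats, but the type itself
// promises 16-byte alignment, so any reference to it can be assumed
// aligned by the compiler.
struct alignas(16) aligned_vec4f
{
    float v[4];
};

// Loop form of the addition, kept inline as the comment recommends.
inline void addvec4f(aligned_vec4f &a, aligned_vec4f const &b)
{
    for (int i = 0; i < 4; ++i)
        a.v[i] += b.v[i];
}
```

Whether GCC 4.7 turns this into a single addps is something to confirm from the generated assembly rather than assume.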


- Shafik Yaghmour · Accepted Answer

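This LockLess article on auto-vectorization with gcc 4.7 is hands down the best one I have seen, and I have spent a while looking for good articles on similar topics. They also have many other articles that you may find very useful on all kinds of low-level software development topics.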