如何帮助GCC将C代码向量化。

Question

如何帮助GCC将C代码向量化。

8

我有以下C代码。第一部分从标准输入中读取由复数组成的矩阵到名为M的矩阵中。有趣的部分是第二部分。

#include <stdio.h>
#include <complex.h>
#include <stdlib.h>
#include <assert.h>
#include <math.h>

int main() {
    int n, m, c, d;
    float re, im;

    scanf("%d %d", &n, &m);
    assert(n==m);
    complex float M[n][n];

    for(c=0; c<n; c++) {
      for(d=0; d<n; d++) {
    scanf("%f%fi", &re, &im);
    M[c][d] = re + im * I;
      }
    }

    for(c=0; c<n; c++) {
      for(d=0; d<n; d++) {
        printf("%.2f%+.2fi ", creal(M[c][d]), cimag(M[c][d]));
      }
      printf("\n");
    }
/*
Example:input   
2 3
1+2i 2+3i 74-4i
3+4i 4+5i -7-8i
*/
    /* Part 2. M is now an n by n matrix of complex numbers */
    int s=1, i, j;
    int *f = malloc(n * sizeof *f);
    complex float *delta = malloc(n * sizeof *delta);
    complex float *v = malloc(n * sizeof *v);
    complex float p = 1, prod;

    for (i = 0; i < n; i++) {
      v[i] = 0;
      for (j = 0; j <n; j++) {
        v[i] += M[j][i];
      }
      p *= v[i];
      f[i] = i;
      delta[i] = 1;
    }
    j = 0;
    while (j < n-1) {
      prod = 1.;
      for (i = 0; i < n; i++) {
        v[i] -= 2.*delta[j]*M[j][i];
        prod *= v[i];
      }
      delta[j] = -delta[j];
      s = -s;            
      p += s*prod;
      f[0] = 0;
      f[j] = f[j+1];
      f[j+1] = j+1;
      j = f[0];
    }
    free(delta);
    free(f);
    free(v);
    printf("%f + i%f\n", creal(p/pow(2.,(n-1))), cimag(p/pow(2.,(n-1))));
    return 0;
}

我使用 gcc -fopt-info-vec-all -O3 -ffast-math -march=bdver2 permanent-in-c.c -lm 进行编译。这让我明白为什么几乎没有循环被向量化。

对于性能而言，最重要的部分是第47到50行，它们是：

for (i = 0; i < n; i++) {
    v[i] -= 2.*delta[j]*M[j][i];
    prod *= v[i];
}

gcc 告诉我：

permanent-in-c.c:47:7: note: reduction used in loop.
permanent-in-c.c:47:7: note: Unknown def-use cycle pattern.
permanent-in-c.c:47:7: note: reduction used in loop.
permanent-in-c.c:47:7: note: Unknown def-use cycle pattern.
permanent-in-c.c:47:7: note: Unsupported pattern.
permanent-in-c.c:47:7: note: not vectorized: unsupported use in stmt.
permanent-in-c.c:47:7: note: unexpected pattern.
[...]
permanent-in-c.c:48:26: note: SLP: step doesn't divide the vector-size.
permanent-in-c.c:48:26: note: Unknown alignment for access: IMAGPART_EXPR <*M.4_40[j_202]{lb: 0 sz: pretmp_291 * 4}[i_200]>
permanent-in-c.c:48:26: note: SLP: step doesn't divide the vector-size.
permanent-in-c.c:48:26: note: Unknown alignment for access: REALPART_EXPR <*M.4_40[j_202]{lb: 0 sz: pretmp_291 * 4}[i_200]>
[...]
permanent-in-c.c:48:26: note: Build SLP failed: unrolling required in basic block SLP
permanent-in-c.c:48:26: note: Failed to SLP the basic block.
permanent-in-c.c:48:26: note: not vectorized: failed to find SLP opportunities in basic block.

我该如何解决阻止这部分向量化的问题？

有趣的是，这一部分已被向量化，但我不确定原因：

for (j = 0; j <n; j++) {
    v[i] += M[j][i];

完整的 gcc -fopt-info-vec-all -O3 -ffast-math -march=bdver2 permanent-in-c.c -lm 的输出位于 https://bpaste.net/show/18ebc3d66a53。

- Simd

1

SIMD指令需要16字节对齐，而malloc并不保证这一点。如果您切换到posix_memalign，会发生什么？ - Sean Bright

2

实际上，我收回我说过的一切。我看到的来源都说，在64位系统上，malloc将始终返回16字节对齐的指针，所以这可能是一个误导。 - Sean Bright

4

你正在要求编译器对一个复数乘法进行矢量化。虽然这是可能的（手动完成），但我不确定编译器是否聪明到可以自动完成它。 - Mysticial

2

@eleanora 很可能是这样。即使你手动操作，也会有很多洗牌开销。如果你真的关心性能，就不能使用complex类型，因为它是AOS打包。你需要重新排列数据到SOA。 - Mysticial

2

我不信任也不期望GCC向量化器能够产生真正好的结果；相反，我使用 GCC向量扩展（使用单独的矩阵和向量来处理实部和虚部）。代码最终变得几乎是只写不读的，因此我需要保留一个非向量化版本的同一函数（用于参考和单元/比较测试）。 - Nominal Animal

显示剩余9条评论

3个回答

4

我想我可能已经找到了解决方法。经过多次尝试和错误，明确的是gcc内置的向量化优化有点硬编码，并且它不能正确理解复数。我在代码中做了一些更改，让您的性能敏感循环进行了向量化，由gcc输出确认（虽然我不确定期望结果是否与您想要的计算等效）。尽管我的理解仅限于您想要代码执行的内容，但发现计算实部和虚部分别将可以正常运作。请看：

    float t_r = 0.0, t_im = 0.0; // two new temporaries  
    while (j < n-1) {
        prod = 1.;
        for (i = 0; i < n; i++) {
// fill the temps after subtraction from V to avoid stmt error
            t_r = creal (v[i]) - (2. * creal(delta[j]) * creal (M[j][i]));
            t_im = cimag(v[i]) - (2. * cimag(delta[j]) * cimag (M[j][i])) * I;
            //v[i] = 2.*delta[j]*M[j][i];
            v[i] = t_r + t_im; // sum of real and img
            prod *= v[i];
        }
        delta[j] = -delta[j];
        s = -s;            
        p += s*prod;
        f[0] = 0;
        f[j] = f[j+1];
        f[j+1] = j+1;
        j = f[0];
    }

- 16tons

据我所知，这并没有帮助。我认为问题在于 prod *= v[i]; 部分，gcc 不知道如何进行矢量化处理。 - Simd

我忘了在我的回答中写。如果你将实部和虚部分开，'prod'部分可以很好地向量化。我在写评论后进行了测试（我可以确认看到两个LOOP VECTORIZED语句，一个是针对v[i]，另一个是针对prod）。我认为你最好使用一个结构体来自己保存实部和虚部，以便更好地控制乘法。 - 16tons

1

我认为这只是将不可避免的问题转移到了代码的另一个部分。请参见例如 https://godbolt.org/g/WdMvIh 并搜索 mulss。我认为在gcc中唯一的方法是使用它的https://gcc.gnu.org/onlinedocs/gcc/Vector-Extensions.html，可惜了。 - Simd

1

@eleanora 只是为了澄清，尽管GCC确实设法对其进行了矢量化，但它的矢量化效果非常差。重申之前的评论，如果你想要良好的矢量化效果，就不能使用complex类型。你需要将实部和虚部分别拆分到不同的数组中。如果你愿意手动进行矢量化，还有其他方法可供选择，但无论哪种方式，都需要改变数据布局。 - Mysticial

@Mysticial，您能展示一下如何将矩阵M向量化吗？ - Simd

显示剩余7条评论

-1

优化器日志清楚地表明：

访问的未知对齐方式：...

在尝试进行向量化时出现。

printf("%.2f%+.2fi ", creal(M[c][d]), cimag(M[c][d])); //24
v[i] += M[j][i]; //38
p *= v[i]; //40
v[i] -= 2.*delta[j]*M[j][i]; //48

看起来你需要强制对数组 M、delta 和 v 在内存中进行对齐才能使它们真正关联起来。

GCC 中的自动向量化

只处理对齐内存访问（不尝试对包含非对齐访问的循环进行向量化）

正如先前的评论所提到的，我建议您使用 posix_memalign 来实现这个目的。

complex float * restrict delta;
posix_memalign(&delta, 64, n * sizeof *delta); //to adapt

你的目标环境是什么？（操作系统，CPU）

请查看data-alignment-to-assist-vectorization。

- Jeandey Boris

这确实有所改善，但令人遗憾的是它仍然无法向量化。请参见http://bpaste.net/show/561f6b8d2e21。第47和48行仍然是相关的。 - Simd

目标操作系统是Ubuntu，CPU 是 AMD FX 8350。GCC 将其识别为 -march=bdver2。 - Simd

如果你将第47-50行的循环分成两部分，似乎问题实际上完全在于prod *= v[i];这一部分。gcc似乎不知道如何对一个复数数组的乘积进行向量化处理。 - Simd

prod *= v[i];会破坏向量化，因为prod不是一个向量。 - Jeandey Boris

好的，但是英特尔C编译器似乎能够将数组的乘积向量化。例如，请参见https://godbolt.org/g/q3neAU - Simd

1

事实上，英特尔 C 编译器似乎可以向量化和展开乘法序列。我认为英特尔 C 编译器可以处理比 gcc 更复杂的 SLP 模式。 - Jeandey Boris

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Nominal Animal · Accepted Answer

首先，让我们详细地检查这段代码。我们有：

complex float  M[rows][cols];
complex float  v[cols];
float          delta[rows];
complex float  p = 1.0f;
float          s = 1.0f;

delta[]在OP的代码中是complex float类型，但它只包含-1.0f或者+1.0f。（此外，如果它是-2.0f或+2.0f，计算会更简单。）因此，我将其定义为实数而不是复数。

同样地，OP将s定义为int，但实际上只使用了-1.0f和+1.0f（在计算中）。这就是我将其显式定义为float的原因。

我省略了f数组，因为完全有一种简单的方法可以避免使用它。

代码中有趣部分的第一个循环：

    for (i = 0; i < n; i++) {
      v[i] = 0;
      for (j = 0; j <n; j++) {
        v[i] += M[j][i];
      }
      p *= v[i];
      delta[i] = 1;
    }

执行了几件事情。它将delta []数组中的所有元素初始化为1；它可以（并且可能应该）拆分为单独的循环。

由于外部循环增加了i，因此p将是v中元素的乘积；它也可以拆分成单独的循环。

由于内部循环将第i列中的所有元素相加到v [i]，因此外部和内部循环仅将每行作为向量相加到向量v。

因此，我们可以将上述内容重写为伪代码：

Copy first row of matrix M to vector v
For r = 1 .. rows-1:
    Add complex values in row r of matrix M to vector v

p = product of complex elements in vector v

delta = 1.0f, 1.0f, 1.0f, .., 1.0f, 1.0f

接下来我们看第二个嵌套循环：

    j = 0;
    while (j < n-1) {
      prod = 1.;
      for (i = 0; i < n; i++) {
        v[i] -= 2.*delta[j]*M[j][i];
        prod *= v[i];
      }
      delta[j] = -delta[j];
      s = -s;            
      p += s*prod;
      f[0] = 0;
      f[j] = f[j+1];
      f[j+1] = j+1;
      j = f[0];
    }

只有在观察循环进展的 j 的值时才能看到，但是外层循环主体的最后四行实现了OEISA007814整数序列在j（0,1,0,2,0,1,0,3,0,1,0,2,0,1,0,4,...）中。该循环迭代次数为2^rows-1-1。这部分序列是对称的，并实现了高度为rows-1的二叉树：

               4
       3               3
   2       2       2       2     (Read horizontally)
 1   1   1   1   1   1   1   1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

原来，如果我们循环遍历 i = 1 .. 2^rows-1，那么 r 就是 i 中零位的数量。GCC 提供了一个__builtin_ctz() 内置函数来计算这个值。（注意，__builtin_ctz(0) 的值是未定义的；所以即使它在您的电脑上产生了特定的值，也不要这样做。）

内部循环从矩阵的第 j 行中减去按比例缩放的复数值，并从向量 v[] 中计算出复数项的乘积（在减法之后），并将其存储在变量 prod 中。

在内部循环之后，将反转 delta[j] 和比例因子 s 的值。将变量 prod 的值乘以 s 的值后加到 p 上。

在循环之后，最终（复数）结果是将 p 除以 2^rows-1 得到的。这可以使用 C99 函数ldexp()（分别在实部和虚部上）更好地完成。

因此，我们可以将第二个嵌套循环重写为伪代码。

s = 1.0f

For k = 1 .. rows-1, inclusive:
    r = __builtin_ctz(k), i.e. number of least
                          significant bits that
                          are zero in k

    Subtract the complex values on row r of matrix M,
        scaled by delta[r], from vector v[]

    prod = the product of (complex) elements in vector v[]

    Negate scale factor s (changing its sign)

    Add prod times s to result p

在我的经验中，将实部和虚部分别拆分为不同的向量和矩阵是最好的。请考虑以下定义：

typedef struct {
    float  *real;
    float  *imag;
    size_t  floats_per_row;  /* Often 'stride' */
    size_t  rows;
    size_t  cols;
} complex_float_matrix;

/* Set an array of floats to a specific value */
void float_set(float *, float, size_t);

/* Copy an array of floats */
void float_copy(float *, const float *, size_t);

/* malloc() vector-aligned memory; size in floats */
float *float_malloc(size_t);

/* Elementwise addition of floats */
void float_add(float *, const float *, size_t);

/* Elementwise addition of floats, scaled by a real scale factor */
void float_add_scaled(float *, const float *, float, size_t);

/* Complex float product, separate real and imag arrays */
complex float complex_float_product(const float *, const float *, size_t);

所有上述内容都可以轻松进行矢量化，只要float_malloc()返回的指针具有足够的对齐度（并且通过例如GCC函数属性__attribute__((__assume_aligned__(BYTES_IN_VECTOR)));告知编译器），并且矩阵中的floats_per_row成员是向量中浮点数数量的倍数。

（我不知道GCC是否可以自动矢量化上述函数，但我知道可以使用GCC矢量扩展手动进行矢量化。）

有了上述内容，整个代码的有趣部分，以伪C语言表示，变成了：

complex float permanent(const complex_float_matrix *m)
{
    float         *v_real, *v_imag;
    float         *scale;   /* OP used 'delta' */
    complex float  result;  /* OP used 'p'     */
    complex float  product; /* OP used 'prod'  */
    float          coeff = 1.0f; /* OP used 's' */
    size_t         i = 1 << (m->rows - 1);
    size_t         r;

    if (!m || m->cols < 1 || m->rows < 1 || !i) {
        /* TODO: No input matrix, or too many rows in input matrix */
    }

    v_real = float_malloc(m->cols);
    v_imag = float_malloc(m->cols);
    scale  = float_malloc(m->rows - 1);
    if (!v_real || !v_imag || !scale) {
        free(scale);
        free(v_imag);
        free(v_real);
        /* TODO: Out of memory error */
    }

    float_set(scale, 2.0f, m->rows - 1);

    /* Sum matrix rows to v. */

    float_copy(v_real, m->real, m->cols);
    for (r = 1; r < m->rows; r++)
        float_add(v_real, m->real + r * m->floats_per_row, m->cols);

    float_copy(v_imag, m->imag, m->cols);
    for (r = 1; r < m->rows; r++)
        float_add(v_imag, m->imag + r * m->floats_per_row, m->cols);

    result = complex_float_product(v_real, v_imag, m->cols);

    while (--i) {
        r = __builtin_ctz(i);

        scale[r] = -scale[r];

        float_add_scaled(v_real, m->real + r * m->floats_per_row, m->cols);
        float_add_scaled(v_imag, m->imag + r * m->floats_per_row, m->cols);

        product = complex_float_product(v_real, v_imag, m->cols);

        coeff = -coeff;

        result += coeff * product;
    }

    free(scale);
    free(v_imag);
    free(v_real);

    return result;
}

目前，我个人会先不使用向量化实现上述内容，然后进行广泛测试，直到确定其正确性。

然后，我会检查GCC汇编输出（-S）以查看它是否可以足够地对各个操作（我之前列出的函数）进行向量化。

使用GCC向量扩展手动对函数进行向量化非常简单。声明一个浮点向量很容易：

typedef  float  vec2f __attribute__((vector_size (8), aligned (8)));  /* 64 bits; MMX, 3DNow! */
typedef  float  vec4f __attribute__((vector_size (16), aligned (16))); /* 128 bits; SSE */
typedef  float  vec8f __attribute__((vector_size (32), aligned (32))); /* 256 bits; AVX, AVX2 */
typedef  float  vec16f __attribute__((vector_size (64), aligned (64))); /* 512 bits; AVX512F */

每个向量中的单独组件可以使用数组表示法（对于vec2f v;，使用v [0]和v [1]）。GCC可以对整个向量逐元素执行基本操作；在这里我们只需要加法和乘法。应避免水平操作（操作在同一向量中的元素），而应重新排序元素。

即使在没有此类向量化的体系结构上，GCC也会为上述向量大小生成可用的代码，但生成的代码可能会很慢。（至少到GCC版本5.4，在堆栈和返回之间会生成许多不必要的移动指令。）

为向量分配的内存需要具有足够的对齐性。 malloc()并非在所有情况下都提供足够对齐的内存；您应该改用posix_memalign()。在本地或静态分配向量类型时，可以使用aligned属性来增加GCC使用的对齐方式。在矩阵中，您需要确保行从足够对齐的边界开始；这就是我在结构中使用floats_per_row变量的原因。

在向量（或行）中的元素数量很大，但又不是向量中浮点数数量的倍数的情况下，您应使用“惰性”值填充向量--这些值不会影响结果，例如加法和减法用0.0f，乘法用1.0f。

至少在x86和x86-64上，GCC将为仅使用指针的循环生成更好的代码。例如：

void float_set(float *array, float value, size_t count)
{
    float *const limit = array + count;
    while (array < limit)
        *(array++) = value;
}

比起其他写法，这种写法能够生成更好的代码。

void float_set(float *array, float value, size_t count)
{
    size_t i;
    for (i = 0; i < count; i++)
        array[i] = value;
}

or

void float_set(float *array, float value, size_t count)
{
    while (count--)
        *(array++) = value;
}

(这些通常会产生相似的代码)。个人而言，我会将其实现为

void float_set(float *array, float value, size_t count)
{
    if (!((uintptr_t)array & 7) && !(count & 1)) {
        uint64_t *const end = (uint64_t *)__builtin_assume_aligned((void *)(array + count), 8);
        uint64_t       *ptr = (uint64_t *)__builtin_assume_aligned((void *)array, 8);
        uint64_t        val;

        __builtin_memcpy(&val, &value, 4);
        __builtin_memcpy(4 + (char *)&val, &value, 4);

        while (ptr < end)
            *(ptr++) = val;
    } else {
        uint32_t *const end = (uint32_t *)__builtin_assume_aligned((void *)(array + count), 4);
        uint32_t       *ptr = (uint32_t *)__builtin_assume_aligned((void *)array, 4);
        uint32_t        val;

        __builtin_memcpy(&val, &value, 4);

        while (ptr < end)
            *(ptr++) = val;
    }
}

还有float_copy()

void float_copy(float *target, const float *source, size_t count)
{
    if (!((uintptr_t)source & 7) &&
        !((uintptr_t)target & 7) && !(count & 1)) {
        uint64_t *const end = (uint64_t *)__builtin_assume_aligned((void *)(array + count), 8);
        uint64_t       *ptr = (uint64_t *)__builtin_assume_aligned((void *)target, 8);
        uint64_t       *src = (uint64_t *)__builtin_assume_aligned((void *)source, 8);

        while (ptr < end)
            *(ptr++) = *(src++);

    } else {
        uint32_t *const end = (uint32_t *)__builtin_assume_aligned((void *)(array + count), 4);
        uint32_t       *ptr = (uint32_t *)__builtin_assume_aligned((void *)array, 4);
        uint32_t       *src = (uint32_t *)__builtin_assume_aligned((void *)source, 4);

        while (ptr < end)
            *(ptr++) = *(src++);
    }
}

或者类似于这样的东西。

最难向量化的是complex_float_product()函数。如果你用1.0f填充最终向量中未使用的元素作为实部，用0.0f填充虚部，就可以轻松地计算每个向量的复数积。请记住：

(a + b i) × (c + d i) = (a c - b d) + (a d + b c) i

这里的难点在于高效地计算向量中元素的复数积。幸运的是，这部分对于整体性能并不是很关键（除了非常短的向量或具有非常少列的矩阵），因此在实践中它应该并不重要。

（简而言之，“困难”的部分是找到重新排序元素以最大限度地使用打包向量乘法的方法，而不需要太多的洗牌/移动来减慢速度。）

对于float_add_scaled()函数，您应该创建一个填充有比例因子的向量，类似于以下内容：

void float_add_scaled(float *array, const float *source, float scale, size_t count)
{
    const vec4f  coeff = { scale, scale, scale, scale };
    vec4f       *ptr = (vec4f *)__builtin_assume_aligned((void *)array, 16);
    vec4f *const end = (vec4f *)__builtin_assume_aligned((void *)(array + count), 16);
    const vec4f *src = (vec4f *)__builtin_assume_aligned((void *)source, 16);

    while (ptr < end)
        *(ptr++) += *(src++) * coeff;
}

如果我们忽略对齐和大小的检查，以及备用实现。