AVX2：计算512个浮点数组的点积

Question

AVX2：计算512个浮点数组的点积

20

我先说明一下，我对SIMD指令是一个完全的新手。

基本上，我的CPU支持AVX2指令（Intel(R) Core(TM) i5-7500T CPU @ 2.70GHz）。我想知道计算大小为512的两个std::vector<float>的点积的最快方法。

我在网上找到了这个和这个，以及这个 stack overflow问题建议使用以下函数__m256 _mm256_dp_ps(__m256 m1, __m256 m2, const int mask);，但是这些都提出了不同的进行点积的方式，我不确定哪种是正确的（也是最快的）。

特别是，我正在寻找执行大小为512的向量点积的最快方法（因为我知道向量大小会影响实现）。

谢谢您的帮助。

编辑 1：我对-mavx2 gcc标志有些困惑。如果我使用这些AVX2函数，编译时是否需要添加该标志？此外，如果我编写一个简单的点积实现，使用-OFast gcc标志，gcc能否为我执行这些优化？

编辑 2 如果有人有时间和精力，我非常感谢您能编写完整的实现。我相信其他初学者也会珍视这些信息。

- cyrusbehr

4

你只想在类似于3或4个元素点积的情况下使用dpps指令。对于更大的数组，你需要使用垂直FMA运算到多个累加器中（参考：为什么Haswell架构上mulss只需要3个周期，与Agner's instruction tables不同？（使用多个累加器展开FP循环））。你需要使用 -mfma -mavx2 或者更好的 -march=native选项。是的，你需要启用目标选项来使用任何内嵌函数，并且加上 -O3 选项。 - Peter Cordes

如何正确使用预取指令？展示了一个普通的SIMD点积的内部循环。 - Peter Cordes

除非您使用OpenMP或'-ffast-math'告诉编译器将FP数学视为可交换的，否则'-O3'无法自动矢量化一个天真的点积。 - Peter Cordes

1

"-Ofast" 目前等同于 "-O3 -ffast-math"，所以是可以的。但不幸的是，即使 GCC 对 FP 循环进行展开，它也不会使用多个累加器，因此您将会受到 SIMD FMA 延迟而不是吞吐量的瓶颈影响。（在 Skylake 上是 8 倍） - Peter Cordes

3

Soonts的回答恰好描述了我所说的内容。它需要AVX+FMA，而不是AVX2。感谢@Soonts写下这篇文章。我认为还有其他问答库中有一个良好的标准点积实现，我之前正在寻找但没有立即找到。 - Peter Cordes

显示剩余2条评论

2个回答

2

另一种选择是通过内置函数使用矢量指令，它与CPU无关且编码简单，自动使用融合乘加操作，但可能不如使用特定于CPU的手动编码版本高效。然而，使用矢量内置函数通常在劳动成本上更加便宜，比原始标量版本更快，并可作为手工优化的基准。

一个例子：

#include <vector>
#include <utility>
#include <cassert>

template<class V, int... i>
auto hsum(V v, std::integer_sequence<int, i...>) noexcept {
    return (v[i] + ...);
}

template<class V>
auto hsum(V v) noexcept {
    return hsum(v, std::make_integer_sequence<int, sizeof v / sizeof v[0]>{});
}

float dotProductFma(const std::vector<float>& a, const std::vector<float>& b) noexcept {
    constexpr int SIMD_BYTES = 64;
    using T = float;
    // Unaligned vector, uses unaligned loads.
    using V = T __attribute__((vector_size(SIMD_BYTES), aligned(alignof(T)))); 
    constexpr auto N = sizeof(V) / sizeof(T);

    auto s = a.size();
    assert(s % N == 0); // Assumes size is a multiple of N;
    decltype(s) i;
    auto* va = reinterpret_cast<V const*>(a.data());
    auto* vb = reinterpret_cast<V const*>(b.data());

    V vdot = {};
    for(i = 0; i < s / N; ++i)
        vdot += va[i] * vb[i];

    return hsum(vdot);
}

生成了汇编代码。使用-ffast-math选项生成的代码更紧凑。

累加点积到float可能会因为float的有限精度而遭受灾难性的抵消，当和的项的大小差异足够大时。可以通过使用多个和累加器或/和累加到双精度向量来修复。

- Maxim Egorushkin

1

是的，我也见过这种效果，先向前一步，然后向后一步，以在转向后获得L1d和L2的命中。如果我没记错的话，在Skylake服务器刚发布时，无论我测试的大小是多少，这个效果超过了1%。当然，这仍然是在缓存阻塞方面承认失败，并且在大部分点积计算中受到内存带宽的限制。这很糟糕，这是一个计算强度非常低的问题，每个FMA有两个加载操作。 - undefined

1

@PeterCordes 我错了，你说得很对：对齐到_L1d缓存行大小_边界。 - undefined

1

循环本身看起来完全正常，这是编译器多年来一直在做的事情。在我打开Godbolt之前，我对此非常有信心。高效的洗牌模式很好，并且在我报告编译器错过优化的bug后，它们在这些年里得到了改进 :) 请注意，大小是一个常量10240，所以它不需要生成任何标量清理代码或在循环之前进行任何检查，例如n是否小于1个向量。（而现代的SIMD指令集不需要对齐加载，只是在对齐时性能稍微好一些，同样避免了首先检查指针的需要。） - undefined

1

但是，确实，自动向量化一个归约操作并不容易，而且20年前的GCC可能根本无法做到这一点，我不知道。而且，Clang使用多个累加器展开循环的方式非常令人印象深刻，根据循环中的代码大小选择展开因子。更复杂的循环体只展开2次或者根本不展开。看看GCC的愚蠢展开，使用相同的累加器来节省前端吞吐量，但对于FP延迟没有帮助，很容易看出有很多错误的地方。不过，编译器对于整数来说可以做到这一点，因为它们是可结合的。 - undefined

1

是的，我所说的对齐负载就是指在运行时具有对齐地址的负载。自从 Nehalem 架构的 Intel 处理器和 Bulldozer 架构的 AMD 处理器以来，当地址对齐时，movups 指令没有任何惩罚。（对于加载而非存储，我记得 K10 架构也是如此。）我应该说“对齐数组”，这样更清楚地描述了数据/指针，而不是指令。 - undefined

显示剩余20条评论

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Soonts · Accepted Answer

_mm256_dp_ps只适用于2到4个元素的点积运算；对于更长的向量，请在循环中使用垂直SIMD并在最后缩减为标量。在循环中使用_mm256_dp_ps和_mm256_add_ps会慢得多。

GCC和clang要求您启用（通过命令行选项）您使用内部函数的ISA扩展，而不像MSVC和ICC那样。

以下代码可能接近您的CPU的理论性能极限。未经测试。

使用clang或gcc -O3 -march=native进行编译。（至少需要-mavx -mfma，但-mtune选项由-march隐含，也可以使用其他-mpopcnt和其他arch=native启用的选项。调整选项对于大多数带有FMA的CPU来说是编译效率的关键，特别是-mno-avx256-split-unaligned-load：为什么gcc不能将_mm256_loadu_pd解析为单个vmovupd？）

或者使用MSVC -O2 -arch:AVX2进行编译。

#include <immintrin.h>
#include <vector>
#include <assert.h>

// CPUs support RAM access like this: "ymmword ptr [rax+64]"
// Using templates with offset int argument to make easier for compiler to emit good code.

// Multiply 8 floats by another 8 floats.
template<int offsetRegs>
inline __m256 mul8( const float* p1, const float* p2 )
{
    constexpr int lanes = offsetRegs * 8;
    const __m256 a = _mm256_loadu_ps( p1 + lanes );
    const __m256 b = _mm256_loadu_ps( p2 + lanes );
    return _mm256_mul_ps( a, b );
}

// Returns acc + ( p1 * p2 ), for 8-wide float lanes.
template<int offsetRegs>
inline __m256 fma8( __m256 acc, const float* p1, const float* p2 )
{
    constexpr int lanes = offsetRegs * 8;
    const __m256 a = _mm256_loadu_ps( p1 + lanes );
    const __m256 b = _mm256_loadu_ps( p2 + lanes );
    return _mm256_fmadd_ps( a, b, acc );
}

// Compute dot product of float vectors, using 8-wide FMA instructions.
float dotProductFma( const std::vector<float>& a, const std::vector<float>& b )
{
    assert( a.size() == b.size() );
    assert( 0 == ( a.size() % 32 ) );
    if( a.empty() )
        return 0.0f;

    const float* p1 = a.data();
    const float* const p1End = p1 + a.size();
    const float* p2 = b.data();

    // Process initial 32 values. Nothing to add yet, just multiplying.
    __m256 dot0 = mul8<0>( p1, p2 );
    __m256 dot1 = mul8<1>( p1, p2 );
    __m256 dot2 = mul8<2>( p1, p2 );
    __m256 dot3 = mul8<3>( p1, p2 );
    p1 += 8 * 4;
    p2 += 8 * 4;

    // Process the rest of the data.
    // The code uses FMA instructions to multiply + accumulate, consuming 32 values per loop iteration.
    // Unrolling manually for 2 reasons:
    // 1. To reduce data dependencies. With a single register, every loop iteration would depend on the previous result.
    // 2. Unrolled code checks for exit condition 4x less often, therefore more CPU cycles spent computing useful stuff.
    while( p1 < p1End )
    {
        dot0 = fma8<0>( dot0, p1, p2 );
        dot1 = fma8<1>( dot1, p1, p2 );
        dot2 = fma8<2>( dot2, p1, p2 );
        dot3 = fma8<3>( dot3, p1, p2 );
        p1 += 8 * 4;
        p2 += 8 * 4;
    }

    // Add 32 values into 8
    const __m256 dot01 = _mm256_add_ps( dot0, dot1 );
    const __m256 dot23 = _mm256_add_ps( dot2, dot3 );
    const __m256 dot0123 = _mm256_add_ps( dot01, dot23 );
    // Add 8 values into 4
    const __m128 r4 = _mm_add_ps( _mm256_castps256_ps128( dot0123 ), _mm256_extractf128_ps( dot0123, 1 ) );
    // Add 4 values into 2
    const __m128 r2 = _mm_add_ps( r4, _mm_movehl_ps( r4, r4 ) );
    // Add 2 lower values into the final result
    const __m128 r1 = _mm_add_ss( r2, _mm_movehdup_ps( r2 ) );
    // Return the lowest lane of the result vector.
    // The intrinsic below compiles into noop, modern compilers return floats in the lowest lane of xmm0 register.
    return _mm_cvtss_f32( r1 );
}

可能的进一步改进：

1. 使用8个向量而不是4个进行展开。我已经检查了gcc 9.2 asm output，编译器只使用了16个可用的向量寄存器中的8个。

2. 确保输入向量都对齐，例如使用自定义分配器，在msvc上调用_aligned_malloc / _aligned_free，或在gcc和clang上使用aligned_alloc / free。然后将_mm256_loadu_ps替换为_mm256_load_ps。

要自动向量化一个简单的标量点积，你还需要使用OpenMP SIMD或者`-ffast-math`（由`-Ofast`隐含）让编译器将浮点数运算视为可结合的，即使实际上并不是（因为舍入误差）。但是GCC在自动向量化时不会使用多个累加器，即使它进行了展开，所以你的瓶颈将是FMA延迟，而不是加载吞吐量。

（每个FMA需要2个加载操作，这意味着这段代码的吞吐量瓶颈是向量加载，而不是实际的FMA运算。）

2023年更新：由于这个答案收到了很多赞，这里提供另一个版本，支持任意长度的向量，不一定是32个元素的倍数。主循环是一样的，不同之处在于处理余数的方式。

正如你所见，以一种既高效又公平的方式处理余数是相对棘手的，这涉及到求和的顺序对结果的数值精度有影响。实现的关键部分是_mm256_maskload_ps条件加载指令。

#include <immintrin.h>
#include <vector>
#include <algorithm>
#include <assert.h>
#include <stdint.h>

// CPUs support RAM access like this: "ymmword ptr [rax+64]"
// Using templates with offset int argument to make easier for compiler to emit good code.

// Returns acc + ( p1 * p2 ), for 8 float lanes
template<int offsetRegs>
inline __m256 fma8( __m256 acc, const float* p1, const float* p2 )
{
    constexpr ptrdiff_t lanes = offsetRegs * 8;
    const __m256 a = _mm256_loadu_ps( p1 + lanes );
    const __m256 b = _mm256_loadu_ps( p2 + lanes );
    return _mm256_fmadd_ps( a, b, acc );
}

#ifdef __AVX2__
inline __m256i makeRemainderMask( ptrdiff_t missingLanes )
{
    // This is not a branch, compiles to conditional move
    missingLanes = std::max( missingLanes, (ptrdiff_t)0 );
    // Make a mask of 8 bytes
    // No need to clip for missingLanes <= 8 because the shift is already good, results in zero
    uint64_t mask = ~(uint64_t)0;
    mask >>= missingLanes * 8;
    // Sign extend these bytes into int32 lanes in AVX vector
    __m128i tmp = _mm_cvtsi64_si128( (int64_t)mask );
    return _mm256_cvtepi8_epi32( tmp );
}
#else
// Aligned by 64 bytes
// The load will only touch a single cache line, no penalty for unaligned load
static const int alignas( 64 ) s_remainderLoadMask[ 16 ] = {
    -1, -1, -1, -1, -1, -1, -1, -1, 0, 0, 0, 0, 0, 0, 0, 0 };
inline __m256i makeRemainderMask( ptrdiff_t missingLanes )
{
    // These aren't branches, they compile to conditional moves
    missingLanes = std::max( missingLanes, (ptrdiff_t)0 );
    missingLanes = std::min( missingLanes, (ptrdiff_t)8 );
    // Unaligned load from a constant array
    const int* rsi = &s_remainderLoadMask[ missingLanes ];
    return _mm256_loadu_si256( ( const __m256i* )rsi );
}
#endif

// Same as fma8(), load conditionally using the mask
// When the mask has all bits set, an equivalent of fma8(), but 1 instruction longer
// When the mask is a zero vector, the function won't load anything, will return `acc`
template<int offsetRegs>
inline __m256 fma8rem( __m256 acc, const float* p1, const float* p2, ptrdiff_t rem )
{
    constexpr ptrdiff_t lanes = offsetRegs * 8;
    // Generate the mask for conditional loads
    // The implementation depends on whether AVX2 is enabled with compiler switches
    const __m256i mask = makeRemainderMask( ( 8 + lanes ) - rem );
    // These conditional load instructions produce zeros for the masked out lanes
    const __m256 a = _mm256_maskload_ps( p1 + lanes, mask );
    const __m256 b = _mm256_maskload_ps( p2 + lanes, mask );
    return _mm256_fmadd_ps( a, b, acc );
}

// Compute dot product of float vectors, using 8-wide FMA instructions
float dotProductFma( const std::vector<float>& a, const std::vector<float>& b )
{
    assert( a.size() == b.size() );
    const size_t length = a.size();
    if( length == 0 )
        return 0.0f;

    const float* p1 = a.data();
    const float* p2 = b.data();
    // Compute length of the remainder; 
    // We want a remainder of length [ 1 .. 32 ] instead of [ 0 .. 31 ]
    const ptrdiff_t rem = ( ( length - 1 ) % 32 ) + 1;
    const float* const p1End = p1 + length - rem;

    // Initialize accumulators with zeros
    __m256 dot0 = _mm256_setzero_ps();
    __m256 dot1 = _mm256_setzero_ps();
    __m256 dot2 = _mm256_setzero_ps();
    __m256 dot3 = _mm256_setzero_ps();

    // Process the majority of the data.
    // The code uses FMA instructions to multiply + accumulate, consuming 32 values per loop iteration.
    // Unrolling manually for 2 reasons:
    // 1. To reduce data dependencies. With a single register, every loop iteration would depend on the previous result.
    // 2. Unrolled code checks for exit condition 4x less often, therefore more CPU cycles spent computing useful stuff.
    while( p1 < p1End )
    {
        dot0 = fma8<0>( dot0, p1, p2 );
        dot1 = fma8<1>( dot1, p1, p2 );
        dot2 = fma8<2>( dot2, p1, p2 );
        dot3 = fma8<3>( dot3, p1, p2 );
        p1 += 32;
        p2 += 32;
    }

    // Handle the last, possibly incomplete batch of length [ 1 .. 32 ]
    // To save multiple branches, we load that entire batch with `vmaskmovps` conditional loads
    // On modern CPUs, the performance of such loads is pretty close to normal full vector loads
    dot0 = fma8rem<0>( dot0, p1, p2, rem );
    dot1 = fma8rem<1>( dot1, p1, p2, rem );
    dot2 = fma8rem<2>( dot2, p1, p2, rem );
    dot3 = fma8rem<3>( dot3, p1, p2, rem );

    // Add 32 values into 8
    dot0 = _mm256_add_ps( dot0, dot2 );
    dot1 = _mm256_add_ps( dot1, dot3 );
    dot0 = _mm256_add_ps( dot0, dot1 );
    // Add 8 values into 4
    __m128 r4 = _mm_add_ps( _mm256_castps256_ps128( dot0 ),
        _mm256_extractf128_ps( dot0, 1 ) );
    // Add 4 values into 2
    r4 = _mm_add_ps( r4, _mm_movehl_ps( r4, r4 ) );
    // Add 2 lower values into the scalar result
    r4 = _mm_add_ss( r4, _mm_movehdup_ps( r4 ) );

    // Return the lowest lane of the result vector.
    // The intrinsic below compiles into noop, modern compilers return floats in the lowest lane of xmm0 register.
    return _mm_cvtss_f32( r4 );
}