How do I create a left-packed vector of the indices of the zeros in a SIMD vector?

c++ c simd avx2

5

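Please tell me, I can't figure it out myself:

Here I have a __m128i SIMD vector. Each of its 16 bytes contains one of the following values:

1 0 1 1 0 1 0 1 1 1 0 1 0 1 0 1

Is it possible to transform this vector somehow so that all the ones are removed and each zero is replaced by the index of that zero element in the vector? That is, like this: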

0   1   2   3   4   5   6   7   8   9   10  11  12  13  14  15
                                                            
1   0   1   1   0   1   0   1   1   1   0   1   0   1   0   1
                                                            
    1           4       6               10      12     14

And finally get a vector containing only these values:

1  4  6  10  12  14

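What logic would produce such a result? Which SIMD instructions should be used?

PS: I am just starting to learn SIMD, so there is a lot I don't know or understand yet.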

- Optimus1

2

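Yes. Is AVX512VBMI2 available? That would make it very easy. It is possible without it, but not as easy. - harold

So if I understand correctly, you want to zero out the whole vector, but before that, use the positions of the initial zeros to form a new vector holding those position numbers? - Arkoudinos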

2

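Which SIMD instruction sets do you have available? x86-64 with AVX2? I'm assuming some kind of x86 rather than AArch64 ASIMD, since you said "__m128i". - Peter Cordes

Can you use fast BMI2 pdep / pext, as on Intel Haswell or AMD Zen3? You might want to use BMI2 pext to left-pack 8 bytes at a time, along the lines of "AVX2 what is the most efficient way to pack left based on a mask?". Without AVX512 you don't have a SIMD left-pack instruction, and a table of 2^16 __m128i shuffle masks is obviously terrible. (Or, I guess, a table of the final __m128i results, i.e. what you get after constant propagation.) Either way, 64K x 16 bytes would be a huge lookup table that would miss in cache nearly every time. - Peter Cordes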

2

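It's a bit unusual to have compare results as 0 and 1 rather than the usual 0 / 0xff. Note that Soonts's answer turns your 1s into 0xff with a compare. If your earlier code naturally produces 0/1 rather than 0/-1, that's fine, but don't go out of your way to do extra work to get there. - Peter Cordes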

3 Answers

3

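Horizontal, data-dependent operations are hard. This is not something the traditional SIMD building blocks are good at, so it is a tricky problem to tackle while you are still learning SIMD.

If you have AVX512VBMI2 (Ice Lake), vpcompressb can do this in one instruction on a constant. (Well, two, counting the test-into-mask of the input.)
Or with AVX-512BW (Skylake-avx512), you can use vpcompressd on a constant vector of 16x uint32_t, then pack the compressed __m512i down to a byte vector with vpmovdb. (After the same test-into-mask of the byte vector.)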

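16 separate elements means a single table lookup isn't viable; 2^16 x __m128i would be 64K x 16 bytes = 1 MiB, and most accesses would miss in cache. (The code would be simple, though: just _mm_cmpeq_epi8 against zero, or _mm_slli_epi32(v, 7), followed by _mm_movemask_epi8, then use that 16-bit mask as an array index.)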
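The mask computation mentioned in the parenthesis above can be sketched as follows. This is only a minimal illustration, assuming SSE2 and input bytes that are strictly 0 or 1; the function names `zero_mask` and `one_mask` are made up for this example and do not come from the answer.

```cpp
#include <cstdint>
#include <emmintrin.h>  // SSE2 intrinsics

// Bit i of the result is set when byte i of v is zero.
inline unsigned zero_mask(__m128i v)
{
    return (unsigned)_mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
}

// Alternative without a compare: shifting each 32-bit lane left by 7 moves
// bit 0 of every byte into bit 7 (the sign bit) of the same byte, which is
// the bit that movemask collects. For 0/1 inputs this yields the mask of 1s.
inline unsigned one_mask(__m128i v)
{
    return (unsigned)_mm_movemask_epi8(_mm_slli_epi32(v, 7));
}
```

For the question's sample input, the zero bytes sit at positions 1, 4, 6, 10, 12 and 14, so `zero_mask` should yield 0x5452 and `one_mask` its 16-bit complement.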

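Doing 4 lookups of 4-byte chunks, each indexed by 4 mask bits, could work. (With a SWAR add of 0x04040404 / 0x08080808 / 0x0c0c0c0c to offset the result.) Your table could also store how far to advance the pointer before the next store, or you could compute that with _lzcnt_u32 or something similar, or with _popcnt_u32(zpos & 0xf).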

#include <stdint.h>
#include <immintrin.h>
#include <stdalign.h>
#include <string.h>

// untested but compiles ok
char *zidx_SSE2(char *outbuf, __m128i v)
{
   alignas(64) static struct __attribute__((packed)) {
       uint32_t idx;
       uint8_t count;  // or make this also uint32_t, but still won't allow a memory-source add unless it's uintptr_t.  Indexing could be cheaper in a PIE though, *8 instead of *5 which needs both base and idx
   }lut[] = { // 16x 5-byte entries
      /*[0b0000]=*/ {0, 0}, /* [0b0001]= */ {0x00000000, 1}, /* [0b0010]= */ {0x00000001, 1 },
      //...  left-packed indices, count of non-zero bits
              /* [0b1111]=*/ {0x03020100, 4}
    };
    // Maybe pack the length into the high 4 bits, and mask?  Maybe not, it's a small LUT

   unsigned zpos = _mm_movemask_epi8(_mm_cmpeq_epi8(v, _mm_setzero_si128()));
   for (int i=0 ; i<16 ; i+=4){
       uint32_t idx = lut[zpos & 0xf].idx;
       idx += (0x01010101 * i);  // this strength-reduces or is a constant after fully unrolling.  GCC -O2 even realizes it can use add reg, 0x04040404 *as* the loop counter; clang -fno-unroll-loops doesn't
       // idxs from bits 0..3, bits 4..7, bits 8..11, or bits 12..15
       memcpy(outbuf, &idx, sizeof(idx));   // x86 is little-endian.  Aliasing safe unaligned store.
       outbuf += lut[zpos & 0xf].count;  // or popcount(zpos&0xf)
       zpos >>= 4;
   }
   return outbuf;  // pointer to next byte after the last valid idx
}

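https://godbolt.org/z/59Ev1Tz37 shows GCC and clang output without loop unrolling. gcc -O3 fully unrolls it, as does clang at -O2 by default.

It never stores more than 16 bytes into outbuf, and the number of valid bytes is smaller for inputs with fewer than 16 zero bytes. (Every store to outbuf is still 4 bytes wide, even if the chunk actually contained no indices at all.) If all the input vector bytes are 0, the 4 stores don't overlap at all; otherwise they overlap partially or fully. That is fine; the cache and store buffer can absorb this easily.

SIMD vectors are fixed width, so I'm not sure exactly what you mean by the output containing only those values. The upper bytes have to be something; if you want zeros, you could zero outbuf first. Note that reloading into a __m128i vector right after the four 32-bit stores would cause a store-forwarding stall (extra latency). That is not a disaster, but it is not great either. It is best to write directly into your actual output.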
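Since the table in the listing above is abbreviated, here is a self-contained sketch of the same 4-bits-at-a-time idea that builds all 16 entries at startup. It is a hedged illustration rather than the answer's exact code: the names `Lut4Entry`, `make_lut4`, `kLut4` and `zidx_lut4` are hypothetical, it takes the 16-bit zero mask as input instead of a vector, and it assumes a little-endian target and an output buffer of at least 16 bytes.

```cpp
#include <array>
#include <cstddef>
#include <cstdint>
#include <cstring>

struct Lut4Entry {
    uint32_t idx;   // left-packed indices, one per byte, lowest index in the low byte
    uint8_t count;  // number of set bits in the 4-bit mask
};

// Build the 16-entry table, e.g. mask 0b1111 -> {0x03020100, 4}.
static std::array<Lut4Entry, 16> make_lut4()
{
    std::array<Lut4Entry, 16> t{};
    for (unsigned m = 0; m < 16; ++m) {
        uint32_t packed = 0;
        unsigned n = 0;
        for (unsigned bit = 0; bit < 4; ++bit)
            if (m & (1u << bit))
                packed |= bit << (8 * n++);
        t[m] = Lut4Entry{packed, (uint8_t)n};
    }
    return t;
}

static const std::array<Lut4Entry, 16> kLut4 = make_lut4();

// zmask: bit i is set when input byte i is zero.
// Returns the number of valid indices written to out.
inline size_t zidx_lut4(uint8_t *out, unsigned zmask)
{
    uint8_t *p = out;
    for (unsigned i = 0; i < 16; i += 4) {
        const Lut4Entry &e = kLut4[(zmask >> i) & 0xF];
        // SWAR add: offsets all four bytes by i; no byte can carry (max 3 + 12)
        uint32_t idx = e.idx + 0x01010101u * i;
        std::memcpy(p, &idx, sizeof idx);  // always a 4-byte store
        p += e.count;                      // the next store overlaps the unused bytes
    }
    return (size_t)(p - out);
}
```

With the question's input (zero mask 0x5452) this should write 1, 4, 6, 10, 12, 14 and return 6; bytes after the returned length hold leftovers from the overlapping stores.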

BMI2 `pext` is a horizontal pack operation

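You said in comments that you want this for an i7 with AVX2.
That also means you have fast BMI2 pext / pdep (Intel since Haswell, AMD since Zen3). Earlier AMD CPUs support these instructions, but they are not fast there. They do the bitwise equivalent of vpcompressb / vpexpandb on a uint64_t in an integer register.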

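This allows a trick similar to "AVX2 what is the most efficient way to pack left based on a mask?".
After turning your vector into a mask of 0 / 0xf nibbles, we can extract the corresponding nibbles with values 0..15 into the bottom of an integer register with one pext instruction.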
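To make the nibble trick concrete, here is a portable scalar model of it. It is only a sketch under stated assumptions: `pext64_model` stands in for the real `_pext_u64` so the example runs without BMI2, and `nibble_mask` and `packed_nibbles` are hypothetical helper names. It starts from the 16-bit zero mask rather than from the vector.

```cpp
#include <cstdint>

// Portable model of BMI2 pext: gather the bits of src selected by mask into
// the low bits of the result. Real code would use _pext_u64 from <immintrin.h>.
inline uint64_t pext64_model(uint64_t src, uint64_t mask)
{
    uint64_t out = 0;
    unsigned k = 0;
    for (; mask; mask &= mask - 1) {          // visit set mask bits, lowest first
        uint64_t lowest = mask & (0 - mask);  // isolate the lowest set bit
        if (src & lowest)
            out |= 1ull << k;
        ++k;
    }
    return out;
}

// Expand bit i of a 16-bit mask into nibble i (0x0 or 0xF) of a 64-bit word.
// With BMI2 this could be _pdep_u64(m16, 0x1111111111111111) * 0xF.
inline uint64_t nibble_mask(unsigned m16)
{
    uint64_t out = 0;
    for (unsigned i = 0; i < 16; ++i)
        if (m16 & (1u << i))
            out |= 0xFull << (4 * i);
    return out;
}

// Left-packed indices of the set bits of zmask, one index per nibble.
inline uint64_t packed_nibbles(unsigned zmask)
{
    return pext64_model(0xFEDCBA9876543210ull, nibble_mask(zmask));
}
```

For the zero mask 0x5452 the result should be 0xECA641: reading nibbles from the bottom gives 1, 4, 6, 10, 12, 14. The nibbles would still have to be unpacked to bytes afterwards.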

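Or maybe keep bytes as the smallest packed unit, to avoid having to unpack nibbles back into bytes. Then you would need two separate 8-byte left-pack operations, plus popcnt or lzcnt to figure out how they should overlap.

Your pext operands would be the 0 / 0xff bytes from _mm_cmpeq_epi8(v, _mm_setzero_si128()), extracted as two uint64_t halves with lo = _mm_cvtsi128_si64(cmp) and hi = _mm_extract_epi64(cmp, 1).

As in the LUT version, use memcpy as an unaligned, aliasing-safe store.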

- Peter Cordes

3

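Slightly customized from here.

This SSSE3 strategy processes 64-bit words and then recombines the results into a 128-bit word. Merging the two 64-bit halves in an xmm register is more expensive than a compress-store to memory using overlapping writes.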

/// `v` input bytes are either a 1 or 0
/// `v` output bytes are the "compressed" indices of zeros locations in the input
///     unused leading bytes in the output are filled with garbage.
/// returns the number of used bytes in `v`
static inline
size_t zidx_SSSE3 (__m128i &v) {

    static const uint64_t table[27] = { /* 216 bytes */
        0x0000000000000706, 0x0000000000070600, 0x0000000007060100, 0x0000000000070602,
        0x0000000007060200, 0x0000000706020100, 0x0000000007060302, 0x0000000706030200,
        0x0000070603020100, 0x0000000000070604, 0x0000000007060400, 0x0000000706040100,
        0x0000000007060402, 0x0000000706040200, 0x0000070604020100, 0x0000000706040302,
        0x0000070604030200, 0x0007060403020100, 0x0000000007060504, 0x0000000706050400,
        0x0000070605040100, 0x0000000706050402, 0x0000070605040200, 0x0007060504020100,
        0x0000070605040302, 0x0007060504030200, 0x0706050403020100
    };

    const __m128i id = _mm_set_epi8(15, 14, 13, 12, 11, 10, 9, 8, 7, 6, 5, 4, 3, 2, 1, 0);

    // adding 8 to each shuffle index is cheaper than extracting the high qword
    const __m128i offset = _mm_cvtsi64_si128(0x0808080808080808);

    // bits[4:0] = index -> ((trit_d * 0) + (trit_c * 9) + (trit_b * 3) + (trit_a * 1))
    // bits[15:7] = popcnt
    const __m128i sadmask = _mm_set1_epi64x(0x8080898983838181);

    // detect 1's (spaces)
    __m128i mask = _mm_sub_epi8(_mm_setzero_si128(), v);
    
    // manually process 16-bit lanes to reduce possible combinations
    v = _mm_add_epi8(v, id);

    // extract bitfields describing each qword: index, popcnt
    __m128i desc = _mm_sad_epu8(_mm_and_si128(mask, sadmask), sadmask);
    size_t lo_desc = (size_t)_mm_cvtsi128_si32(desc);
    size_t hi_desc = (size_t)_mm_extract_epi16(desc, 4);

    // load shuffle control indices from pre-computed table
    __m128i lo_shuf = _mm_loadl_epi64((__m128i*)&table[lo_desc & 0x1F]);
    __m128i hi_shuf = _mm_or_si128(_mm_loadl_epi64((__m128i*)&table[hi_desc & 0x1F]), offset);

    //// recombine shuffle control qwords ////
    // emulates a variable `_mm_bslli_si128(hi_shuf, lo_popcnt)` operation
    desc = _mm_srli_epi16(desc, 7); // isolate popcnts
    __m128i shift = _mm_shuffle_epi8(desc, _mm_setzero_si128()); // broadcast popcnt of low qword
    hi_shuf = _mm_shuffle_epi8(hi_shuf, _mm_sub_epi8(id, shift)); // byte shift left
    __m128i shuf = _mm_max_epu8(lo_shuf, hi_shuf); // merge

    v = _mm_shuffle_epi8(v, shuf);
    return (hi_desc + lo_desc) >> 7; // popcnt
}

If we are only extracting these indices for later scalar processing, we might want to consider using pmovmskb and then peeling off each index as needed.

x = (unsigned)_mm_movemask_epi8(compare_mask);
while (x) {
    idx = count_trailing_zeros(x);
    x &= x - 1; // clear lowest set bit
    DoSomethingTM(idx);
}
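A runnable variant of the loop above, as a sketch: `set_bit_indices` is a hypothetical name, and `__builtin_ctz` is a GCC/clang builtin assumed here in place of the unspecified `count_trailing_zeros` (C++20 code could use `std::countr_zero` instead). It collects the indices into a vector rather than calling a callback.

```cpp
#include <cstdint>
#include <vector>

// Collect the indices of the set bits of a movemask result, lowest first.
inline std::vector<int> set_bit_indices(unsigned x)
{
    std::vector<int> out;
    while (x) {
        out.push_back(__builtin_ctz(x));  // index of the lowest set bit
        x &= x - 1;                       // clear the lowest set bit
    }
    return out;
}
```

The loop runs once per set bit, so its cost depends on how many zeros the input had, not on the vector width.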

- aqrit

- Soonts · Accepted Answer

If you have BMI2, use the following version.

__m128i compressZeroIndices_bmi2( __m128i v )
{
    const __m128i zero = _mm_setzero_si128();
    // Replace zeros with 0xFF
    v = _mm_cmpeq_epi8( v, zero );

    // Extract low/high pieces into scalar registers for PEXT instruction
    uint64_t low = (uint64_t)_mm_cvtsi128_si64( v );
    uint64_t high = (uint64_t)_mm_extract_epi64( v, 1 );

    // Count payload bytes in the complete vector
    v = _mm_sub_epi8( zero, v );
    v = _mm_sad_epu8( v, zero );
    v = _mm_add_epi64( v, _mm_srli_si128( v, 8 ) );
    v = _mm_shuffle_epi8( v, zero );
    // Make a mask vector filled with 0 for payload bytes, 0xFF for padding
    const __m128i identity = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 );
    v = _mm_max_epu8( v, identity );
    __m128i mask = _mm_cmpeq_epi8( v, identity );

    // The following line requires C++/20
    // If you don't have it, use #ifdef _MSC_VER to switch between __popcnt64() and _popcnt64() intrinsics.
    uint64_t lowBits = std::popcount( low );
    // Use BMI2 to gather these indices
    low = _pext_u64( 0x0706050403020100ull, low );
    high = _pext_u64( 0x0F0E0D0C0B0A0908ull, high );

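    // Merge payload into a vector
    // lowBits is a multiple of 8 in [0, 64]; shifting a 64-bit value by 64 is
    // undefined behavior in C++, so handle both end cases explicitly
    const uint64_t lowHalf = ( lowBits < 64 ) ? ( low | ( high << lowBits ) ) : low;
    const uint64_t highHalf = ( lowBits != 0 ) ? ( high >> ( 64 - lowBits ) ) : 0;
    v = _mm_cvtsi64_si128( (int64_t)lowHalf );
    v = _mm_insert_epi64( v, (int64_t)highHalf, 1 );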

    // Apply the mask to set unused elements to -1, enables pmovmskb + tzcnt to find the length
    return _mm_or_si128( v, mask );
}

Here is another version that does not use BMI2. It is probably slower on most CPUs, but the code is simpler and it does not use any scalar instructions.

inline __m128i sortStep( __m128i a, __m128i perm, __m128i blend )
{
    // The min/max are independent and their throughput is 0.33-0.5 cycles,
    // so this whole function only takes 3 (AMD) or 4 (Intel) cycles to complete
    __m128i b = _mm_shuffle_epi8( a, perm );
    __m128i i = _mm_min_epu8( a, b );
    __m128i ax = _mm_max_epu8( a, b );
    return _mm_blendv_epi8( i, ax, blend );
}

__m128i compressZeroIndices( __m128i v )
{
    // Replace zeros with 0-based indices, ones with 0xFF
    v = _mm_cmpgt_epi8( v, _mm_setzero_si128() );
    const __m128i identity = _mm_setr_epi8( 0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15 );
    v = _mm_or_si128( v, identity );

    // Sort bytes in the vector with a network
    // https://demonstrations.wolfram.com/SortingNetworks/
    // Click the "transposition" algorithm on that demo
    const __m128i perm1 = _mm_setr_epi8( 1, 0, 3, 2, 5, 4, 7, 6, 9, 8, 11, 10, 13, 12, 15, 14 );
    const __m128i blend1 = _mm_set1_epi16( (short)0xFF00 );
    const __m128i perm2 = _mm_setr_epi8( 0, 2, 1, 4, 3, 6, 5, 8, 7, 10, 9, 12, 11, 14, 13, 15 );
    const __m128i blend2 = _mm_setr_epi8( 0, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0, -1, 0 );
    for( size_t i = 0; i < 8; i++ )
    {
        v = sortStep( v, perm1, blend1 );
        v = sortStep( v, perm2, blend2 );
    }
    return v;
}
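Why sorting solves the problem may be easier to see in a scalar model. The sketch below is not the answer's code; `Bytes` and `compress_zero_indices_model` are hypothetical names, and it assumes the input bytes are strictly 0 or 1. Each zero byte becomes its own index, each 1 becomes 0xFF, and 16 rounds of odd-even transposition sort then move the indices to the front in ascending order, with the 0xFF padding behind them.

```cpp
#include <array>
#include <cstdint>
#include <utility>

using Bytes = std::array<uint8_t, 16>;

inline Bytes compress_zero_indices_model(Bytes v)
{
    // Zeros become their 0-based index, ones become 0xFF
    for (int i = 0; i < 16; ++i)
        v[i] = static_cast<uint8_t>(v[i] ? 0xFF : i);

    // Odd-even transposition sort: even rounds compare pairs (0,1),(2,3),...
    // and odd rounds compare pairs (1,2),(3,4),...; 16 rounds sort 16 elements.
    for (int round = 0; round < 16; ++round)
        for (int i = round & 1; i + 1 < 16; i += 2)
            if (v[i] > v[i + 1])
                std::swap(v[i], v[i + 1]);
    return v;
}
```

In the SIMD version, one `sortStep` call performs all compare-exchanges of a round at once, which is why the loop there runs 8 times with two steps each.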

Tip: if you want the length of the output vector, use this function:

uint32_t vectorLength( __m128i v )
{
    uint32_t mask = (uint32_t)_mm_movemask_epi8( v );
    mask |= 0x10000;
    return _tzcnt_u32( mask );
}
