比特运算中计算奇偶校验的最快方法是什么？

Question

比特运算中计算奇偶校验的最快方法是什么？

4

我的解决方案（对于输入块的每个比特，都有这样一行）：

*parity ^= (((x[0] >> 30) & 0x00000001) * 0xc3e0d69f);

所有类型均为 uint32。此行代码获取输入 x 的第二位，将其移动到 LSB 并将所有其他位设置为零。然后，将32位奇偶校验值与相应的该位奇偶校验值进行异或运算。

我发现这个乘法解决方案是执行条件异或的最快方法。有更快的方法吗？

- vls

1

你在谈论哪种语言/处理器？ - Oded

你要计算奇偶校验的块有多大？为什么要乘以0xc3e0d69f？ - The Archetypal Paul

我在谈论C++。数据块为256位（因此uint32 x [0..7]）。奇偶校验位为32位（存储在uint32中）。对于输入的每个设置位，都会将特定的XOR-Mask应用于奇偶校验字段（LFSR的并行实现）。 - vls

那个神奇的数字（0xc3E0d69F）来自哪里？来自 Hacker's Delight 吗？ - Peter Mortensen

更普遍（而且更慢）的情况是 *在32位整数中计算设置位的数量*。 - Peter Mortensen

6个回答

3

我不完全理解你所指的奇偶校验类型，但如果这行代码实现了你想要的功能，那么它可能还有改进的空间。

通用规则：对于{x,1}中的x，{x * N == -x & N}。

这是因为对于0来说，-x会将所有位都重置，而对于1来说，-1会将所有位都设置。因此，原始代码行可以重写为：

*parity ^= (-((x[0] >> 30) & 0x00000001) & 0xc3e0d69f);

许多微处理器上计算速度比乘法更快的是两个操作，但您应该检查一下。

此外，代码可以利用有符号右移。

*parity ^= (((int32_t)x[0] << 1 >> 31) & 0xc3e0d69f);

首先将第30位移动到第31位，即符号位，然后将符号位扩展到其他所有位上，因为在大多数机器上，右移操作相当于向下取整（x / 2^N），因此用符号位填充移位后的位（abc...yz>>3 == aaaabc...yz）。

但是这些技巧在C标准中被定义为未定义行为，因此不具备可移植性。请谨慎使用。

- Vovanium

1

奇偶校验计算任务相当于计数。它也被称为“计算设置位数”，“人口统计”或简单地称为popcount。一些处理器有一个高效的指令来计算它（POPCNT，VCNT）。

我建议使用popcount的最低位。

可以通过内联汇编程序或使用内置函数访问：

__builtin_popcount() / __popcnt() / std::bitset :: count()

适用于GCC，Visual Studio和C ++。

个人而言，我更喜欢使用__builtin_parity()将此工作交给编译器。

- kimstik

1

有些处理器可以为您完成此操作。请参见x86的奇偶校验标志。

- nmichaels

它不仅仅是一个偶数/奇数校验位，它有32位。 - vls

你觉得怎么样？是使用汇编语言替代C语言吗？还是使用一些内嵌汇编？也许可以通过提供一个工作示例来扩展答案？ - Peter Mortensen

我建议查看你所针对的处理器，看看它是否具有为你计算奇偶校验的功能。如果有的话，请查看你的工具链是否已经暴露出来。编译器通常会有像这样单个指令的原语。当然，关于过早优化的一般警告也适用。以最快的方式完成并不总是最明智的方式。特别是在需要可移植的代码上。 - nmichaels

0

如果我理解问题正确，您正在进行

for (i = 0; i < 32; i++)
    *parity ^= (((x[0] >> i) & 1) * SOME_CONST[i]);

如果是这样，最好使用查找表：

for (i = 0; i < 4; i++)
    *parity ^= PARITY_LUT[i][ (x[0] >> (i*8)) & 0xFF];

它将需要256千字节，但速度会更快。

- ruslik

一个查找表大约需要10条指令，即使该表位于L1缓存中。 - undefined

0

斯坦福的bithacks代码非常不够优化。为了完整起见，这里是在所有平台和环境中最好的代码：

#include <stdint.h>

#if !defined(__wasm__) && defined(__cplusplus) && defined(__cpp_lib_bitops)
#   include <bit>
#   define paritytree_parity32(x) (std::popcount(x) & 1)
#elif !defined(__wasm__) && defined(__has_builtin)
#   if __has_builtin(__builtin_parity)
#       if INT_MAX >= INT32_MAX
#           define paritytree_parity32(x) __builtin_parity(x)
#       else
#           define paritytree_parity32(x) __builtin_parityl(x)
#       endif
#   endif
#endif
#if !defined(__wasm__) && defined(_MSC_VER) && defined(_M_X64) && !defined(paritytree_parity32)
#   include <intrin.h>
#   define paritytree_parity32(x) __popcnt(x)
#elif !defined(paritytree_parity32)
    static inline uint32_t paritytree_parity32(uint32_t v) {
        // On x86: 2 lea, 2 and, 1 xor, 1 mul, 1 shr, and 0 mov(!!!)
        // On arm64: 2 and, 1 add, 1 eor, 1 mul, 1 lsr, and 1 mov(!!!)
        v = (v ^ (v << 1)) & 0xAAAAAAAA;
        return (v*5 & UINT32_C(0x88888888)) * UINT32_C(0x11111111) >> 31;
    }
#endif

以下是用于显示汇编代码的测试C代码：

uint8_t fastParity(uint32_t x) {
    return paritytree_parity32(x);
}

在x86_64架构上，Clang 17和GCC 13都能生成这个精彩的代码。请注意它如何利用x86的16位半寄存器来减少2条移动和移位指令。还请注意`setnp`，这是一个鲜为人知的x86宝石，它获取由xor操作影响的奇偶标志寄存器的值。没错！每个16位xor操作都会根据目标结果的奇偶性设置奇偶标志。请参阅https://c9x.me/x86/html/file_module_x86_id_288.html。

; Clang 17 and GCC 13 on x86_64
fastParity:
        mov     eax, edi
        shr     eax, 16
        xor     eax, edi
        xor     al, ah
        setnp   al
        ret

在x86_64上使用MSVC 2019编译，它会产生这个结果。请注意，popcnt至少需要2008年的英特尔Nehalem（第一代）处理器，而Windows 10至少需要2017年的英特尔Coffeelake（第八代）处理器，因此在不支持的硬件上运行不是一个问题。

; MSVC 2019 on x86_64
fastParity PROC
        popcnt  eax, ecx
        ret     0
fastParity ENDP

GCC在arm64上生成了这个：

; gcc 13 on arm64
fastParity:
        fmov    s0, w0
        cnt     v0.8b, v0.8b
        addv    b0, v0.8b
        fmov    w0, s0
        and     w0, w0, 1
        ret

在arm64上，Clang生成了以下内容：

; clang 17 on arm64
fastParity:                          // @fastParity
        eor     w8, w0, w0, lsr #16
        eor     w8, w8, w8, lsr #8
        eor     w8, w8, w8, lsr #4
        eor     w8, w8, w8, lsr #2
        eor     w8, w8, w8, lsr #1
        and     w0, w8, #0x1
        ret

MSVC在arm64上生成了以下内容：

; MSVC 2019 on arm64
|fastParity| PROC
        eor         w8,w0,w0,lsl #1
        and         w8,w8,#0xAAAAAAAA
        add         w8,w8,w8,lsl #2
        and         w9,w8,#0x88888888
        mov         w8,#0x11111111
        mul         w8,w9,w8
        lsr         w0,w8,#0x1F
        ret

        ENDP  ; |fastParity|

生成的WebAssembly是：

; WebAssembly
 (func $fastParity (; 1 ;) (param $0 i32) (result i32)
  (i32.shr_u
   (i32.mul
    (i32.and
     (i32.mul
      (i32.and
       (i32.xor
        (i32.shl
         (get_local $0)
         (i32.const 1)
        )
        (get_local $0)
       )
       (i32.const -1431655766)
      )
      (i32.const 5)
     )
     (i32.const -2004318072)
    )
    (i32.const 286331153)
   )
   (i32.const 31)
  )
 )

这个WebAssembly没有内存或浮点数访问，不需要额外的检查器代码，所以它应该编译成这个漂亮小巧的x86_64。

; Webassembly ran on a x86_64
fastParity:
        lea     eax, [rdi+rdi]
        xor     eax, edi
        and     eax, -1431655766
        lea     eax, [rax+rax*4]
        and     eax, -2004318072
        imul    eax, eax, 286331153
        shr     eax, 31
        ret

而且这是在arm64上的情况：

; Webassembly ran on an arm64
fastParity:
        eor     w9, w0, w0, lsl #1
        mov     w8, #286331153
        and     w9, w9, #0xaaaaaaaa
        add     w9, w9, w9, lsl #2
        and     w9, w9, #0x88888888
        mul     w8, w9, w8
        lsr     w0, w8, #31
        ret

为了比较，这里是Stanford bithacks页面上非常次优的C代码和生成的汇编代码。有3条额外的汇编指令，并且没有测试更快的硬件特定奇偶指令。

uint32_t stanfordParityDoNotUse(uint32_t v) {
    v ^= v >> 1;
    v ^= v >> 2;
    v = (v & 0x11111111U) * 0x11111111U;
    return (v >> 28) & 1;
}
stanfordParityDoNotUse:
        mov     edx, edi
        shr     edx
        xor     edx, edi
        mov     eax, edx
        shr     eax, 2
        xor     eax, edx
        and     eax, 286331153
        imul    eax, eax, 286331153
        shr     eax, 28
        and     eax, 1
        ret

- Jack G

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- The Archetypal Paul · Accepted Answer

5

查看并行计算奇偶性，了解一些有趣的技巧，可用于计算字、位等的奇偶性。

- The Archetypal Paul

谢谢 - 我已经尝试了页面上的提示。但是，“有条件地设置或清除位而不使用分支”特别比乘法解决方案慢。 - vls

这是一个仅包含链接的回答。 - Peter Mortensen

另一个答案使用了类似的方法。 - Peter Mortensen