在一个位数组中找到最左边被设置的最高位

Question

在一个位数组中找到最左边被设置的最高位

cbit-manipulation32-bit

47

我有一个位数组的实现，其中第0个索引是数组中第一个字节的最高有效位，第8个索引是第二个字节的最高有效位，以此类推...

如何快速找到在这个位数组中设置的第一个位？我查阅了所有相关解决方案，但它们都只能找到第一个最不重要的位，而我需要第一个最重要的位。因此，对于0x00A1，我想要的结果是8（因为它是从左边数的第9个位）。

- Claudiu

1

假设最低有效位是第0位，那么在0x00a1中，不是第7位是设置的最高有效位吗？ - Michael Burr

你的位数组长度是否任意，还是适合于机器字？ - Michael Burr

你对位数组有什么接口？你可以对它执行哪些操作？ - jemfinch

@jemfinch：这是一个C字符数组。 - Claudiu

1

已经有另外一页提供了详细信息... https://dev59.com/DXRB5IYBdhLWcg3wSVcI - Josh

显示剩余2条评论

17个回答

35

tl:dr; 对于32位，请使用de Bruijn乘法。

这是"最快"的可移植算法。它比本主题中所有其他可移植的32位MSB算法都要快得多且更正确。

当输入为零时，de Bruijn算法也会返回正确的结果。当输入为零时，__builtin_clz和_BitScanReverse指令返回不正确的结果。

在Windows x86-64上，de Bruijn乘法的运行速度与等效（有缺陷的）Windows函数相当，性能差异仅约为3％。

以下是代码。

u32 msbDeBruijn32( u32 v )
{
    static const int MultiplyDeBruijnBitPosition[32] =
    {
        0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
        8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
    };

    v |= v >> 1; // first round down to one less than a power of 2
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;

    return MultiplyDeBruijnBitPosition[( u32 )( v * 0x07C4ACDDU ) >> 27];
}

这个帖子中的所有其他答案都比作者所说的运行得更差，或者没有正确计算结果，或者两者兼而有之。让我们对它们进行基准测试，并验证它们是否做到了它们所声称的。

这是一个简单的C++11测试工具，可用于测试所有这些实现。它在Visual Studio上编译干净，但应该适用于所有现代编译器。它允许您在性能模式（bVerifyResults = false）和检查模式（bVerifyResults = true）下运行基准测试。

以下是验证模式下的结果：

Verification failed for msbNative64: input was 0; output was 818af060; expected 0
Verification failed for msbFfs: input was 22df; output was 0; expected d
Verification failed for msbPerformanceJunkie32: input was 0; output was ffffffff; expected 0
Verification failed for msbNative32: input was 0; output was 9ab07060; expected 0

"性能狂热者"和微软本地实现在输入为零时执行不同的操作。msbPerformanceJunkie32产生-1，而Microsoft的_BitScanReverse产生一个随机数，与底层硬件指令一致。此外，msbPerformanceJunkie32实现产生的结果与所有其他答案相差一。以下是在我的i7-4600笔记本电脑上以性能模式运行的结果，编译为发布模式：

msbLoop64 took 2.56751 seconds               
msbNative64 took 0.222197 seconds            

msbLoop32 took 1.43456 seconds               
msbFfs took 0.525097 seconds                 
msbPerformanceJunkie32 took 1.07939 seconds  
msbDeBruijn32 took 0.224947 seconds          
msbNative32 took 0.218275 seconds

de Bruijn版本因为无分支而表现出色，相比其他实现更加高效，在产生均匀分布的输出时运行良好。由于现代CPU上分支预测的惩罚，所有其他版本对于任意输入都较慢。smbFfs函数会产生错误结果，因此可以忽略。

一些实现适用于32位输入，一些适用于64位输入。使用模板将帮助我们进行苹果与苹果的比较，无论输入大小如何。

以下是代码。如果您愿意，请下载并运行基准测试。

#include <iostream>
#include <chrono>
#include <random>
#include <cassert>
#include <string>
#include <limits>

#ifdef _MSC_VER
#define MICROSOFT_COMPILER 1
#include <intrin.h>
#endif // _MSC_VER

const int iterations = 100000000;
bool bVerifyResults = false;
std::random_device rd;
std::default_random_engine re(rd());
typedef unsigned int u32;
typedef unsigned long long u64;

class Timer
{
public:
    Timer() : beg_(clock_::now()) {}
    void reset() {
        beg_ = clock_::now();
    }
    double elapsed() const {
        return std::chrono::duration_cast<second_>
            (clock_::now() - beg_).count();
    }

private:
    typedef std::chrono::high_resolution_clock clock_;
    typedef std::chrono::duration<double, std::ratio<1> > second_;
    std::chrono::time_point<clock_> beg_;
};

unsigned int msbPerformanceJunkie32(u32 x)
{
    static const unsigned int bval[] =
    { 0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4 };
    unsigned int r = 0;
    if (x & 0xFFFF0000) {
        r += 16 / 1;
        x >>= 16 / 1;
    }
    if (x & 0x0000FF00) {
        r += 16 / 2;
        x >>= 16 / 2;
    }
    if (x & 0x000000F0) {
        r += 16 / 4;
        x >>= 16 / 4;
    }
    return r + bval[x];
}

#define FFS(t)  \
{ \
register int n = 0; \
if (!(0xffff & t)) \
n += 16; \
if (!((0xff << n) & t)) \
n += 8; \
if (!((0xf << n) & t)) \
n += 4; \
if (!((0x3 << n) & t)) \
n += 2; \
if (!((0x1 << n) & t)) \
n += 1; \
return n; \
}

unsigned int msbFfs32(u32 x)
{
    FFS(x);
}

unsigned int msbLoop32(u32 x)
{
    int r = 0;
    if (x < 1) return 0;
    while (x >>= 1) r++;
    return r;
}

unsigned int msbLoop64(u64 x)
{
    int r = 0;
    if (x < 1) return 0;
    while (x >>= 1) r++;
    return r;
}

u32 msbDeBruijn32(u32 v)
{
    static const int MultiplyDeBruijnBitPosition[32] =
    {
        0, 9, 1, 10, 13, 21, 2, 29, 11, 14, 16, 18, 22, 25, 3, 30,
        8, 12, 20, 28, 15, 17, 24, 7, 19, 27, 23, 6, 26, 5, 4, 31
    };

    v |= v >> 1; // first round down to one less than a power of 2
    v |= v >> 2;
    v |= v >> 4;
    v |= v >> 8;
    v |= v >> 16;

    return MultiplyDeBruijnBitPosition[(u32)(v * 0x07C4ACDDU) >> 27];
}

#ifdef MICROSOFT_COMPILER
u32 msbNative32(u32 val)
{
    unsigned long result;
    _BitScanReverse(&result, val);
    return result;
}
u32 msbNative64(u64 val)
{
    unsigned long result;
    _BitScanReverse64(&result, val);
    return result;
}
#endif // MICROSOFT_COMPILER

template <typename InputType>
void test(unsigned int msbFunc(InputType),
    const std::string &name,
    const std::vector< InputType > &inputs,
    std::vector< unsigned int > &results,
    bool bIsReference = false
)
{
    if (bIsReference)
    {
        int i = 0;
        for (int i = 0; i < iterations; i++)
            results[i] = msbFunc(inputs[i]);
    }
    InputType result;
    if (bVerifyResults)
    {
        bool bNotified = false;
        for (int i = 0; i < iterations; i++)
        {
            result = msbFunc(inputs[i]);
            if ((result != results[i]) && !bNotified)
            {
                std::cout << "Verification failed for " << name << ": "
                    << "input was " << std::hex << inputs[i]
                    << "; output was " << result
                    << "; expected " << results[i]
                    << std::endl;
                bNotified = true;
            }
        }
    }
    else
    {
        Timer t;
        for (int i = 0; i < iterations; i++)
        {
            result = msbFunc(inputs[i]);
        }
        double elapsed = t.elapsed();
        if ( !bIsReference )
            std::cout << name << " took " << elapsed << " seconds" << std::endl;
        if (result == -1.0f)
            std::cout << "this comparison only exists to keep the compiler from " <<
            "optimizing out the benchmark; this branch will never be called";
    }
}

void main()
{
    std::uniform_int_distribution <u64> dist64(0,
        std::numeric_limits< u64 >::max());
    std::uniform_int_distribution <u32> shift64(0, 63);
    std::vector< u64 > inputs64;
    for (int i = 0; i < iterations; i++)
    {
        inputs64.push_back(dist64(re) >> shift64(re));
    }
    std::vector< u32 > results64;
    results64.resize(iterations);

    test< u64 >(msbLoop64, "msbLoop64", inputs64, results64, true);
    test< u64 >(msbLoop64, "msbLoop64", inputs64, results64, false);
#ifdef MICROSOFT_COMPILER
    test< u64 >(msbNative64, "msbNative64", inputs64, results64, false);
#endif // MICROSOFT_COMPILER
    std::cout << std::endl;

    std::uniform_int_distribution <u32> dist32(0,
        std::numeric_limits< u32 >::max());
    std::uniform_int_distribution <u32> shift32(0, 31);
    std::vector< u32 > inputs32;
    for (int i = 0; i < iterations; i++)
        inputs32.push_back(dist32(re) >> shift32(re));
    std::vector< u32 > results32;
    results32.resize(iterations);


    test< u32 >(msbLoop32, "msbLoop32", inputs32, results32, true);

    test< u32 >(msbLoop32, "msbLoop32", inputs32, results32, false);
    test< u32 >(msbFfs32, "msbFfs", inputs32, results32, false);
    test< u32 >(msbPerformanceJunkie32, "msbPerformanceJunkie32",
        inputs32, results32, false);
    test< u32 >(msbDeBruijn32, "msbDeBruijn32", inputs32, results32, false);
#ifdef MICROSOFT_COMPILER
    test< u32 >(msbNative32, "msbNative32", inputs32, results32, false);
#endif // MICROSOFT_COMPILER
}

- johnwbyrd

1

干得不错，但是你目前正在计时msbLoop32完成的初始化工作，这意味着它看起来比实际慢两倍。 - j_random_hacker

2

感谢您的评论。我已更改代码，使引用比较不再进行基准测试，并且计时器现在启动和停止得更正确。基准测试略有变化，但高级结果仍然相同；更新的基准测试如上所示。请随意进一步改进答案。 - johnwbyrd

1

https://dev59.com/bWs05IYBdhLWcg3wPfcR - johnwbyrd

1

BeeOnRope：这个帖子里有太多的“躺在椅子上的”基准测试。展示你的代码吧。 - johnwbyrd

2

为什么输入零应该得到零的输出？位0没有被设置。当数字为零时询问最低有效位是没有意义的，因此如果方法在零时给出其他结果，那么这种方法并不是错误的。 - me'

显示剩余11条评论

19

作为一个性能追求者，我尝试了许多 MSB 设置的变体，以下是我找到的最快的一种。

unsigned int msb32(unsigned int x)
{
    static const unsigned int bval[] =
    {0,1,2,2,3,3,3,3,4,4,4,4,4,4,4,4};

    unsigned int r = 0;
    if (x & 0xFFFF0000) { r += 16/1; x >>= 16/1; }
    if (x & 0x0000FF00) { r += 16/2; x >>= 16/2; }
    if (x & 0x000000F0) { r += 16/4; x >>= 16/4; }
    return r + bval[x];
}

- Sir Slick

11

这段代码在随机分布的输入情况下比de Bruijn乘法慢四倍。此外，这个代码得出的结果与其他答案相差1；也就是说，msb(1) == 1，而其他定义中msb(1) == 0。 - johnwbyrd

11

这是StackOverflow和其他“最受欢迎的答案获胜”类型网站的一个缺陷之一。顶部答案总是普通人认为正确的答案。但是普通人并不总是正确的。人群智慧不能替代基准测试。 - johnwbyrd

13

有多种方法可以实现这个功能，不同的实现相对性能因计算机而异（我曾经针对类似的目的做过一定程度的基准测试）。在某些计算机上甚至有内置指令可用于此（如果有可用且可移植性可以处理，则应使用其中之一）。

请查看这里提供的一些实现（在“integer log base 2”下）。如果您正在使用GCC，请查看函数__builtin_clz和__builtin_clzl（分别为非零无符号整数和无符号长整型执行此操作）。 “clz”代表“count leading zeros”，这是描述同一问题的另一种方式。

当然，如果您的位数组不适合适当的机器字，您需要迭代数组中的单词以找到第一个非零单词，然后仅对该单词执行此计算。

- Arkku

+1，指出 __builtin_clz 和 __builtin_clzl 对于0的输入是未定义的（由 GCC文档支持）。 - j_random_hacker

5

查找BSR（Bit scan reverse）x86汇编指令是最快的方法。根据英特尔文档：搜索源操作数（第二个操作数）以查找最高有效位（1位）。如果找到最高有效的1位，则将其位索引存储在目标操作数（第一个操作数）中。

- ggiroux

4

http://graphics.stanford.edu/~seander/bithacks.html#IntegerLogObvious

- Martin Beckett

哈哈，我的答案中也包含了完全相同的URL，包括“#IntegerLogObvious”。 - Arkku

3

我曾经使用过多种方法来获取最高有效位，但是在32位和64位数字之间或者在x86_64和x86机器之间转换时通常会出现问题。函数__builtin_clz、__builtin_clzl和__builtin_clzll适用于32/64位数字以及跨越x86_64和x86机器。然而，需要三个函数。我找到了一种简单的依赖于右移的MSB，可以处理所有正数情况。至少对于我使用它的目的而言，它已经成功地解决了其他方法无法解决的问题:

int
getmsb (unsigned long long x)
{
    int r = 0;
    if (x < 1) return 0;
    while (x >>= 1) r++;
    return r;
}

将输入指定为unsigned long long，它可以处理从unsigned char到unsigned long long的所有数字类，并且根据标准定义，在x86_64和x86构建中兼容。对于0的情况定义为返回0，但可以根据需要更改。一个简单的测试和输出如下：

int
main (int argc, char *argv[]) {

    unsigned char c0 = 0;
    unsigned char c = 216;
    unsigned short s = 1021;
    unsigned int ui = 32768;
    unsigned long ul = 3297381253;
    unsigned long long ull = 323543844043;

    int i = 32767;

    printf ("  %16u  MSB : %d\n", c0, getmsb (c0));
    printf ("  %16u  MSB : %d\n", c, getmsb (c));
    printf ("  %16u  MSB : %d\n", s, getmsb (s));
    printf ("  %16u  MSB : %d\n", i, getmsb (i));
    printf ("  %16u  MSB : %d\n", ui, getmsb (ui));
    printf ("  %16lu  MSB : %d\n", ul, getmsb (ul));
    printf ("  %16llu  MSB : %d\n", ull, getmsb (ull));

    return 0;
}

输出：

             0  MSB : 0
           216  MSB : 7
          1021  MSB : 9
         32767  MSB : 14
         32768  MSB : 15
    3297381253  MSB : 31
  323543844043  MSB : 38

注意：出于速度考虑，使用一个围绕着__builtin_clzll 的函数来完成同样的事情仍然快了约6倍。

- David C. Rankin

2

x86 指令集有一个 BSR 指令，返回的是一个位索引（而不是它上面的前导零的数量）。

但不幸的是，并没有可移植的内置函数可以高效地为所有编译器暴露它。GNU C 提供了 `__builtin_clz`，但 `unsigned bitidx = 31 - __builtin_clz(x);` 在当前的 GCC 和 ICC 中无法优化回到只使用 BSR。（clang 可以，这证明了这个表达式是等价的，所以它“可以”）。

以下定义了BSR32()和BSR64()宏或函数，它们在x86上编译时可以高效地转换为一个bsr指令。如果输入为零，则会产生垃圾结果。使用intrinsic无法利用asm指令的行为，在输入为零时不修改目标。 可移植性到非x86架构需要一些额外的#ifdef，例如回退到31-__builtin_clz。大多数非x86 ISA（指令集体系结构）如果具有前导零位扫描，则计算前导零位而不是给出位索引。这就是为什么GNU C将__builtin_clz定义为可移植内置函数的原因。（如果目标系统没有HW支持，则该内置函数将编译为软件仿真，通常调用libgcc帮助函数。）

#include <stdint.h>

// define BSR32() and BSR64()
#if defined(_MSC_VER) || defined(__INTEL_COMPILER)
    #ifdef __INTEL_COMPILER
        typedef unsigned int bsr_idx_t;
    #else
        #include <intrin.h>   // MSVC
        typedef unsigned long bsr_idx_t;
    #endif

    static inline
    unsigned BSR32(unsigned long x){
        bsr_idx_t idx;
        _BitScanReverse(&idx, x); // ignore bool retval
        return idx;
    }
    static inline
    unsigned BSR64(uint64_t x) {
        bsr_idx_t idx;
        _BitScanReverse64(&idx, x); // ignore bool retval
        return idx;
    }
#elif defined(__GNUC__)

  #ifdef __clang__
    static inline unsigned BSR64(uint64_t x) {
        return 63-__builtin_clzll(x);
      // gcc/ICC can't optimize this back to just BSR, but clang can and doesn't provide alternate intrinsics
    }
  #else
    #define BSR64 __builtin_ia32_bsrdi
  #endif

    #include <x86intrin.h>
    #define BSR32(x) _bit_scan_reverse(x)

#endif

“bsf”可能不需要太多编译器的帮助，因为内置函数与汇编指令的行为相匹配，返回LSB的位索引，即末尾零的计数。

一个测试调用者unsigned test32(unsigned x) { return BSR32(x); }在所有主要的x86编译器中都可以内联为1个指令，在Godbolt编译器浏览器上。BSR64也以相同的方式内联，变成了一个64位操作数大小的版本。另请参见是否有一种x86 / x86_64指令可以将最高有效位以下的所有位清零？，以获取示例用途。

;; x64 MSVC 19.16 -O2
unsigned int test32(unsigned int) PROC                                    ; test32, COMDAT
        bsr     eax, ecx
        ret     0
unsigned int test32(unsigned int) ENDP                                    ; test32

# clang -O3 -march=haswell   is too "smart?" for its own good:
test32(unsigned int):
        lzcnt   eax, edi
        xor     eax, 31
        ret

# gcc8.2 -O3 -march=haswell
test32(unsigned int):
        bsr     eax, edi
        ret

# ICC19 -O3 -march=haswell
test32(unsigned int):
        bsr       eax, edi                                      #15.9
        ret                                                     #41.12

这样做的目的是避免在可移植版本（非MSVC）中出现缓慢的代码：

#ifdef __GNUC__
unsigned badgcc(uint64_t x) {
    return 63 - __builtin_clzll(x);
}
#endif

没有使用-march=haswell选项，我们只会从clang得到BSR，但是：

# gcc8.2 -O3
badgcc(unsigned long):
        bsr     rdi, rdi
        mov     eax, 63
        xor     rdi, 63
        sub     eax, edi
        ret

# ICC19.0.1 -O3
badgcc(unsigned long):
        mov       rax, -1                                       #46.17
        bsr       rdx, rdi                                      #46.17
        cmove     rdx, rax                                      #46.17
        neg       rdx                                           #46.17
        add       rdx, 63                                       #46.17
        neg       edx                                           #46.17
        add       edx, 63                                       #46.17
        mov       eax, edx                                      #46.17
        ret                                                     #46.17

这只是恶心。有趣的是，ICC正在执行CMOV以产生-1（如果输入为零）。BSR根据其输入设置ZF，而大多数指令根据结果设置标志，这很不同寻常。

使用-march = haswell（或启用BMI1指令），情况没有那么糟糕，但仍然不如只使用BSR好。除了输出依赖关系之外，编译器大多数都会避免使用lzcnt，但奇怪的是不会避免使用BSR。（其中输出依赖关系是一个真实的依赖关系，因为输入为0时的行为。）为什么打破LZCNT的“输出依赖性”很重要？

- Peter Cordes

更新：clang8.0似乎存在一个回归问题，在63-__builtin_clzll()的异或翻转优化方面没有得到优化。 - Peter Cordes

2

如果您使用x86，可以通过使用SSE2操作和查找第一个位指令（在gcc世界中发音为“ffs”表示最低位，“fls”表示最高位）来击败几乎任何按字节或按字解决方案。很抱歉我在回答中无法正确格式化“C”代码；请参阅： http://mischasan.wordpress.com/2011/11/03/sse2-bit-trick-ffsfls-for-xmm-registers/

- Mischa

1

不是最快的，但它能工作...

//// C program
#include <math.h>

#define POS_OF_HIGHESTBIT(a) /* 0th position is the Least-Signif-Bit */    \
((unsigned) log2(a))         /* thus: do not use if a <= 0 */  

#define NUM_OF_HIGHESTBIT(a) ((!(a))          \
        ? 0 /* no msb set*/                   \
        : (1 << POS_OF_HIGHESTBIT(a) ))
// could be changed and optimized, if it is known that the following NEVER holds: a <= 0



int main()
{
  unsigned a = 5; // 0b101
  unsigned b = NUM_OF_HIGHESTBIT(a); // 4 since 4 = 0b100
  return 0; 
}

- Jeff

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Andras Vass · Accepted Answer

51

GCC提供了__builtin_clz函数，它在x86/x64上转换为BSR，在ARM上转换为CLZ等，并在硬件不支持该指令时进行模拟实现。
Visual C++ 2005及以上版本提供了_BitScanReverse函数。

- Andras Vass

3

当参数为0时要注意未定义的行为。 - user2357112

是的。在这种情况下，“未定义行为”意味着“返回一个不确定的随机数”。 - johnwbyrd

1

@johnwbyrd 或者它可能会进入无限循环，扫描不存在的1。当编译器的规范/手册说“未定义行为”时，没有任何东西可以阻止编译器做任何事情。 - minmaxavg

4

@minmaxavg说，使用输入为0的__builtin_clz不是C/C++中的"未定义行为"。文档说"结果是未定义的"，而不是行为。了解GCC的工作原理以及在x86上存在该警告的原因，我确定他们并没有意味着UB。在x86上，对于输入为0，它是指令运行前目标寄存器中的任何值。(asm指令保持目标不变，Intel将其记录为未定义值) 详情请参见：VS：使用_BitScanReverse64内置函数的意外优化行为。 - Peter Cordes

1

phernost：令人惊讶的是，__builtin_clz(x) & -!!x 不能保证无分支。要理解原因，您必须了解LLVM和其他编译器如何为!运算符生成代码。!运算符的正式规范说明：“逻辑否定运算符!的结果为0，如果其操作数的值与0不相等，则为1，如果其操作数的值与0相等，则为1。结果具有int类型。表达式!E等效于（0==E）。”尝试在godbolt.org上输入-!!x的表达式以获取各种编译器的代码，您会发现生成的代码并非总是无分支的。 - johnwbyrd

显示剩余4条评论