Find the nth SET bit in an integer

Question


algorithm function binary

31

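Instead of just the lowest set bit, I want to find the position of the nth lowest set bit. (I am NOT talking about the value at the nth bit position.)

For example, say I have:
0000 1101 1000 0100 1100 1000 1010 0000

and I want to find the 4th bit that is set. Then I want it to return:
0000 0000 0000 0000 0100 0000 0000 0000

It would also be reasonable for this function to return 0 if popcnt(v) < n, but any behavior for that case is acceptable.

If possible, I'm looking for something faster than a loop.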
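For reference, the loop baseline that the question wants to beat could look roughly like the sketch below. The function name `nth_set_bit_loop` is hypothetical, `n` is 1-based as in the example above, and returning 0 when fewer than `n` bits are set is just one of the behaviors the question allows.

```cpp
#include <cstdint>

// Returns a mask containing only the n-th lowest set bit of v (n is 1-based),
// or 0 if v has fewer than n set bits. Hypothetical helper for illustration.
inline std::uint32_t nth_set_bit_loop(std::uint32_t v, unsigned n) {
    if (n == 0) return 0;                 // there is no "0th" bit in 1-based counting
    for (unsigned k = 1; k < n && v != 0; ++k)
        v &= v - 1;                       // clear the lowest set bit, n - 1 times
    return v & (~v + 1u);                 // isolate the lowest remaining set bit
}
```

With the value from the example (0x0D84C8A0) and n = 4, this yields 0x00004000. The loop runs up to n - 1 times, which is the cost the answers try to avoid.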

- VoidStar

1

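Yes, you supply v and n at runtime. I can't think of any way to solve it without a loop either. Breaking the problem down is hard, but I'm not convinced that beating the loop is impossible. - VoidStar

6

The Bit Twiddling Hacks page offers a solution to the opposite problem. Scroll down to the section "Select the bit position (from the most-significant bit) with the given count (rank)". You should be able to rework it to count bits from the other direction. - Sergey Kalinichenko

1

@dasblinkenlight Nice! To be honest, though, I don't really understand how the algorithm works. Let's see if we can figure it out. - fuz

1

How about a popcount binary search, using masks to restrict the set of bits under consideration? - Nominal Animal

1

My first idea was to split the 32-bit number into four single-byte chunks, use a 256-byte lookup table to quickly get the number of set bits in each byte, and then use those intermediate results to locate the desired position. So I think the idea in Bit Twiddling Hacks is (more or less) the same, but without the lookup table and branchless. - vgru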


15 Answers


- markt1964 · Answer 1

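This builds on the answer given by Jukka Suomela, which uses a machine-specific instruction that may not be available everywhere. It is also possible to write a function that does exactly the same thing as _pdep_u64 without any machine dependency. It has to loop over the set bits of one of its arguments, but it can still be written as a C++11 constexpr function. Here n is 0-based.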

constexpr inline uint64_t deposit_bits(uint64_t x, uint64_t mask, uint64_t b, uint64_t res) {
    return mask != 0 ? deposit_bits(x, mask & (mask - 1), b << 1, ((x & b) ? (res | (mask & (-mask))) : res)) : res;
}

constexpr inline uint64_t nthset(uint64_t x, unsigned n)  {
    return deposit_bits(1ULL << n, x, 1, 0);
}
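The recursion above can be hard to read, so here is a loop-based sketch of the same bit-deposit idea. The names `deposit_bits_loop` and `nthset_loop` are hypothetical, and the `n < 64` guard is an added assumption to avoid an out-of-range shift; it is not part of the answer.

```cpp
#include <cstdint>

// Software bit deposit (what PDEP does): bit k of src is copied to the
// position of the k-th lowest set bit of mask.
inline std::uint64_t deposit_bits_loop(std::uint64_t src, std::uint64_t mask) {
    std::uint64_t result = 0;
    for (std::uint64_t bit = 1; mask != 0; bit <<= 1) {
        const std::uint64_t lowest = mask & (~mask + 1);  // lowest set bit of mask
        if (src & bit) result |= lowest;
        mask &= mask - 1;                                 // drop that bit from mask
    }
    return result;
}

// Isolates the n-th (0-based) set bit of x; returns 0 if x has too few set bits.
inline std::uint64_t nthset_loop(std::uint64_t x, unsigned n) {
    return n < 64 ? deposit_bits_loop(std::uint64_t{1} << n, x) : 0;
}
```

For the value in the question, `nthset_loop(0x0D84C8A0, 3)` gives 0x4000: depositing a single source bit at index n into x selects exactly the n-th set bit of x.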

- Serge Rogatch · Answer 2

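The PDEP solution is great, but some languages such as Java do not yet expose this intrinsic, even though they are efficient at other low-level operations. So I came up with the following fallback for such cases: a branchless binary search.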

// n must be using 0-based indexing.
// This method produces correct results only if n is smaller
// than the number of set bits.
public static int getNthSetBit(long mask64, int n) {
    // Binary search without branching
    int base = 0;
    final int low32 = (int) mask64;
    final int high32n = n - Integer.bitCount(low32);
    final int inLow32 = high32n >>> 31;
    final int inHigh32 = inLow32 ^ 1;
    final int shift32 = inHigh32 << 5;
    final int mask32 = (int) (mask64 >>> shift32);
    n = ((-inLow32) & n) | ((-inHigh32) & high32n);
    base += shift32;

    final int low16 = mask32 & 0xffff;
    final int high16n = n - Integer.bitCount(low16);
    final int inLow16 = high16n >>> 31;
    final int inHigh16 = inLow16 ^ 1;
    final int shift16 = inHigh16 << 4;
    final int mask16 = (mask32 >>> shift16) & 0xffff;
    n = ((-inLow16) & n) | ((-inHigh16) & high16n);
    base += shift16;

    final int low8 = mask16 & 0xff;
    final int high8n = n - Integer.bitCount(low8);
    final int inLow8 = high8n >>> 31;
    final int inHigh8 = inLow8 ^ 1;
    final int shift8 = inHigh8 << 3;
    final int mask8 = (mask16 >>> shift8) & 0xff;
    n = ((-inLow8) & n) | ((-inHigh8) & high8n);
    base += shift8;

    final int low4 = mask8 & 0xf;
    final int high4n = n - Integer.bitCount(low4);
    final int inLow4 = high4n >>> 31;
    final int inHigh4 = inLow4 ^ 1;
    final int shift4 = inHigh4 << 2;
    final int mask4 = (mask8 >>> shift4) & 0xf;
    n = ((-inLow4) & n) | ((-inHigh4) & high4n);
    base += shift4;

    final int low2 = mask4 & 3;
    final int high2n = n - (low2 >> 1) - (low2 & 1);
    final int inLow2 = high2n >>> 31;
    final int inHigh2 = inLow2 ^ 1;
    final int shift2 = inHigh2 << 1;
    final int mask2 = (mask4 >>> shift2) & 3;
    n = ((-inLow2) & n) | ((-inHigh2) & high2n);
    base += shift2;

    // For the 2 bits remaining, we can take a shortcut
    return base + (n | ((mask2 ^ 1) & 1));
}
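Each block of the Java code is the same step at a different width: count the set bits in the lower half, decide whether the target lies in the upper half, then adjust n and the base accordingly. A compact sketch of that step in loop form follows, written in C++ rather than Java. The names `popcount64` and `nth_set_bit_index` are hypothetical, and the portable popcount stands in for a hardware instruction.

```cpp
#include <cstdint>

// Portable population count (the usual SWAR reduction).
inline unsigned popcount64(std::uint64_t v) {
    v = v - ((v >> 1) & 0x5555555555555555ULL);
    v = (v & 0x3333333333333333ULL) + ((v >> 2) & 0x3333333333333333ULL);
    v = (v + (v >> 4)) & 0x0F0F0F0F0F0F0F0FULL;
    return static_cast<unsigned>((v * 0x0101010101010101ULL) >> 56);
}

// Index (0-based, from the least significant bit) of the n-th set bit, n 0-based.
// Like the Java version, this is only correct when n < popcount(mask).
inline unsigned nth_set_bit_index(std::uint64_t mask, unsigned n) {
    unsigned base = 0;
    for (unsigned width = 32; width != 0; width >>= 1) {
        const std::uint64_t low = mask & ((std::uint64_t{1} << width) - 1);
        const unsigned cnt = popcount64(low);
        const unsigned in_high = n >= cnt;        // 1 if the target is in the upper half
        const unsigned select = 0u - in_high;     // all ones or all zeros
        n -= cnt & select;                        // skip the bits of the lower half
        mask >>= width & select;                  // move the upper half down
        base += width & select;
    }
    return base;
}
```

The loop has a fixed trip count of six, so a compiler can unroll it into straight-line code similar to the hand-unrolled version above.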

- fuz · Answer 3

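Based on a method by Juha Järvi published in the famous Bit Twiddling Hacks, I tested this implementation, where n and i are used as in the question (n is 0-based here, and the result is the bit position):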

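#include <stdint.h>

/* The signature and declarations are reconstructed; the name
   nth_set_bit_pos is an assumption. Returns the position of the
   n-th (0-based) set bit of i. */
unsigned nth_set_bit_pos(uint32_t i, unsigned n)
{
    uint32_t a, b, c, r, s, t;

    /* a, b, c: set-bit counts per 2-bit, 4-bit and 8-bit field */
    a = i - (i >> 1 & 0x55555555);
    b = (a & 0x33333333) + (a >> 2 & 0x33333333);
    c = b + (b >> 4) & 0x0f0f0f0f;

    r = n + 1;
    s = 0;
    t = c + (c >> 8) & 0xff;   /* count in the low 16 bits */

    if (r > t) {
        s += 16;
        r -= t;
    }

    t = c >> s & 0xf;

    if (r > t) {
        s += 8;
        r -= t;
    }

    t = b >> s & 0x7;

    if (r > t) {
        s += 4;
        r -= t;
    }

    t = a >> s & 0x3;

    if (r > t) {
        s += 2;
        r -= t;
    }

    t = i >> s & 0x1;

    if (r > t)
        s++;

    return (s);
}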

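According to my tests, this is about as fast as the loop on x86, while it is about 20% faster on arm64. It is probably faster still on arm thanks to the fast conditional instructions, but I can't test that right now.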

- paperclip optimizer · Answer 4

This may be the solution you are looking for, assuming you can't use the BMI extensions.

uint64_t nth_set_fast (uint64_t m, int n) {

    // count set bits in every block of 7
    uint64_t pc = (m &~0xAA54A952A54A952A) + ((m &0xAA54A952A54A952A)>>1);
             pc = (pc&~0xCC993264C993264C) + ((pc&0xCC993264C993264C)>>2);
             pc = (pc&~0xF0E1C3870E1C3870) + ((pc&0xF0E1C3870E1C3870)>>4);

    // prefix scan partial sums
    pc *= 0x0102040810204081<<7;

    // copy n to all blocks
    uint64_t nn = uint64_t(n)* 0x0102040810204081;

    // subtract nn-pc for each block without carry
    uint64_t ss = nn + (~pc & ~(0x8102040810204081>>1)) + 0x8102040810204081;

    // find correct block
    uint64_t cc= ss & ~(ss>>7) & (0x8102040810204081>>1); cc>>=6;

    // block mask
    uint64_t bb = (cc<<8) -cc; 

    m &= bb; // zero all other blocks

    // xor-prefix scan; select odd/even depending on remainder bit
    uint64_t m0 = clmul(m ,0xFF) & m ; m0 ^=  m  & ( -(ss&cc));
    uint64_t m1 = clmul(m0,0xFF) & m0; m1 ^=  m0 & ( -((ss>>1)&cc));
    uint64_t m2 = clmul(m1,0xFF) & m1; m2 ^=  m1 & ( -((ss>>2)&cc));
    uint64_t m3 = clmul(m2,0xFF) & m2; m3 ^=  m2 & ( -((ss>>3)&cc)); // last step needed because of leftover bit at index 63

    return m3 & bb;
}

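// carry-less multiplication; will compile to PCLMULQDQ on x86
// (needs to be declared before nth_set_fast; u64x2 is not defined in this answer
// and is assumed to be a 128-bit vector type with two 64-bit lanes)
uint64_t clmul (uint64_t n, uint64_t m) {

    u64x2 a, b;
    a[0] = n;
    b[0] = m;
    auto r = __builtin_ia32_pclmulqdq128(a,b,0); // immediate 0: multiply the low 64-bit lanes of a and b
    return r[0];
}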

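No loops, no conditional branches, no tables, and no BMI instructions. With some work it can be optimized to run entirely in SIMD registers.

If you don't have access to a carry-less multiplication intrinsic, you can easily implement it yourself, since the second argument is fixed and only 8 bits long.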
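As a sketch of that last remark: a carry-less multiplication by the constant 0xFF is the XOR of the input shifted left by 0 through 7, which can be built in three doubling steps. The names `clmul_lo` and `clmul_by_0xff` are hypothetical; `clmul_lo` is a slow bit-by-bit reference included only so the shortcut can be checked against it. Both return the low 64 bits of the product, matching what `clmul` above returns.

```cpp
#include <cstdint>

// Reference carry-less multiply, low 64 bits only: XOR together a << k
// for every set bit k of b.
inline std::uint64_t clmul_lo(std::uint64_t a, std::uint64_t b) {
    std::uint64_t r = 0;
    for (int k = 0; k < 64; ++k)
        if ((b >> k) & 1) r ^= a << k;
    return r;
}

// Specialisation for b == 0xFF: XOR of x << 0 .. x << 7.
inline std::uint64_t clmul_by_0xff(std::uint64_t x) {
    x ^= x << 1;   // shifts {0, 1}
    x ^= x << 2;   // shifts {0, 1, 2, 3}
    x ^= x << 4;   // shifts {0, ..., 7}
    return x;
}
```

Assuming nothing else about `nth_set_fast` changes, each `clmul(v, 0xFF)` call there could be replaced by `clmul_by_0xff(v)`.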

- Simon Goater · Answer 5

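I was about to post a reasonably portable answer to the related question "Is there a non-iterative way to find the index of the Nth set bit?" when it was closed, so I thought I would post it here in case others find it interesting or useful. In a hot loop I found it to be about 25% faster on average than the iterative bit-shifting algorithm shown in the linked question, provided it is compiled with -msse4. It uses a lookup table, so the old arguments about hot caches apply here too.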

uint64_t pos_of_nth_bit2(uint64_t X, uint64_t bit) {
  // Requires that __builtin_popcountll(X) > bit.
  int32_t testx, pos, pop;
  int8_t lut[4][16] = {{0,0,1,0,2,0,1,0,3,0,1,0,2,0,1,0},
                       {0,0,0,1,0,2,2,1,0,3,3,1,3,2,2,1},
                       {0,0,0,0,0,0,0,2,0,0,0,3,0,3,3,2},
                       {0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,3}};
  _Bool test;
  pos = 0;
  pop = __builtin_popcount(X & 0xffffffffUL);
  test = pop <= bit;
  bit -= test*pop;
  testx = test*32;
  X >>= testx;
  pos += testx;
  pop = __builtin_popcount(X & 0xffffUL);
  test = pop <= bit;
  bit -= test*pop;
  testx = test*16;
  X >>= testx;
  pos += testx;
  pop = __builtin_popcount(X & 0xffUL);
  test = pop <= bit;
  bit -= test*pop;
  testx = test*8;
  X >>= testx;
  pos += testx;
  pop = __builtin_popcount(X & 0xfUL);
  test = pop <= bit;
  bit -= test*pop;
  testx = test*4;
  X >>= testx;
  pos += testx;
  return pos + lut[bit][X & 0xf];
}
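The hand-written table above can be cross-checked by generating it. In the sketch below, `lut[k][v]` is the position of the k-th (0-based) set bit of the 4-bit value v, and entries for which v has too few set bits stay 0, as in the table above. The name `build_nibble_lut` is hypothetical.

```cpp
#include <array>
#include <cstdint>

using NibbleLut = std::array<std::array<std::int8_t, 16>, 4>;

// Builds the same 4 x 16 table as the literal in pos_of_nth_bit2.
inline NibbleLut build_nibble_lut() {
    NibbleLut lut{};                       // zero-initialised
    for (int v = 0; v < 16; ++v) {
        int k = 0;                         // how many set bits of v seen so far
        for (int pos = 0; pos < 4; ++pos) {
            if (v & (1 << pos)) {
                lut[k][v] = static_cast<std::int8_t>(pos);
                ++k;
            }
        }
    }
    return lut;
}
```

For example, `build_nibble_lut()[1][12]` is 3, because the second set bit of 1100 sits at position 3. That matches row 1, column 12 of the literal table.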