从64位数字中“隔离”特定的行/列/对角线

Question

从64位数字中“隔离”特定的行/列/对角线

c++performance64-bitbit-manipulationbitboard

34

好的，让我们考虑一个64位数字，它的二进制位组成了一个8x8的表格。

例如：

0 1 1 0 1 0 1 0 0 1 1 0 1 0 1 1
0 1 1 1 1 0 1 0 0 1 1 0 1 0 1 0
1 1 1 0 1 0 1 1 0 1 0 1 0 0 1 1
0 1 0 1 0 0 1 1 0 1 0 1 0 0 1 1
0 1 1 1 0 0 1 1 0 1 0 1 0 1 0 1
1 1 1 0 0 1 1 0 1 0 1 0 1 0 1 0
0 1 1 0 1 0 1 0 1 1 1 0 1 0 1 0
1 1 0 1 0 0 1 1 0 1 0 1 0 1 0 1

写成:

a b c d e f g h
----------------
0 1 1 0 1 0 1 0
0 1 1 0 1 0 1 1 
0 1 1 1 1 0 1 0 
0 1 1 0 1 0 1 0 
1 1 1 0 1 0 1 0 
0 1 1 0 1 0 1 0 
0 1 1 0 1 1 1 0 
0 1 1 0 1 0 1 0

现在，如果我们想要隔离列d（00100000）（或任何行/对角线），怎么办？

~~能做到吗？~~如果可以，如何做？

提示：

（a）我的主要目标-虽然最初没有提到-是原始速度。我正在寻找最快的算法，因为“检索”函数每秒执行数百万次。
（b）这是最接近我所指的：https://www.chessprogramming.org/Kindergarten_Bitboards

- Dr.Kameleon

5

你真的刚问“能做到吗”吗？你真的期望答案是“不行，这完全不可能”吗？ - Kerrek SB

1

@KerrekSB 不好意思，我其实是在问“怎么做”...哈哈。 - Dr.Kameleon

1

当然可以完成。你可以使用一些位移、与和或操作来实现。根据你想要实际做什么[或者这是一个学校练习吗？]，可能有不同类型的“聪明技巧”可以应用，但位移、与和或是你想要的东西。你想让我发布带有这个答案的帖子吗？ - Mats Petersson

1

@NPE 是的，没错。我的目标是完全不使用循环；我实际上正在寻找最快的方法。 - Dr.Kameleon

1

@Dr.Kameleon：循环和最快速度不一定是互相排斥的。请看我的回答。 - NPE

显示剩余4条评论

9个回答

7

好的，为了“解决”哪个更快/更慢等争论，我把所有代码放到一个程序中[我“希望”我已经正确地归功于每个代码片段的原作者]。

下面可以找到代码，以检查我在将其转换为函数时是否正确解释了代码。我确实运行了它，没有适当的输出，并检查了每个函数是否给出相同的结果[请注意，在某些情况下，顺序略有不同 - 因此，我进行了变化，以使我的代码运行另一种方式，只是为了看到它是否给出了“正确”的结果]。因此，废话不多说，以下是结果:

mats1 time in clocks per iteration 10.3457
mats2 time in clocks per iteration 10.4785
mats3 time in clocks per iteration 10.5538
viraptor time in clocks per iteration 6.24603
lemees time in clocks per iteration 14.4818
npe time in clocks per iteration 13.1455
alex time in clocks per iteration 24.8272

mats1 time in clocks per iteration 7.62338
mats2 time in clocks per iteration 7.36226
mats3 time in clocks per iteration 7.45361
viraptor time in clocks per iteration 2.09582
lemees time in clocks per iteration 9.43744
npe time in clocks per iteration 7.51016
alex time in clocks per iteration 19.3554

(来自viraptor的结果，基于core i5和clang++ 3.2)

mats1 time in clocks per iteration 12.956
mats2 time in clocks per iteration 13.4395
mats3 time in clocks per iteration 13.3178
viraptor time in clocks per iteration 2.12914
lemees time in clocks per iteration 13.9267
npe time in clocks per iteration 16.2102
alex time in clocks per iteration 13.8705

这是在3.4GHz的AMD Athlon2上的时钟周期 - 我没有现代的英特尔机器 - 如果有人希望在那上运行代码，我很想看看它的样子。我相当确定所有的东西都在缓存中良好运行 - 也许除了获取一些值以进行检查之外。

因此，赢家显然是viraptor，大约比“我的”代码快40%。Alex的代码没有任何跳转/分支，但似乎仍然比其他选择运行得慢。不确定为什么npe的结果比我的慢那么多 - 它几乎做了同样的事情（当从g ++的汇编器输出查看代码时，代码看起来非常相似）。

#include <iostream>
#include <fstream>
#include <cstdint>

using namespace std;

const int SIZE = 1000000;

uint64_t g_val[SIZE];

ofstream nulloutput;

static __inline__ unsigned long long rdtsc(void)
{
    unsigned hi, lo;
    __asm__ __volatile__ ("rdtsc" : "=a"(lo), "=d"(hi));
    return ( (unsigned long long)lo)|( ((unsigned long long)hi)<<32 );
}

#define BITA_TO_B(x, a, b) (((x) >> (a-b)) & (1 << b))

unsigned char get_col_mats1(uint64_t val, int col)
{
    return BITA_TO_B(val, 56+col, 7) |
    BITA_TO_B(val, 48+col, 6) |
    BITA_TO_B(val, 40+col, 5) |
    BITA_TO_B(val, 32+col, 4) |
    BITA_TO_B(val, 24+col, 3) |
    BITA_TO_B(val, 16+col, 2) |
    BITA_TO_B(val, 8+col, 1) |
    BITA_TO_B(val, 0+col, 0);
}

unsigned char get_col_mats2(uint64_t val, int col)
{
    return BITA_TO_B(val, 63-col, 7) |
    BITA_TO_B(val, 55-col, 6) |
    BITA_TO_B(val, 47-col, 5) |
    BITA_TO_B(val, 39-col, 4) |
    BITA_TO_B(val, 31-col, 3) |
    BITA_TO_B(val, 23-col, 2) |
    BITA_TO_B(val, 15-col, 1) |
    BITA_TO_B(val, 7-col, 0);
}


unsigned char get_col_viraptor(uint64_t board, int col) {
    const uint64_t column_mask = 0x8080808080808080ull;
    const uint64_t magic = 0x2040810204081ull ;
    uint64_t column = board & (column_mask >> col);
    column <<= col;
    column *= magic;
    return (column >> 56) & 0xff;
}


unsigned char get_col_alex(uint64_t bitboard, int col)
{
    unsigned char result;
    result |= (bitboard & (1ULL << 63-col)) ? 0x80 : 0;
    result |= (bitboard & (1ULL << 55-col)) ? 0x40 : 0;
    result |= (bitboard & (1ULL << 47-col)) ? 0x20 : 0;
    result |= (bitboard & (1ULL << 39-col)) ? 0x10 : 0;
    result |= (bitboard & (1ULL << 31-col)) ? 0x08 : 0;
    result |= (bitboard & (1ULL << 23-col)) ? 0x04 : 0;
    result |= (bitboard & (1ULL << 15-col)) ? 0x02 : 0;
    result |= (bitboard & (1ULL << 7-col))  ? 0x01 : 0;

    return result;
}

unsigned char get_col_lemees(uint64_t val, int column)
{
    int result = 0;
    int source_bitpos = 7 - column; // "point" to last entry in this column
    for (int target_bitpos = 0; target_bitpos < 8; ++target_bitpos)
    {
    bool bit = (val >> source_bitpos) & 1;  // "extract" bit
    result |= bit << target_bitpos;            // add bit if it was set
    source_bitpos += 8;                        // move one up in table
    }
    return result;
}

int get(uint64_t board, int row, int col) {
  return (board >> (row * 8 + col)) & 1;
}

uint8_t get_col_npe(uint64_t board, int col) {
  uint8_t ret = 0;
  for (int i = 0; i < 8; ++i) {
    ret = (ret << 1) + get(board, i, col);
  }
  return ret;
}



#define BITA_TO_B2(x, a, b) (((x) >> (a-b)) & (1 << b))

unsigned char get_col_mats3(uint64_t val, int col)
{
    return BITA_TO_B2(val, 63-col, 7) |
    BITA_TO_B2(val, 55-col, 6) |
    BITA_TO_B2(val, 47-col, 5) |
    BITA_TO_B2(val, 39-col, 4) |
    BITA_TO_B2(val, 31-col, 3) |
    BITA_TO_B2(val, 23-col, 2) |
    BITA_TO_B2(val, 15-col, 1) |
    BITA_TO_B2(val, 7-col, 0);
}

template<unsigned char (*f)(uint64_t val, int col)>
void runbench(const char *name)
{
    unsigned char col[8]  = {0};
    uint64_t long t = rdtsc();
    for(int j = 0; j < SIZE; j++)
    {
    uint64_t val = g_val[j];
    for(int i = 0; i < 8; i++)
    {
        col[i] += f(val, i);
    }
//  __asm__ __volatile__("":::"memory");
    }
    t = rdtsc() - t;
    for(int i = 0; i < 8; i++)
    {
    nulloutput<< "col " << i << " has bits " << hex << (int)col[i] << endl;
    }
    cout << name << " time in clocks per iteration " << dec << t / (8.0 * SIZE) << endl;
}

#define BM(name) void bench_##name() { runbench<get_col_##name>(#name); }

BM(mats1);
BM(mats2);
BM(mats3);
BM(viraptor);
BM(lemees);
BM(npe);
BM(alex);

struct function
{
    void (*func)(void);
    const char *name;
};


#define FUNC(f) { bench_##f, #f }

function funcs[] = 
{
    FUNC(mats1),
    FUNC(mats2),
    FUNC(mats3),
    FUNC(viraptor),
    FUNC(lemees),
    FUNC(npe),
    FUNC(alex),
}; 


int main()
{
    unsigned long long a, b;
    int i;
    int sum = 0;

    nulloutput.open("/dev/nul");
    for(i = 0; i < SIZE; i++)
    {
    g_val[i] = rand() + ((long)rand() << 32L);
    }
    unsigned char col[8];

    for(i = 0; i < sizeof(funcs)/sizeof(funcs[0]); i++)
    {
    funcs[i].func();
    }
}

- Mats Petersson

我本来想尝试使用clang运行它，但出于某些原因，clang不喜欢iostream并崩溃了。昨天还没这个问题！:( - Mats Petersson

1

感谢您收集这些解决方案。希望您不介意编辑 - 添加了来自Core i5 GCC和Clang的结果。 - viraptor

很好。有趣的是，clang并不像gcc那么好 - 大多数人似乎喜欢clang... - Mats Petersson

啊，我本来应该将我的实现模板化的... 我会在周三进行基准测试。 - Alex Chamberlain

2

使用简单的循环编写代码，并让优化器为您执行内联和循环展开。

使用4.7.2编译并使用-O3，在我的计算机上以下代码可以每秒执行约3亿个get_col()调用。

bitboard.cpp:

#include <cinttypes>
#include <iostream>

int get(uint64_t board, int row, int col) {
  return (board >> (row * 8 + col)) & 1;
}

uint8_t get_col(uint64_t board, int col) {
  uint8_t ret = 0;
  for (int i = 0; i < 8; ++i) {
    ret = (ret << 1) + get(board, i, col);
  }
  return ret;
}

extern uint64_t board;
extern int sum;

extern void f();

int main() {
  for (int i = 0; i < 40000000; ++i) {
    for (int j = 0; j < 8; ++j) {
      sum += get_col(board, j);
    }
    f();
  }
  std::cout << sum << std::endl;
}

bitboard_b.cpp:

#include <cinttypes>

uint64_t board = 0x1234567890ABCDEFull;
int sum = 0;

void f() {}

如果你查看get_col()的汇编代码，你会发现它不包含任何循环，可能已经是你能手工制作的最有效率的代码：

__Z7get_colyi:
LFB1248:
        movl    %esi, %ecx
        movq    %rdi, %rax
        movq    %rdi, %rdx
        shrq    %cl, %rax
        leal    8(%rsi), %ecx
        andl    $1, %eax
        shrq    %cl, %rdx
        leal    16(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %eax
        movq    %rdi, %rdx
        shrq    %cl, %rdx
        leal    24(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %eax
        movq    %rdi, %rdx
        shrq    %cl, %rdx
        leal    32(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %eax
        movq    %rdi, %rdx
        shrq    %cl, %rdx
        leal    40(%rsi), %ecx
        andl    $1, %edx
        leal    (%rdx,%rax,2), %edx
        movq    %rdi, %rax
        shrq    %cl, %rax
        leal    48(%rsi), %ecx
        andl    $1, %eax
        leal    (%rax,%rdx,2), %edx
        movq    %rdi, %rax
        shrq    %cl, %rax
        leal    56(%rsi), %ecx
        andl    $1, %eax
        leal    (%rax,%rdx,2), %eax
        shrq    %cl, %rdi
        andl    $1, %edi
        leal    (%rdi,%rax,2), %eax
        ret

这并不是一个完整的实现，只是对这个想法的粗略说明。特别地，位序可能与您预期的相反等。

- NPE

这个NOOP f()调用是什么意思？防止优化吗？ - leemes

@leemes：是的。否则，gcc 通过将内部循环的结果乘以 40000000 来优化掉 40000000 循环。 - NPE

“ret = (ret << 1) + ...” 和 “ret |= ... << i” 一样快吗？现在你已经做了一些小基准测试，能否请你进行比较？;) - leemes

@Dr.Kameleon：只是为了明确，您是说每秒300百万次调用速度不够快？如果是这样，您需要明确说明您正在寻求什么级别的性能。 - NPE

@AlexChamberlain：请展示给我们看并包含一些基准测试。该问题明确要求最高性能，任何未经过基准测试的都是纯粹的猜测。 - NPE

显示剩余2条评论

1

这里有一个解决方案，可以在一个周期内执行（如果值和掩码都在寄存器中），如果你愿意使用英特尔的PEXT指令的内置函数（如果你在做位棋盘相关的东西，你可能会用到）。

int get_col(uint64_t board) {
    return _pext_u64(board, 0x8080808080808080ull);
}

这是针对第0列的操作 - 如果您想要另一列，只需相应地移动掩码。当然，这是通过使用硬件特定指令来作弊，但位棋盘就是关于作弊。

- BeeOnRope

1

在您的情况下（专门针对8x8 = 64位表），您需要执行位移操作以提取特定的位并将它们重新排列到一个新的整数变量中，同时使用位移操作。

uint64_t matrix = ... // input
int column = 3; // "d"-column

int result = 0;
int source_bitpos = 7 - column; // "point" to last entry in this column
for (int target_bitpos = 0; target_bitpos < 8; ++target_bitpos)
{
    bool bit = (matrix >> source_bitpos) & 1;  // "extract" bit
    result |= bit << target_bitpos;            // add bit if it was set
    source_bitpos += 8;                        // move one up in table
}

看这里：

点击链接：http://ideone.com/UlWAAO

- leemes

将整个位掩码和移位不是更简单吗？ - juanchopanza

@juanchopanza 因为矩阵是按行主序给出的，而他询问的是列，所以你必须重新排列位。或者我理解错了吗？ - leemes

1

#define BITA_TO_B(x, a, b) (((x) >> (a)) & 1) << b;

unsigned long long x = 0x6A6B7A6AEA6E6A;

unsigned char col_d = BITA_TO_B(x, 60, 7) |
                      BITA_TO_B(x, 52, 6) |
                      BITA_TO_B(x, 44, 5) |
                      BITA_TO_B(x, 36, 4) |
                      BITA_TO_B(x, 28, 3) |
                      BITA_TO_B(x, 20, 2) |
                      BITA_TO_B(x, 12, 1) |
                      BITA_TO_B(x,  4, 0);

也许一个更加优化的想法：

#define BITA_TO_B(x, a, b) (((x) >> (a-b)) & (1 << b));

如果b是一个常量，这样做会稍微更快一些。

另一种方法可能是：

unsigned long xx = x & 0x10101010101010; 
col_d = (xx >> 53) | (xx >> 46) | (xx >> 39) ... (xx >> 4);

一次性使用 "and" 而不是多次使用可以加快速度。

- Mats Petersson

看起来我们有相似的想法...我决定使用常量而不是移动 x，因为我认为这样会稍微更快一些？你有什么想法？ - Alex Chamberlain

我本希望编译器能够解决那个问题。当然，我们可以通过优化来确保它真正将正确的位放在正确的位置上。即将编辑。 - Mats Petersson

完全可以不使用任何移位操作来实现。 - Alex Chamberlain

@NPE 我们不是在猜测，而是利用我们的经验设计最佳解决方案。此外...我目前没有访问编译器的权限。 - Alex Chamberlain

2

谁在给这个回复点踩？请说明原因！ - Mats Petersson

显示剩余3条评论

1

你可以转置这个数字，然后简单地选择相关的列，这些列现在是行，并且是连续的位，就像你想要的那样。

在我的测试中，它并没有比将8个单独选择的位OR在一起好多少，但是如果你打算选择多个列（因为转置是限制因素），它要好得多。

- harold

0

这个怎么样？

uint64_t bitboard = ...;
uint8_t result = 0;

result |= (bitboard & (1ULL << 60)) ? 0x80 : 0;
result |= (bitboard & (1ULL << 52)) ? 0x40 : 0;
result |= (bitboard & (1ULL << 44)) ? 0x20 : 0;
result |= (bitboard & (1ULL << 36)) ? 0x10 : 0;
result |= (bitboard & (1ULL << 28)) ? 0x08 : 0;
result |= (bitboard & (1ULL << 20)) ? 0x04 : 0;
result |= (bitboard & (1ULL << 12)) ? 0x02 : 0;
result |= (bitboard & (1ULL <<  4)) ? 0x01 : 0;

- Alex Chamberlain

0

这是来自Chess Programming Wiki的内容。它可以转置棋盘，然后隔离单个行变得微不足道。它还可以让您反向操作。

/**
 * Flip a bitboard about the antidiagonal a8-h1.
 * Square a1 is mapped to h8 and vice versa.
 * @param x any bitboard
 * @return bitboard x flipped about antidiagonal a8-h1
 */
U64 flipDiagA8H1(U64 x) {
   U64 t;
   const U64 k1 = C64(0xaa00aa00aa00aa00);
   const U64 k2 = C64(0xcccc0000cccc0000);
   const U64 k4 = C64(0xf0f0f0f00f0f0f0f);
   t  =       x ^ (x << 36) ;
   x ^= k4 & (t ^ (x >> 36));
   t  = k2 & (x ^ (x << 18));
   x ^=       t ^ (t >> 18) ;
   t  = k1 & (x ^ (x <<  9));
   x ^=       t ^ (t >>  9) ;
   return x;
}

- Thomas Ahle

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- viraptor · Accepted Answer

以下是仅需四个主要步骤的解决方案：

const uint64_t column_mask = 0x8080808080808080ull;
const uint64_t magic = 0x2040810204081ull;

int get_col(uint64_t board, int col) {
    uint64_t column = (board << col) & column_mask;
    column *= magic;
    return (column >> 56) & 0xff;
}

它的工作方式如下：

将板子移动以使列与左侧对齐
进行掩码操作，使其只包含所需的列（0..8）
将其乘以魔数，结果是所有原始位都被推到了左侧
最左边的字节向右移动

魔数被选择出来只复制需要的位，并让其余部分落在未使用的地方/溢出数字上。该过程看起来像这样（数字是“ID”位，而不是数字本身）：

original column: ...1.......2.......3.......4.......5.......6.......7.......8....
aligned column:  1.......2.......3.......4.......5.......6.......7.......8.......
multiplied:      123456782345678.345678..45678...5678....678.....78......8.......
shifted to right:........................................................12345678

如果你添加了const关键字，汇编实际上会变得非常不错：

get_col:
.LFB7:
        .cfi_startproc
        movl    %esi, %ecx
        movabsq $-9187201950435737472, %rax
        salq    %cl, %rdi
        andq    %rax, %rdi
        movabsq $567382630219905, %rax
        imulq   %rax, %rdi
        shrq    $56, %rdi
        movl    %edi, %eax
        ret

没有分支，没有外部数据，每个计算大约需要0.4纳秒。

编辑：使用NPE的解决方案作为基线（下一个最快的解决方案），计算时间约为原来的六分之一。