这似乎是一个关于
8字节按位转置问题的概括。那个问题只涉及8x8的矩阵转置,所以你提出的问题有点不同。但是你的问题同样可以在书籍
Hacker's Delight的第7.3节中得到解答(你也许可以在
谷歌图书上查看相关页面)。在那里介绍的代码显然来自
Guy Steele。
Hacker's Delight网站仅包含书中8x8和32x32情况的源代码,但后者可以轻松地推广到64x64的情况:
#include <stdint.h>
void
transpose64(uint64_t a[64]) {
int j, k;
uint64_t m, t;
for (j = 32, m = 0x00000000FFFFFFFF; j; j >>= 1, m ^= m << j) {
for (k = 0; k < 64; k = ((k | j) + 1) & ~j) {
t = (a[k] ^ (a[k | j] >> j)) & m;
a[k] ^= t;
a[k | j] ^= (t << j);
}
}
}
这个函数的工作方式是交换逐渐变小的比特块,从32x32的块开始(不转置这些块内部的位),然后在这些32x32的块内部交换适当的16x16块等等。保存块大小的变量为j。因此,外循环使得j依次取32、16、8、4、2和1的值,这意味着外循环运行六次。内循环在您的位的一半行上运行,其中给定变量k中的某个位等于零。当j为32时,它们是0-31行,当j为16时,它们是0-15和32-47行,等等。内部循环部分总共运行了6*32=192次。在这个内部部分发生的事情是,掩码m决定应该交换哪些位,在t中计算这些位的异或,然后使用这个异或列表来适当地更新两个位置的位。
这本书(以及网站)还有一个版本的代码,其中这些循环都已经展开,并且掩码m没有被计算,而只是被赋值。我猜这取决于寄存器的数量和指令缓存的大小,是否会有所改进?
为了测试这个工作是否正常,假设我们定义了一些位模式,比如:
uint64_t logo[] = ;
我们接着调用
transpose32
函数并打印输出结果的比特模式:
#include <stdio.h>
void
printbits(uint64_t a[64]) {
int i, j;
for (i = 0; i < 64; i++) {
for (j = 63; j >= 0; j--)
printf("%c", (a[i] >> j) & 1 ? '1' : '0');
printf("\n");
}
}
int
main() {
transpose64(logo);
printbits(logo);
return 0;
}
然后,这将作为输出结果:
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000011111111111111111111111
0000000000000000000000000000000000000000011111111111111111111111
0000000000000000000000000000000000000000011111111111111111111111
0000000000000000000000000000000000000000011111111111111111111111
0000000000000000000000000000000000000000011111111111111111111111
0000000000000000000000000000000000000000011111111111111111111111
0000000000000000000000000000000000000000000000000000000000111111
0000000000000000000000000000000000000000000000000000000000111111
0000000000000000000000000000000000000000000000000000000000111111
0000000000000000000000000000000000000000000000000000000000111111
0000000000000000000000000000000000000000000000000000000000111111
0000000000000000000000000000000000000000000000000000000000111111
0000000000000000000000000000000000000011000000011111100000111111
0000000000000000000000000000000000111111000000011111100000111111
0000000000000000000000000000000000111111000000011111100000111111
0000000000000000000000000000000000111111000000011111100000111111
0000000000000000000000000100000000011111000000011111100000111111
0000000000000000000000011110000000011111100000011111100000111111
0000000000000000000001111110000000011111100000011111100000111111
0000000000000000000001111111000000011111100000011111100000111111
0000000000000000000000111111000000011111100000011111100000111111
0000000000000000000000111111100000001111110000011111100000111111
0000000000000000000000011111100000001111110000011111100000111111
0000000000000100000000011111110000001111110000011111100000111111
0000000000001110000000001111110000001111110000011111100000111111
0000000000011110000000001111111000001111110000011111100000111111
0000000001111111000000000111111000000111111000011111100000111111
0000000000111111100000000111111100000111111000011111100000111111
0000000000111111110000000011111100000111111000011111100000111111
0000000000011111111000000011111100000111111000011111100000111111
0000000000001111111100000001111110000011111000011111100000111111
0000000000000111111100000001111110000011111100011111100000111111
0000000000000011111110000000111111000011111100011111100000111111
0001000000000001111111000000111111000011111100011111100000111111
0011110000000001111111100000111111100011111100011111100000111111
0111111000000000111111110000011111100001111100011111100000111111
0111111110000000011111111000011111110001111110011111100000111111
1111111111000000001111111000001111110001111110011111100000111111
0011111111100000000111111100001111111001111110011111100000111111
0001111111111000000011111110000111111001111110011111100000111111
0000111111111100000011111111000111111100111100000000000000111111
0000001111111110000001111111100011111100000000000000000000111111
0000000111111111100000111111110011111000000000000000000000111111
0000000011111111110000011111110001100000000000000000000000111111
0000000000111111111000001111111000000000000000000000000000111111
0000000000011111111110000111111000000000000000000000000000111111
0000000000001111111111000111110000000000011111111111111111111111
0000000000000011111111100011100000000000011111111111111111111111
0000000000000001111111111001000000000000011111111111111111111111
0000000000000000111111111100000000000000011111111111111111111111
0000000000000000001111111100000000000000011111111111111111111111
0000000000000000000111111000000000000000011111111111111111111111
0000000000000000000011110000000000000000000000000000000000000000
0000000000000000000000100000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
0000000000000000000000000000000000000000000000000000000000000000
这个结果很不错,正如我们所希望的。
编辑:
实际上,这并不是你要求的非破坏性版本的代码。你可以通过将32x32块的第一个交换从x
到y
来获得这个版本。例如,你可以做以下操作:
void
non_destructive_transpose64(uint64_t x[64], uint64_t y[64]) {
int j, k;
uint64_t m, t;
for (k = 0; k < 64; k += 2) {
((uint32_t *) y)[k] = ((uint32_t *) x)[k ^ 64 + 1];
((uint32_t *) y)[k + 1] = ((uint32_t *) x)[k + 1];
}
for (; k < 128; k += 2) {
((uint32_t *) y)[k] = ((uint32_t *) x)[k];
((uint32_t *) y)[k + 1] = ((uint32_t *) x)[k ^ 64];
}
for (j = 16, m = 0x0000FFFF0000FFFF; j; j >>= 1, m ^= m << j) {
for (k = 0; k < 64; k = ((k | j) + 1) & ~j) {
t = (y[k] ^ (y[k | j] >> j)) & m;
y[k] ^= t;
y[k | j] ^= (t << j);
}
}
}
与其他版本不同,这个版本的代码不管架构的字节序如何都无法工作。此外,我知道C标准不允许您将
uint64_t
数组作为
uint32_t
数组访问。然而,如果你像这样做,第一次移动块的循环中不需要移位或异或操作,我会喜欢它。