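Apparently, modern processors can detect when you do something silly like moving a register to itself (mov %eax, %eax) and optimize it away. To verify this claim, I ran the following program: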
#include <stdio.h>
#include <time.h>

static inline void f1() {
    for (int i = 0; i < 100000000; i++)
        __asm__(
            "mov %eax, %eax;"
            "nop;"
        );
}

static inline void f2() {
    for (int i = 0; i < 100000000; i++)
        __asm__(
            "nop;"
        );
}

static inline void f3() {
    for (int i = 0; i < 100000000; i++)
        __asm__(
            "mov %ebx, %eax;"
            "nop;"
        );
}

int main() {
    int NRUNS = 10;
    clock_t t, t1, t2, t3;

    t1 = t2 = t3 = 0;
    for (int run = 0; run < NRUNS; run++) {
        t = clock(); f1(); t1 += clock()-t;
        t = clock(); f2(); t2 += clock()-t;
        t = clock(); f3(); t3 += clock()-t;
    }

    printf("f1() took %f cycles on avg\n", (float) t1/ (float) NRUNS);
    printf("f2() took %f cycles on avg\n", (float) t2/ (float) NRUNS);
    printf("f3() took %f cycles on avg\n", (float) t3/ (float) NRUNS);

    return 0;
}
This gave me:
f1() took 175587.093750 cycles on avg
f2() took 188313.906250 cycles on avg
f3() took 194654.296875 cycles on avg
As expected, f3() is the slowest. But surprisingly (at least to me), f1() is faster than f2(). Why is that?

Update: compiling with -falign-loops gives essentially the same result:

f1() took 164271.000000 cycles on avg
f2() took 173783.296875 cycles on avg
f3() took 177765.203125 cycles on avg
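Comments:

clock() does not count cycles; it counts in units of CLOCKS_PER_SEC, so printing "%f cycles" is misleading. Also, try not to cast to float; use double instead. - fuz

… mov eax, eax, but instead executed mov ebx, eax. - 500 - Internal Server Error

… -falign-loops. With optimization enabled, the code does not terminate because you overwrite eax. - fuz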