For the functions above, I get exactly the same assembly output, using cmovne.
Sure, some compilers may make that optimization, but there is no guarantee. It is entirely possible that writing the function two different ways will give you different object code.
In fact, no optimization is guaranteed (although modern optimizing compilers do an impressive job most of the time), so you should either write the code that captures the semantic meaning you intend for it to have, or you should verify the generated object code and write the code to ensure you are getting the expected output.
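The C source under discussion is not shown in this excerpt. Judging from the assembly listings below, the two functions presumably looked something like the following; this is my reconstruction, and the exact spelling (and which name got which form) may differ in the original question. Both have identical semantics: store 15 to *x when c is 99, and 0 otherwise.

```c
#include <assert.h>

/* Reconstruction (hypothetical): an attempt to coax the compiler into
 * branchless code by expressing the selection as a single expression. */
void test0(char c, int* x) {
    *x = (c == 99) ? 15 : 0;
}

/* Reconstruction (hypothetical): the plain if/else idiom. */
void test1(char c, int* x) {
    if (c == 99)
        *x = 15;
    else
        *x = 0;
}
```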
Here is what older versions of MSVC targeting x86-32 would generate, largely because they don't know how to use the CMOV instruction:
test0 PROC
cmp BYTE PTR [c], 99
mov eax, DWORD PTR [x]
mov DWORD PTR [eax], 0
jne SHORT LN2
mov DWORD PTR [eax], 15
LN2:
ret 0
test0 ENDP
test1 PROC
mov eax, DWORD PTR [x]
xor ecx, ecx
cmp BYTE PTR [c], 99
setne cl
dec ecx
and ecx, 15
mov DWORD PTR [eax], ecx
ret 0
test1 ENDP
Note that test1 gives you branchless code, utilizing the SETNE instruction (a conditional set, which sets its operand to 0 or 1 based on a condition code, in this case NE), combined with a bit of bit manipulation to produce the correct value. test0 uses a conditional branch to skip the assignment of 15 to *x.
The interesting thing about this is that it is almost exactly the opposite of what you might expect. Naïvely, one might expect test0 to be the way you hold the optimizer's hand and get it to generate branchless code. At least, that was the first thought that occurred to me. But no, that is not the case! The optimizer is able to recognize the if/else idiom and optimize accordingly! It is not able to make the same optimization in the case of test0, where you tried to outsmart it.
However, when an extra variable is added... the assembly suddenly becomes different.
Nothing surprising about that. Small changes in the code can and often do have significant effects on the generated code. Optimizers are not magic; they are just very sophisticated pattern matchers. You changed the pattern!
Granted, an optimizing compiler could have used two conditional-move instructions here to generate branchless code. In fact, that is precisely what Clang 3.9 does for test3 (but not for test2, consistent with our analysis above that the optimizer is more likely to recognize the standard pattern than the unusual one). But GCC doesn't do this. Again, there is no guarantee of a particular optimization.
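The two-variable versions are also not shown in this excerpt. Based on the surrounding discussion and the test2a listing below, they presumably looked something like this (my reconstruction; the original question's exact shapes may differ): test3 is the plain if/else idiom, while test2 is the "preset" variant discussed further down.

```c
#include <assert.h>

/* Reconstruction (hypothetical): "preset" idiom, which stores the
 * defaults unconditionally and then conditionally overwrites them. */
void test2(char c, int* x, int* y) {
    *x = 0;
    *y = 0;
    if (c == 99) {
        *x = 15;
        *y = 21;
    }
}

/* Reconstruction (hypothetical): the plain if/else idiom. */
void test3(char c, int* x, int* y) {
    if (c == 99) {
        *x = 15;
        *y = 21;
    } else {
        *x = 0;
        *y = 0;
    }
}
```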
It looks like the only difference is whether the "mov" at the top is executed before or after the "je".
Now (and forgive me if my assembly is a bit rusty), isn't it always better to put the mov after the jump, so as to avoid a pipeline flush?
No, not really. It would not improve the code in this case. If the branch is mispredicted, you are going to get a pipeline flush either way. It doesn't much matter whether the mispredicted code is a ret instruction or a mov instruction.
The only reason a ret instruction immediately following a conditional branch would be a concern is if you were writing the assembly by hand and didn't know to use the rep ret trick, which is needed to avoid a branch-prediction penalty on certain AMD processors. You probably wouldn't know about this trick unless you were an assembly guru. But the compiler does know about it, and it also knows that it is unnecessary when you are specifically targeting an Intel processor, or a different generation of AMD processors, that does not have this quirk.
However, you may well be right that it would be better to have the mov after the branch, just not for the reason you propose. Modern processors (I believe this is Nehalem and later, but I'd check Agner Fog's optimization guides to verify) are capable of macro-op fusion under certain circumstances. Basically, macro-op fusion means that the CPU's decoder fuses two eligible instructions into a single micro-op, saving bandwidth at all stages of the pipeline. A cmp or test instruction followed immediately by a conditional branch instruction, as you see in test3, is eligible for macro-op fusion (there are actually other conditions that must be met, but this code does qualify). Scheduling other instructions in between the cmp and the je, as happens in test2, makes macro-op fusion impossible, potentially making the code execute more slowly.
Arguably, this is a missed optimization in the compiler. It could have reordered the mov instructions so that the je immediately followed the cmp, preserving the eligibility for macro-op fusion:
test2a(char, int*, int*):
mov DWORD PTR [rsi], 0 ; do the default initialization *first*
mov DWORD PTR [rdx], 0
cmp dil, 99 ; this is now followed immediately by the conditional
je .L10 ; branch, making macro-op fusion possible
rep ret
.L10:
mov DWORD PTR [rsi], 15
mov DWORD PTR [rdx], 21
ret
Another difference is the size of the object code for test2 and test3. test3's code is 4 bytes larger than test2's, because of the padding the optimizer emits to align the branch target. That difference is unlikely to matter, though, especially if this code is not executed within a tight loop where it is guaranteed to be hot in the cache. So, does that mean you should always write the code the way it is written in test2? Well, no, for several reasons:
- As we have seen, it might be a pessimization, since the optimizer may not recognize the pattern.
- You should write code for readability and semantic correctness first, only going back to optimize it when your profiler indicates that it is actually a bottleneck. And then, you should only optimize after inspecting and verifying the object code emitted by your compiler, otherwise you could end up with a pessimization. (The standard "trust your compiler until proven otherwise" advice.)
- Even though it may be optimal in certain very simple cases, the "preset" idiom is not generalizable. If your initialization is time-consuming, it may be faster to skip over it when possible. (There is one example discussed here, in the context of VB 6, where string-manipulation is so slow that eliding it when possible actually results in faster execution time than fancy branchless code. More generally, the same rationale would apply if you were able to branch around a function call.)
- Even here, where it appears to result in very simple and possibly more optimal code, it may actually be slower, because you are writing to memory twice in the case where c is equal to 99, and saving nothing in the case where c is not equal to 99.
You might save this cost by rewriting the code such that it accumulates the final value in a temporary register, only storing it to memory at the end, e.g.:
test2b(char, int*, int*):
xor eax, eax ; pre-zero the EAX register
xor ecx, ecx ; pre-zero the ECX register
cmp dil, 99
je Done
mov eax, 15 ; change the value in EAX if necessary
mov ecx, 21 ; change the value in ECX if necessary
Done:
mov DWORD PTR [rsi], eax ; store our final temp values to memory
mov DWORD PTR [rdx], ecx
ret
but this clobbers two additional registers (eax and ecx) and may not actually be faster. You'd have to benchmark it. Or trust the compiler to emit this code when it is actually optimal, such as when it has inlined a function like test2 within a tight loop.
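The C-level equivalent of that accumulate-then-store rewrite might look like the following sketch; whether the compiler actually emits the test2b assembly from it is, of course, not guaranteed.

```c
#include <assert.h>

/* Sketch: build the final values in locals, which the compiler can keep
 * in registers, and write each output pointer exactly once at the end. */
void test2b(char c, int* x, int* y) {
    int xv = 0;            /* accumulate the results in temporaries */
    int yv = 0;
    if (c == 99) {
        xv = 15;
        yv = 21;
    }
    *x = xv;               /* a single store per output */
    *y = yv;
}
```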
Even if you could guarantee that writing the code in a certain way would cause the compiler to emit branchless code, this would not necessarily be faster! While branches are slow when they are mispredicted, mispredictions are actually quite rare. Modern processors have extremely good branch prediction engines, achieving prediction accuracies of greater than 99% in most cases.
Conditional moves are great for avoiding branch mispredictions, but they have the important disadvantage of increasing the length of a dependency chain. By contrast, a correctly predicted branch breaks the dependency chain. (This is probably why GCC doesn't emit two CMOV instructions when you add the extra variable.) A conditional move is only a performance win if you expect branch prediction to fail. If you can count on a prediction success rate of ~75% or better, a conditional branch is probably faster, because it breaks the dependency chain and has a lower latency. And I would suspect that would be the case here, unless c
alternates rapidly back and forth between 99 and not-99 each time the function is called. (See Agner Fog's "Optimizing subroutines in assembly language", pp 70–71.)
*x can't alias y, only *y, and I agree that adding restrict wouldn't matter either in practice or in theory. - Peter Cordes