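I will update this post shortly with how the CPU ends up translating this, but in the meantime: the difference you are looking at is far too small to care about.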
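Bytecode in Java does not tell you how fast (or slow) a method will execute: there are two JIT compilers, and once the method is hot enough they will make it look entirely different. Also, javac does very little optimization when it compiles the code; the real optimizations come from the JIT.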
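Since the two operators are interchangeable for boolean operands, any difference can only come from the generated machine code and not from semantics. A minimal sketch that checks this on all four input combinations; the class name XorVsNotEquals and its helper methods are made up for this illustration and are not part of the benchmark.

```java
public class XorVsNotEquals {
    // Hypothetical helper, named only for this illustration.
    static boolean viaXor(boolean p, boolean q) {
        return p ^ q;
    }

    // Hypothetical helper, named only for this illustration.
    static boolean viaNotEquals(boolean p, boolean q) {
        return p != q;
    }

    // Returns true when both operators agree on all four input combinations.
    static boolean agreeOnAllInputs() {
        boolean[] values = {false, true};
        for (boolean p : values) {
            for (boolean q : values) {
                if (viaXor(p, q) != viaNotEquals(p, q)) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("agree on all inputs: " + agreeOnAllInputs());
    }
}
```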
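I ran some tests with JMH, using only the C1 compiler, only C2, C2 replaced with GraalVM, or no JIT at all... (a lot of test code follows; you can skip it and just look at the results. This was done on jdk-12.) The code uses JMH, the de facto standard tool for micro-benchmarks in the Java world, which are notoriously error-prone when done by hand.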
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.BenchmarkMode;
import org.openjdk.jmh.annotations.Fork;
import org.openjdk.jmh.annotations.Measurement;
import org.openjdk.jmh.annotations.Mode;
import org.openjdk.jmh.annotations.OutputTimeUnit;
import org.openjdk.jmh.annotations.Warmup;
import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

// BooleanExecutionPlan is a JMH @State class that is not shown in this post;
// booleans() returns the array of booleans being compared.
@Warmup(iterations = 10)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Measurement(iterations = 2, time = 2, timeUnit = TimeUnit.SECONDS)
public class BooleanCompare {

    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
            .include(BooleanCompare.class.getName())
            .build();

        new Runner(opt).run();
    }

    // Default JVM settings (tiered compilation: C1 and C2).
    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(1)
    public boolean xor(BooleanExecutionPlan plan) {
        return plan.booleans()[0] ^ plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(1)
    public boolean plain(BooleanExecutionPlan plan) {
        return plan.booleans()[0] != plan.booleans()[1];
    }

    // Interpreter only, no JIT.
    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1, jvmArgsAppend = "-Xint")
    public boolean xorNoJIT(BooleanExecutionPlan plan) {
        return plan.booleans()[0] ^ plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1, jvmArgsAppend = "-Xint")
    public boolean plainNoJIT(BooleanExecutionPlan plan) {
        return plan.booleans()[0] != plan.booleans()[1];
    }

    // C2 only (tiered compilation disabled).
    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1, jvmArgsAppend = "-XX:-TieredCompilation")
    public boolean xorC2Only(BooleanExecutionPlan plan) {
        return plan.booleans()[0] ^ plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1, jvmArgsAppend = "-XX:-TieredCompilation")
    public boolean plainC2Only(BooleanExecutionPlan plan) {
        return plan.booleans()[0] != plan.booleans()[1];
    }

    // C1 only.
    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1, jvmArgsAppend = "-XX:TieredStopAtLevel=1")
    public boolean xorC1Only(BooleanExecutionPlan plan) {
        return plan.booleans()[0] ^ plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1, jvmArgsAppend = "-XX:TieredStopAtLevel=1")
    public boolean plainC1Only(BooleanExecutionPlan plan) {
        return plan.booleans()[0] != plan.booleans()[1];
    }

    // C2 replaced with the Graal compiler through JVMCI.
    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1,
        jvmArgsAppend = {
            "-XX:+UnlockExperimentalVMOptions",
            "-XX:+EagerJVMCI",
            "-Dgraal.ShowConfiguration=info",
            "-XX:+UseJVMCICompiler",
            "-XX:+EnableJVMCI"
        })
    public boolean xorGraalVM(BooleanExecutionPlan plan) {
        return plan.booleans()[0] ^ plan.booleans()[1];
    }

    @Benchmark
    @BenchmarkMode(Mode.AverageTime)
    @Fork(value = 1,
        jvmArgsAppend = {
            "-XX:+UnlockExperimentalVMOptions",
            "-XX:+EagerJVMCI",
            "-Dgraal.ShowConfiguration=info",
            "-XX:+UseJVMCICompiler",
            "-XX:+EnableJVMCI"
        })
    public boolean plainGraalVM(BooleanExecutionPlan plan) {
        return plan.booleans()[0] != plan.booleans()[1];
    }
}
And the results:
Benchmark                    Mode  Cnt    Score  Units
BooleanCompare.plain         avgt    2    3.125  ns/op
BooleanCompare.xor           avgt    2    2.976  ns/op
BooleanCompare.plainC1Only   avgt    2    3.400  ns/op
BooleanCompare.xorC1Only     avgt    2    3.379  ns/op
BooleanCompare.plainC2Only   avgt    2    2.583  ns/op
BooleanCompare.xorC2Only     avgt    2    2.685  ns/op
BooleanCompare.plainGraalVM  avgt    2    2.980  ns/op
BooleanCompare.xorGraalVM    avgt    2    3.868  ns/op
BooleanCompare.plainNoJIT    avgt    2  243.348  ns/op
BooleanCompare.xorNoJIT      avgt    2  201.342  ns/op
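I am not skilled enough to read assembly fluently, though I like to do it sometimes... Here are some interesting things. If we do:
C1 compiler only, with the != operator:
public static boolean compare(boolean left, boolean right) {
    return left != right;
}
we get: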
0x000000010d1b2bc7: push %rbp
0x000000010d1b2bc8: sub $0x30,%rsp
0x000000010d1b2bcc: cmp %edx,%esi
0x000000010d1b2bce: mov $0x0,%eax
0x000000010d1b2bd3: je 0x000000010d1b2bde
0x000000010d1b2bd9: mov $0x1,%eax
0x000000010d1b2bde: and $0x1,%eax
0x000000010d1b2be1: add $0x30,%rsp
0x000000010d1b2be5: pop %rbp
To me, this code is fairly obvious: put 0 into eax, compare edx and esi, and if they are not equal put 1 into eax. Return eax & 1.
C1 compiler with ^:
public static boolean compare(boolean left, boolean right) {
    return left ^ right;
}
# parm0: rsi = boolean
# parm1: rdx = boolean
# [sp+0x40] (sp of caller)
0x000000011326e5c0: mov %eax,-0x14000(%rsp)
0x000000011326e5c7: push %rbp
0x000000011326e5c8: sub $0x30,%rsp
0x000000011326e5cc: xor %rdx,%rsi
0x000000011326e5cf: and $0x1,%esi
0x000000011326e5d2: mov %rsi,%rax
0x000000011326e5d5: add $0x30,%rsp
0x000000011326e5d9: pop %rbp
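I really don't know why and $0x1,%esi is needed here; otherwise this is fairly simple too, I guess.
But if I enable the C2 compiler, things get a lot more interesting.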
/**
* run with java
* -XX:+UnlockDiagnosticVMOptions
* -XX:CICompilerCount=2
* -XX:-TieredCompilation
* "-XX:CompileCommand=print,com/so/BooleanCompare.compare"
* com.so.BooleanCompare
*/
public static boolean compare(boolean left, boolean right) {
    return left != right;
}
# parm0: rsi = boolean
# parm1: rdx = boolean
# [sp+0x20] (sp of caller)
0x000000011a2bbfa0: sub $0x18,%rsp
0x000000011a2bbfa7: mov %rbp,0x10(%rsp)
0x000000011a2bbfac: xor %r10d,%r10d
0x000000011a2bbfaf: mov $0x1,%eax
0x000000011a2bbfb4: cmp %edx,%esi
0x000000011a2bbfb6: cmove %r10d,%eax
0x000000011a2bbfba: add $0x10,%rsp
0x000000011a2bbfbe: pop %rbp
I don't even see the classic prologue push ebp; mov ebp, esp; sub esp, x. Instead there is something very unusual (at least to me):
sub $0x18,%rsp
mov %rbp,0x10(%rsp)
....
add $0x10,%rsp
pop %rbp
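My tentative reading, which is an interpretation and not something taken from HotSpot documentation: sub $0x18,%rsp followed by mov %rbp,0x10(%rsp) leaves the stack in the same state as push %rbp followed by sub $0x10,%rsp, which is why the add $0x10,%rsp and pop %rbp at the end still restore everything. The sketch below checks that arithmetic on a toy model; the Machine class and the concrete addresses are invented for the illustration.

```java
import java.util.HashMap;
import java.util.Map;

public class FrameSetupSketch {
    // Toy machine state: stack pointer, frame pointer, and sparse stack memory.
    static final class Machine {
        long rsp;
        long rbp;
        final Map<Long, Long> memory = new HashMap<>();

        Machine(long rsp, long rbp) {
            this.rsp = rsp;
            this.rbp = rbp;
        }

        void push(long value) {
            rsp -= 8;
            memory.put(rsp, value);
        }

        long pop() {
            long value = memory.get(rsp);
            rsp += 8;
            return value;
        }
    }

    // Classic form: push %rbp; sub $0x10,%rsp
    static Machine classicPrologue(long rsp, long rbp) {
        Machine m = new Machine(rsp, rbp);
        m.push(m.rbp);
        m.rsp -= 0x10;
        return m;
    }

    // Form seen in the C2 listing: sub $0x18,%rsp; mov %rbp,0x10(%rsp)
    static Machine c2Prologue(long rsp, long rbp) {
        Machine m = new Machine(rsp, rbp);
        m.rsp -= 0x18;
        m.memory.put(m.rsp + 0x10, m.rbp);
        return m;
    }

    // Epilogue from the listing: add $0x10,%rsp; pop %rbp
    static void epilogue(Machine m) {
        m.rsp += 0x10;
        m.rbp = m.pop();
    }

    static boolean sameState(Machine a, Machine b) {
        return a.rsp == b.rsp && a.rbp == b.rbp && a.memory.equals(b.memory);
    }

    public static void main(String[] args) {
        Machine classic = classicPrologue(0x1000, 0xABCD);
        Machine c2 = c2Prologue(0x1000, 0xABCD);
        System.out.println("same state after prologue: " + sameState(classic, c2));
        epilogue(c2);
        System.out.println("rsp restored: " + (c2.rsp == 0x1000));
    }
}
```

This only shows that the two forms are equivalent in effect; it says nothing about why C2 prefers one over the other.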
Hopefully someone more skilled than me can explain. Otherwise, it is like a better version of what C1 generated:
xor %r10d,%r10d // put zero into r10d
mov $0x1,%eax // put 1 into eax
cmp %edx,%esi // compare edx and esi
cmove %r10d,%eax // conditionally move the contents of r10d into eax
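As far as I know, cmp/cmove is better than cmp/je because it avoids a branch, and with it the risk of a branch misprediction. At least, that is what I have read...
XOR with the C2 compiler:
public static boolean compare(boolean left, boolean right) {
    return left ^ right;
}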
0x000000010e6c9a20: sub $0x18,%rsp
0x000000010e6c9a27: mov %rbp,0x10(%rsp)
0x000000010e6c9a2c: xor %edx,%esi
0x000000010e6c9a2e: mov %esi,%eax
0x000000010e6c9a30: and $0x1,%eax
0x000000010e6c9a33: add $0x10,%rsp
0x000000010e6c9a37: pop %rbp
It looks almost the same as what the C1 compiler generated.
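To make the four listings easier to compare, here is a rough emulation of each instruction sequence on plain int "registers". It is only a sketch: the class and method names are mine, the prologue and epilogue are left out, and the ternary merely stands in for cmove (Java gives no guarantee that it compiles to a branch-free instruction).

```java
public class ListingEmulation {
    // C1, "!=": mov $0x0,%eax; cmp %edx,%esi; je skip; mov $0x1,%eax; and $0x1,%eax
    static int c1NotEquals(int esi, int edx) {
        int eax = 0;
        if (esi != edx) { // je skips the next mov when the operands are equal
            eax = 1;
        }
        return eax & 1;
    }

    // C1, "^": xor %rdx,%rsi; and $0x1,%esi; mov %rsi,%rax
    static int c1Xor(int esi, int edx) {
        esi ^= edx;
        esi &= 1;
        return esi;
    }

    // C2, "!=": xor %r10d,%r10d; mov $0x1,%eax; cmp %edx,%esi; cmove %r10d,%eax
    static int c2NotEquals(int esi, int edx) {
        int r10d = 0;
        int eax = 1;
        eax = (esi == edx) ? r10d : eax; // stand-in for the conditional move
        return eax;
    }

    // C2, "^": xor %edx,%esi; mov %esi,%eax; and $0x1,%eax
    static int c2Xor(int esi, int edx) {
        esi ^= edx;
        int eax = esi;
        return eax & 1;
    }

    // Returns true when all four sequences match the Java-level result for 0/1 inputs.
    static boolean allAgree() {
        for (int left = 0; left <= 1; left++) {
            for (int right = 0; right <= 1; right++) {
                int expected = (left != right) ? 1 : 0;
                if (c1NotEquals(left, right) != expected
                        || c1Xor(left, right) != expected
                        || c2NotEquals(left, right) != expected
                        || c2Xor(left, right) != expected) {
                    return false;
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println("all sequences agree: " + allAgree());
    }
}
```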
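p != q suggests a comparison instruction, while p ^ q suggests the xor instruction. That is what shows up in the bytecode. If it is compiled further into machine code in this natural way, p ^ q would probably be faster when the result is used as a number or stored to memory, but slightly slower when it is used as a branch condition. - zch
... xor and its flags, can still hurt optimization in some cases, because it overwrites the register holding p (or q). - zch
... mov) is very cheap or free, thanks to register renaming. - Cody Gray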