在char-by-char迭代方面,为什么j.l.String的性能差异显著?

6

我尝试了两种迭代java.lang.String中的字符,但发现它们很令人困惑。基准测试总结如下:

@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Fork(jvmArgsAppend = {"-Xms2g", "-Xmx2g"})
public class CharByCharIterationBenchmark {

  @Benchmark
  public void toCharArray(Data data, Blackhole b) {
    char[] chars = data.string.toCharArray();
    for (char ch : chars) {
      b.consume(ch);
    }
  }

  @Benchmark
  public void charAt(Data data, Blackhole b) {
    String string = data.string;
    int length = string.length();
    for (int i = 0; i < length; i++) {
      b.consume(string.charAt(i));
    }
  }

  @State(Scope.Thread)
  public static class Data {
    String string;

    @Param({"true", "false"})
    private boolean latin;

    @Param({"5", "10", "50", "100"})
    private int length;

    @Setup
    public void setup() {
      String alphabet = latin
        ? "abcdefghijklmnopqrstuvwxyz"        // English
        : "абвгдеёжзиклмнопрстуфхцчшщьыъэюя"; // Russian

      RandomStringGenerator generator = new RandomStringGenerator();

      string = generator.randomString(alphabet, length);
    }
  }

直观上看,toCharArray() 方法似乎不太有效,因为它会在 Java 8 中分配基础 char[] 的副本,并在 Java 9 及以上版本中将 byte[] 编码为 char[]。但实际上,情况恰恰相反:toCharArray() 的性能要快得多:

Java 8

                                 (latin)  (length)  Mode      Score     Error   Units
charAt                              true         5  avgt     21.051 ±   0.796   ns/op
charAt                              true        10  avgt     44.002 ±   2.324   ns/op
charAt                              true        50  avgt    221.068 ±   7.422   ns/op
charAt                              true       100  avgt    410.162 ±  13.441   ns/op

toCharArray                         true         5  avgt     16.819 ±   0.662   ns/op
toCharArray                         true        10  avgt     28.364 ±   0.663   ns/op
toCharArray                         true        50  avgt    110.910 ±   1.144   ns/op
toCharArray                         true       100  avgt    205.694 ±   1.669   ns/op

charAt:·gc.alloc.rate.norm          true         5  avgt     ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm          true        10  avgt     ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm          true        50  avgt     ≈ 10⁻⁴              B/op
charAt:·gc.alloc.rate.norm          true       100  avgt     ≈ 10⁻⁴              B/op

toCharArray:·gc.alloc.rate.norm     true         5  avgt     32.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     true        10  avgt     40.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     true        50  avgt    120.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     true       100  avgt    216.000 ±   0.001    B/op

charAt                              false        5  avgt     20.372 ±   0.406   ns/op
charAt                              false       10  avgt     39.962 ±   0.911   ns/op
charAt                              false       50  avgt    201.337 ±   3.752   ns/op
charAt                              false      100  avgt    410.530 ±  17.931   ns/op

toCharArray                         false        5  avgt     15.767 ±   0.606   ns/op
toCharArray                         false       10  avgt     26.258 ±   0.345   ns/op
toCharArray                         false       50  avgt    109.631 ±   1.064   ns/op
toCharArray                         false      100  avgt    205.815 ±   4.716   ns/op

charAt:·gc.alloc.rate.norm          false        5  avgt     ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm          false       10  avgt     ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm          false       50  avgt     ≈ 10⁻⁴              B/op
charAt:·gc.alloc.rate.norm          false      100  avgt     ≈ 10⁻⁴              B/op

toCharArray:·gc.alloc.rate.norm     false        5  avgt     32.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false       10  avgt     40.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false       50  avgt    120.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false      100  avgt    216.000 ±   0.001    B/op


Java 11


                                  (latin)  (length)  Mode     Score     Error   Units
charAt                               true         5  avgt    22.035 ±   1.557   ns/op
charAt                               true        10  avgt    41.800 ±   1.572   ns/op
charAt                               true        50  avgt   227.180 ±  18.655   ns/op
charAt                               true       100  avgt   474.719 ±  29.782   ns/op

toCharArray                          true         5  avgt    17.091 ±   0.662   ns/op
toCharArray                          true        10  avgt    26.167 ±   0.220   ns/op
toCharArray                          true        50  avgt   127.876 ±   2.106   ns/op
toCharArray                          true       100  avgt   244.449 ±   9.330   ns/op

charAt:·gc.alloc.rate.norm           true         5  avgt    ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm           true        10  avgt    ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm           true        50  avgt    ≈ 10⁻⁴              B/op
charAt:·gc.alloc.rate.norm           true       100  avgt    ≈ 10⁻⁴              B/op

toCharArray:·gc.alloc.rate.norm      true         5  avgt    32.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm      true        10  avgt    40.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm      true        50  avgt   120.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm      true       100  avgt   216.000 ±   0.001    B/op

charAt                              false         5  avgt    22.215 ±   2.064   ns/op
charAt                              false        10  avgt    45.606 ±   2.567   ns/op
charAt                              false        50  avgt   204.577 ±  18.302   ns/op
charAt                              false       100  avgt   404.056 ±  10.203   ns/op

toCharArray                         false         5  avgt    17.055 ±   0.556   ns/op
toCharArray                         false        10  avgt    29.254 ±   2.616   ns/op
toCharArray                         false        50  avgt   123.610 ±   5.033   ns/op
toCharArray                         false       100  avgt   226.174 ±   6.396   ns/op

charAt:·gc.alloc.rate.norm          false         5  avgt    ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm          false        10  avgt    ≈ 10⁻⁵              B/op
charAt:·gc.alloc.rate.norm          false        50  avgt    ≈ 10⁻⁴              B/op
charAt:·gc.alloc.rate.norm          false       100  avgt    ≈ 10⁻⁴              B/op

toCharArray:·gc.alloc.rate.norm     false         5  avgt    32.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false        10  avgt    40.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false        50  avgt   120.000 ±   0.001    B/op
toCharArray:·gc.alloc.rate.norm     false       100  avgt   216.000 ±   0.001    B/op

起初,我认为这里的原因与Nitsan Wakart在他的文章“Volatile read surprise”中描述的一样。然而,通过使用perfasm进行分析,我发现代码中最热门的部分与char[]/byte[]字段访问无关:

           ╭│     0x00007fa638407dd9: jmp    0x00007fa638407e4c
           ││     0x00007fa638407ddb: nopl   0x0(%rax,%rax,1)
  4.96%    ││  ↗  0x00007fa638407de0: shl    $0x3,%r11
  0.01%    ││  │  0x00007fa638407de4: movzwl 0x10(%r11,%r13,2),%edx  ;*invokevirtual charAt {reexecute=0 rethrow=0 return_oop=0}
           ││  │                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
           ││  │                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  3.58%    ││ ↗│  0x00007fa638407dea: mov    %rsi,0x18(%rsp)
  1.87%    ││ ││  0x00007fa638407def: mov    %r8d,0x14(%rsp)
  4.18%    ││ ││  0x00007fa638407df4: mov    %edi,0x10(%rsp)
  0.04%    ││ ││  0x00007fa638407df8: mov    %rbx,0x8(%rsp)
  1.29%    ││ ││  0x00007fa638407dfd: mov    %r10,(%rsp)
  1.83%    ││ ││  0x00007fa638407e01: mov    %r9,0x70(%rsp)
  4.32%    ││ ││  0x00007fa638407e06: mov    %rax,0x60(%rsp)
  0.05%    ││ ││  0x00007fa638407e0b: mov    %r9,%rsi
  1.27%    ││ ││  0x00007fa638407e0e: nop
  1.88%    ││ ││  0x00007fa638407e0f: callq  0x00007fa630926e00  ; ImmutableOopMap{[96]=Oop [104]=Oop [112]=Oop [120]=Oop [0]=Oop [16]=NarrowOop [24]=Oop }
           ││ ││                                                ;*invokevirtual consume {reexecute=0 rethrow=0 return_oop=0}
           ││ ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@28 (line 35)
           ││ ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
           ││ ││                                                ;   {optimized virtual_call}
  5.71%    ││ ││  0x00007fa638407e14: inc    %ebp               ;*iinc {reexecute=0 rethrow=0 return_oop=0}
           ││ ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@31 (line 34)
           ││ ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  0.05%    ││ ││  0x00007fa638407e16: cmp    0x14(%rsp),%ebp
  0.00%    │╰ ││  0x00007fa638407e1a: jge    0x00007fa638407d87  ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
           │  ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@18 (line 34)
           │  ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  3.05%    │  ││  0x00007fa638407e20: mov    0x10(%rsp),%edi
  4.24%    │  ││  0x00007fa638407e24: movsbl 0x14(%r12,%rdi,8),%ecx  ;*getfield coder {reexecute=0 rethrow=0 return_oop=0}
           │  ││                                                ; - java.lang.String::isLatin1@7 (line 3266)
           │  ││                                                ; - java.lang.String::charAt@1 (line 692)
           │  ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
           │  ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  0.86%    │  ││  0x00007fa638407e2a: mov    0xc(%r12,%rdi,8),%r11d  ;*getfield value {reexecute=0 rethrow=0 return_oop=0}
           │  ││                                                ; - java.lang.String::charAt@8 (line 693)
           │  ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
           │  ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  1.67%    │  ││  0x00007fa638407e2f: mov    0x60(%rsp),%rax
  1.71%    │  ││  0x00007fa638407e34: mov    0x70(%rsp),%r9
  3.92%    │  ││  0x00007fa638407e39: mov    (%rsp),%r10
  0.20%    │  ││  0x00007fa638407e3d: mov    0x8(%rsp),%rbx
  1.44%    │  ││  0x00007fa638407e42: mov    0x14(%rsp),%r8d
  1.70%    │  ││  0x00007fa638407e47: mov    0x18(%rsp),%rsi    ;*aload_2 {reexecute=0 rethrow=0 return_oop=0}
           │  ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@21 (line 35)
           │  ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  3.93%    ↘  ││  0x00007fa638407e4c: movslq %ebp,%r13          ;*invokestatic getChar {reexecute=0 rethrow=0 return_oop=0}
              ││                                                ; - java.lang.StringUTF16::charAt@7 (line 1268)
              ││                                                ; - java.lang.String::charAt@21 (line 695)
              ││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
              ││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  0.23%       ││  0x00007fa638407e4f: test   %ecx,%ecx
  0.00%      ╭││  0x00007fa638407e51: jne    0x00007fa638407e6b  ;*ifeq {reexecute=0 rethrow=0 return_oop=0}
             │││                                                ; - java.lang.String::charAt@4 (line 692)
             │││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
             │││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
             │││  0x00007fa638407e53: mov    0xc(%r12,%r11,8),%edx  ; implicit exception: dispatches to 0x00007fa638407fbc
             │││  0x00007fa638407e58: cmp    %edx,%ebp
             │││  0x00007fa638407e5a: jae    0x00007fa638407eb0
             │││  0x00007fa638407e5c: shl    $0x3,%r11
             │││  0x00007fa638407e60: movzbl 0x10(%r11,%r13,1),%edx  ;*iand {reexecute=0 rethrow=0 return_oop=0}
             │││                                                ; - java.lang.StringLatin1::charAt@25 (line 49)
             │││                                                ; - java.lang.String::charAt@12 (line 693)
             │││                                                ; - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35)
             │││                                                ; - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
             │╰│  0x00007fa638407e66: jmpq   0x00007fa638407dea
  1.52%      ↘ │  0x00007fa638407e6b: mov    0xc(%r12,%r11,8),%ecx  ; implicit exception: dispatches to 0x00007fa638407fb0
  5.99%        │  0x00007fa638407e70: sar    %ecx               ;*ishr {reexecute=0 rethrow=0 return_oop=0}; - java.lang.StringUTF16::length@3 (line 74); - java.lang.StringUTF16::checkIndex@2 (line 1470); - java.lang.StringUTF16::charAt@2 (line 1267); - java.lang.String::charAt@21 (line 695); - tsypanov.strings.character.CharByCharIterationBenchmark::charAt@25 (line 35); - tsypanov.strings.character.generated.CharByCharIterationBenchmark_charAt_jmhTest::charAt_avgt_jmhStub@19 (line 191)
  5.51%        │  0x00007fa638407e72: cmp    %ecx,%ebp
  0.01%        ╰  0x00007fa638407e74: jb     0x00007fa638407de0  ;*if_icmplt {reexecute=0 rethrow=0 return_oop=0}

看起来最热门的地方是iinc(循环索引增量)和ishr(算术位移)在StringUTF16.length()中,这对我来说相当反直觉。

此外,使用perfnorm分析器,我发现toCharArray()的循环次数、指令数量和加载缺失比charAt()少:

Benchmark                          Mode  Cnt     Score   Error  Units

charAt:L1-dcache-loads             avgt       2104.816           #/op
charAt:L1-dcache-stores            avgt       1200.878           #/op
charAt:branches                    avgt        603.754           #/op
charAt:cycles                      avgt       1461.282           #/op
charAt:dTLB-loads                  avgt       2105.253           #/op
charAt:dTLB-stores                 avgt       1200.909           #/op
charAt:instructions                avgt       4716.775           #/op

toCharArray:L1-dcache-loads        avgt       1026.341           #/op
toCharArray:L1-dcache-stores       avgt        416.997           #/op
toCharArray:branches               avgt        419.265           #/op
toCharArray:cycles                 avgt        820.521           #/op
toCharArray:dTLB-loads             avgt       1026.506           #/op
toCharArray:dTLB-stores            avgt        417.591           #/op
toCharArray:instructions           avgt       2409.806           #/op

有人可以帮忙解释这个并解释如此显著的差异吗?


直观地说,每次调用 charAt 时,它都会检查字符串是否以 UTF-16 或 Latin1 编码,并检查传递的索引是否在范围内。对于数组访问,边界检查可以优化为检查循环边界而不是检查每个索引。 - RealSkeptic
3
toCharArray没有进行标量替换。请注意结果中的gc.alloc.rate.norm - apangin
3
问题在于 BlackHole.consume 方法被放在了循环的内部。由于该方法是一个非内联的黑盒子方法,它阻止了在调用周围优化代码,特别是缓存字符串字段。 - apangin
顺便提一下,iincishr不是最热门的。性能通常将执行时间归因于以下指令(查找“perf event skid”)。 - apangin
请参见 https://dev59.com/57fna4cB1Zd3GeqPzdBn#59116238。 - apangin
显示剩余2条评论
1个回答

1

如@apangin在他的评论中提到的那样。

问题是BlackHole.consume在循环内部被调用。作为一个非内联黑盒方法,它阻止了在调用周围优化代码,特别是缓存String字段。


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接