Why does this test take longer without garbage collection overhead?

Question


java performance jvm garbage-collection garbage

13

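I ran into this while developing a lightweight asynchronous messaging library. To get a feel for the cost of creating lots of short-lived, medium-sized objects, I wrote the following test: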

import java.nio.ByteBuffer;
import java.util.Random;


public class MemPressureTest {
    static final int SIZE = 4096;
    static final class Bigish {
        final ByteBuffer b;


        public Bigish() {
            this(ByteBuffer.allocate(SIZE));
        }

        public Bigish(ByteBuffer b) {
            this.b = b;
        }

        public void fill(byte bt) {
            b.clear();
            for (int i = 0; i < SIZE; ++i) {
                b.put(bt);
            }
        }
    }


    public static void main(String[] args) {
        Random random = new Random(1);
        Bigish tmp = new Bigish();
        for (int i = 0; i < 3e7; ++i) {
            tmp.fill((byte)random.nextInt(255));
        }
    }
}

On my laptop, with the default GC settings, it takes about 95 seconds to run:

/tmp$ time java -Xlog:gc MemPressureTest
[0.006s][info][gc] Using G1

real    1m35.552s
user    1m33.658s
sys 0m0.428s

This is where things get strange. I tweaked the program to allocate a new object on every iteration:

...
        Random random = new Random(1);
        for (int i = 0; i < 3e7; ++i) {
            Bigish tmp = new Bigish();
            tmp.fill((byte)random.nextInt(255));
        }
...

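In theory, this should add some small overhead, but none of the objects should ever be promoted out of Eden. At best, I would expect the two runtimes to be very close. However, this version finishes in about 17 seconds: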

/tmp$ time java -Xlog:gc MemPressureTest
[0.007s][info][gc] Using G1
[0.090s][info][gc] GC(0) Pause Young (Normal) (G1 Evacuation Pause) 23M->1M(130M) 1.304ms
[0.181s][info][gc] GC(1) Pause Young (Normal) (G1 Evacuation Pause) 76M->1M(130M) 0.870ms
[0.247s][info][gc] GC(2) Pause Young (Normal) (G1 Evacuation Pause) 76M->0M(130M) 0.844ms
[0.317s][info][gc] GC(3) Pause Young (Normal) (G1 Evacuation Pause) 75M->0M(130M) 0.793ms
[0.381s][info][gc] GC(4) Pause Young (Normal) (G1 Evacuation Pause) 75M->0M(130M) 0.859ms
[lots of similar GC pauses, snipped for brevity]
[16.608s][info][gc] GC(482) Pause Young (Normal) (G1 Evacuation Pause) 254M->0M(425M) 0.765ms
[16.643s][info][gc] GC(483) Pause Young (Normal) (G1 Evacuation Pause) 254M->0M(425M) 0.580ms
[16.676s][info][gc] GC(484) Pause Young (Normal) (G1 Evacuation Pause) 254M->0M(425M) 0.841ms

real    0m16.766s
user    0m16.578s
sys 0m0.576s

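I ran both versions several times and got results nearly identical to the above. I feel like I must be missing something very obvious. Am I going insane? Is there anything that could explain this difference in performance?

=== EDIT ===

I rewrote the test using JMH, as apangin and dan1st suggested: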

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.infra.Blackhole;

import java.nio.ByteBuffer;
import java.util.Random;


public class MemPressureTest {
    static final int SIZE = 4096;

    @State(Scope.Benchmark)
    public static class Bigish {
        final ByteBuffer b;
        private Blackhole blackhole;


        @Setup(Level.Trial)
        public void setup(Blackhole blackhole) {
            this.blackhole = blackhole;
        }

        public Bigish() {
            this.b = ByteBuffer.allocate(SIZE);
        }

        public void fill(byte bt) {
            b.clear();
            for (int i = 0; i < SIZE; ++i) {
                b.put(bt);
            }
            blackhole.consume(b);
        }
    }

    static Random random = new Random(1);


    @Benchmark
    public static void test1(Blackhole blackhole) {
        Bigish tmp = new Bigish();
        tmp.setup(blackhole);
        tmp.fill((byte)random.nextInt(255));
        blackhole.consume(tmp);
    }

    @Benchmark
    public static void test2(Bigish perm) {
        perm.fill((byte) random.nextInt(255));
    }
}

However, the second test (test2, which reuses a single buffer) is still much slower:

> Task :jmh
# JMH version: 1.35
# VM version: JDK 18.0.1.1, OpenJDK 64-Bit Server VM, 18.0.1.1+2-6
# VM invoker: /Users/xxx/Library/Java/JavaVirtualMachines/openjdk-18.0.1.1/Contents/Home/bin/java
# VM options: -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/Users/xxx/Dev/MemTests/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.xxx.MemPressureTest.test1

# Run progress: 0.00% complete, ETA 00:16:40
# Fork: 1 of 5
# Warmup Iteration   1: 2183998.556 ops/s
# Warmup Iteration   2: 2281885.941 ops/s
# Warmup Iteration   3: 2239644.018 ops/s
# Warmup Iteration   4: 1608047.994 ops/s
# Warmup Iteration   5: 1992314.001 ops/s
Iteration   1: 2053657.571 ops/s
Iteration   2: 2054957.773 ops/s
Iteration   3: 2051595.233 ops/s
Iteration   4: 2054878.239 ops/s
Iteration   5: 2031111.214 ops/s

# Run progress: 10.00% complete, ETA 00:15:04
# Fork: 2 of 5
# Warmup Iteration   1: 2228594.345 ops/s
# Warmup Iteration   2: 2257983.355 ops/s
# Warmup Iteration   3: 2063130.244 ops/s
# Warmup Iteration   4: 1629084.669 ops/s
# Warmup Iteration   5: 2063018.496 ops/s
Iteration   1: 1939260.937 ops/s
Iteration   2: 1791414.018 ops/s
Iteration   3: 1914987.221 ops/s
Iteration   4: 1969484.898 ops/s
Iteration   5: 1891440.624 ops/s

# Run progress: 20.00% complete, ETA 00:13:23
# Fork: 3 of 5
# Warmup Iteration   1: 2228664.719 ops/s
# Warmup Iteration   2: 2263677.403 ops/s
# Warmup Iteration   3: 2237032.464 ops/s
# Warmup Iteration   4: 2040040.243 ops/s
# Warmup Iteration   5: 2038848.530 ops/s
Iteration   1: 2023934.952 ops/s
Iteration   2: 2041874.437 ops/s
Iteration   3: 2002858.770 ops/s
Iteration   4: 2039727.607 ops/s
Iteration   5: 2045827.670 ops/s

# Run progress: 30.00% complete, ETA 00:11:43
# Fork: 4 of 5
# Warmup Iteration   1: 2105430.688 ops/s
# Warmup Iteration   2: 2279387.762 ops/s
# Warmup Iteration   3: 2228346.691 ops/s
# Warmup Iteration   4: 1438607.183 ops/s
# Warmup Iteration   5: 2059319.745 ops/s
Iteration   1: 1112543.932 ops/s
Iteration   2: 1977077.976 ops/s
Iteration   3: 2040147.355 ops/s
Iteration   4: 1975766.032 ops/s
Iteration   5: 2003532.092 ops/s

# Run progress: 40.00% complete, ETA 00:10:02
# Fork: 5 of 5
# Warmup Iteration   1: 2240203.848 ops/s
# Warmup Iteration   2: 2245673.994 ops/s
# Warmup Iteration   3: 2096257.864 ops/s
# Warmup Iteration   4: 2046527.740 ops/s
# Warmup Iteration   5: 2050379.941 ops/s
Iteration   1: 2050691.989 ops/s
Iteration   2: 2057803.100 ops/s
Iteration   3: 2058634.766 ops/s
Iteration   4: 2060596.595 ops/s
Iteration   5: 2061282.107 ops/s


Result "com.xxx.MemPressureTest.test1":
  1972203.484 ±(99.9%) 142904.698 ops/s [Average]
  (min, avg, max) = (1112543.932, 1972203.484, 2061282.107), stdev = 190773.683
  CI (99.9%): [1829298.786, 2115108.182] (assumes normal distribution)


# JMH version: 1.35
# VM version: JDK 18.0.1.1, OpenJDK 64-Bit Server VM, 18.0.1.1+2-6
# VM invoker: /Users/xxx/Library/Java/JavaVirtualMachines/openjdk-18.0.1.1/Contents/Home/bin/java
# VM options: -Dfile.encoding=UTF-8 -Djava.io.tmpdir=/Users/xxx/Dev/MemTests/build/tmp/jmh -Duser.country=US -Duser.language=en -Duser.variant
# Blackhole mode: compiler (auto-detected, use -Djmh.blackhole.autoDetect=false to disable)
# Warmup: 5 iterations, 10 s each
# Measurement: 5 iterations, 10 s each
# Timeout: 10 min per iteration
# Threads: 1 thread, will synchronize iterations
# Benchmark mode: Throughput, ops/time
# Benchmark: com.xxx.MemPressureTest.test2

# Run progress: 50.00% complete, ETA 00:08:22
# Fork: 1 of 5
# Warmup Iteration   1: 282751.407 ops/s
# Warmup Iteration   2: 283333.984 ops/s
# Warmup Iteration   3: 293785.079 ops/s
# Warmup Iteration   4: 268403.105 ops/s
# Warmup Iteration   5: 280054.277 ops/s
Iteration   1: 279093.118 ops/s
Iteration   2: 282782.996 ops/s
Iteration   3: 282688.921 ops/s
Iteration   4: 291578.963 ops/s
Iteration   5: 294835.777 ops/s

# Run progress: 60.00% complete, ETA 00:06:41
# Fork: 2 of 5
# Warmup Iteration   1: 283735.550 ops/s
# Warmup Iteration   2: 283536.547 ops/s
# Warmup Iteration   3: 294403.173 ops/s
# Warmup Iteration   4: 284161.042 ops/s
# Warmup Iteration   5: 281719.077 ops/s
Iteration   1: 276838.416 ops/s
Iteration   2: 284063.117 ops/s
Iteration   3: 282361.985 ops/s
Iteration   4: 289125.092 ops/s
Iteration   5: 294236.625 ops/s

# Run progress: 70.00% complete, ETA 00:05:01
# Fork: 3 of 5
# Warmup Iteration   1: 284567.336 ops/s
# Warmup Iteration   2: 283548.713 ops/s
# Warmup Iteration   3: 294317.511 ops/s
# Warmup Iteration   4: 283501.873 ops/s
# Warmup Iteration   5: 283691.306 ops/s
Iteration   1: 283462.749 ops/s
Iteration   2: 284120.587 ops/s
Iteration   3: 264878.952 ops/s
Iteration   4: 292681.168 ops/s
Iteration   5: 295279.759 ops/s

# Run progress: 80.00% complete, ETA 00:03:20
# Fork: 4 of 5
# Warmup Iteration   1: 284823.519 ops/s
# Warmup Iteration   2: 283913.207 ops/s
# Warmup Iteration   3: 294401.483 ops/s
# Warmup Iteration   4: 283998.027 ops/s
# Warmup Iteration   5: 283987.408 ops/s
Iteration   1: 278014.618 ops/s
Iteration   2: 283431.491 ops/s
Iteration   3: 284465.945 ops/s
Iteration   4: 293202.934 ops/s
Iteration   5: 290059.807 ops/s

# Run progress: 90.00% complete, ETA 00:01:40
# Fork: 5 of 5
# Warmup Iteration   1: 285598.809 ops/s
# Warmup Iteration   2: 284434.916 ops/s
# Warmup Iteration   3: 294355.547 ops/s
# Warmup Iteration   4: 284307.860 ops/s
# Warmup Iteration   5: 284297.362 ops/s
Iteration   1: 283676.043 ops/s
Iteration   2: 283609.750 ops/s
Iteration   3: 284575.124 ops/s
Iteration   4: 293564.269 ops/s
Iteration   5: 216267.883 ops/s


Result "com.xxx.MemPressureTest.test2":
  282755.844 ±(99.9%) 11599.112 ops/s [Average]
  (min, avg, max) = (216267.883, 282755.844, 295279.759), stdev = 15484.483
  CI (99.9%): [271156.731, 294354.956] (assumes normal distribution)

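The JMH Blackhole should prevent code from being optimized away, and since JMH is now in charge of running the separate iterations, that should rule out parallelization, right? Shouldn't the Blackhole also stop the object from being confined to the stack? Also, wouldn't there be more variation between the warmup iterations if HotSpot were still doing a lot of optimization?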

- 735Tesla

5

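Every time people write a Java benchmark by hand, they end up measuring OSR compilation. I have demonstrated many times why this is wrong: 1, 2, 3, 4. Do not write microbenchmarks from scratch. Use JMH. - apangin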

2

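Thanks for the rewrite, now the benchmark makes sense. Replace "b.put(bt);" with "b.put(i, bt);" and the results become as expected. I'll write a detailed answer later. In short, the JIT is not able to optimize the update of ByteBuffer's internal state in the second case, because the creation of that ByteBuffer and its backing array is outside the compilation scope. - apangin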

1

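As a practical outcome, ByteBuffer.put with an absolute index is almost always preferable to the relative put from a performance perspective. - apangin

@apangin Thanks! Do you have any book recommendations on this topic? - 735Tesla

3 Answers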

2

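Disclaimer

The following is just a theory and may be completely wrong. I am neither a JIT expert nor a GC expert.

Code removal

I think the JIT simply optimized away (part of) your code. If that is the case, it detected that you never actually use the stored values and just removed the code that allocates/fills the object. Things like JMH's Blackhole might help with that.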
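To illustrate the idea: in a hand-written test, one common way to make the results "used" is to fold them into a value that is eventually printed. This is only a minimal sketch; the class name SinkDemo and the method fillAndChecksum are made up for this example, and it is not a substitute for JMH's Blackhole.

```java
import java.nio.ByteBuffer;

public class SinkDemo {
    static final int SIZE = 4096;

    // Fills the buffer with bt, then returns the sum of its bytes,
    // so the stores are observable through the return value.
    static long fillAndChecksum(ByteBuffer b, byte bt) {
        b.clear();
        for (int i = 0; i < SIZE; ++i) {
            b.put(bt);
        }
        long sum = 0;
        for (int i = 0; i < SIZE; ++i) {
            sum += b.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        ByteBuffer b = ByteBuffer.allocate(SIZE);
        long sink = 0;
        for (int i = 0; i < 1000; ++i) {
            sink += fillAndChecksum(b, (byte) (i & 0x7f));
        }
        // Printing the accumulated value keeps the loop's result live.
        System.out.println(sink);
    }
}
```

Note that the checksum pass adds work of its own, so this changes what is being measured; it only shows the "consume the result" pattern.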

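Parallelization

It could also be that it parallelized your code. Since the different iterations of the loop are independent of each other, multiple iterations could be executed in parallel.

Stack allocation

Another possibility is that it detected that the object is confined to the stack, has a very narrow scope, and is discarded immediately. Because of that, it might have moved your object onto the stack, where it can be allocated/pushed and deallocated/popped quickly.

Closing words

The JIT can always do unexpected things. Don't optimize prematurely and don't guess where your bottlenecks are. Measure your performance before making any change. Performance might not be lost where you expect it. This applies to other languages too, but especially to Java.

And, as apangin mentioned in the comments, you really should use JMH.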

- dan1st

1

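Your original question and the revised JMH version are actually slightly different.

In the revised version, as @apangin mentioned, the pointer perm, which is stored in a static field, prevents the code from being optimized.

In your original question, it is because you forgot to warm up. Here is a modified version: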

    public static void main(String[] args) {
        var t1 = System.currentTimeMillis();
        var warmup = Integer.parseInt(args[0]);
        for (int i = 0; i < warmup; i++) { test(1); }  // magic!!!
        test(1000000);
        var t2 = System.currentTimeMillis();
        System.out.println(t2 - t1);
    }
    
    private static void test(int n) {
        Random random = new Random(1);
        Bigish tmp = new Bigish();
        for (int i = 0; i < n; ++i) {
            tmp.fill((byte) random.nextInt(255));
        }
    }

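It takes an int argument, warmup, to help the JVM decide which methods should be inlined.

On my machine, running OpenJDK Runtime Environment Zulu17.28+13-CA (build 17+35-LTS) on Windows, the output is unpredictable when warmup is set to 8000. It usually takes about 2.7 seconds, but occasionally only 110 ms.

When warmup is set to 8500, it almost always finishes within 110~120 ms.

You can also run with the -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining options to see how the JVM inlines methods. If everything is fully inlined, you should see something like this: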

  @ 24   A$Bigish::<init> (11 bytes)   inline (hot)
    @ 4   java.nio.ByteBuffer::allocate (20 bytes)   inline (hot)
      @ 16   java.nio.HeapByteBuffer::<init> (21 bytes)   inline (hot)
        @ 10   java.nio.ByteBuffer::<init> (47 bytes)   inline (hot)
          @ 8   java.nio.Buffer::<init> (105 bytes)   inline (hot)
            @ 1   java.lang.Object::<init> (1 bytes)   inline (hot)
            @ 39   java.nio.ByteBuffer::limit (6 bytes)   inline (hot)
              @ 2   java.nio.ByteBuffer::limit (8 bytes)   inline (hot)
                @ 2   java.nio.Buffer::limit (65 bytes)   inline (hot)
            @ 45   java.nio.ByteBuffer::position (6 bytes)   inline (hot)
              @ 2   java.nio.ByteBuffer::position (8 bytes)   inline (hot)
                @ 2   java.nio.Buffer::position (52 bytes)   inline (hot)
          @ 17   java.nio.ByteOrder::nativeOrder (4 bytes)   inline (hot)
    @ 7   A$Bigish::<init> (10 bytes)   inline (hot)
      @ 1   java.lang.Object::<init> (1 bytes)   inline (hot)

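Note that only when the constructors of Bigish and ByteBuffer are fully inlined can the JVM conclude that the underlying buffer will never be visible to another thread. That makes it safe to vectorize the writes to the buffer, which ultimately gives better performance.

By the way, this is yet another case showing how tricky benchmarking is. Without digging into the details, it is hard to tell which part is the real performance bottleneck. Even JMH can be misleading.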

- yyyy


- apangin · Accepted Answer

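Creating a new ByteBuffer right before filling it does indeed help the JIT compiler produce better optimized code when you use the relative put method, and here is why.

The unit of JIT compilation is a method. The HotSpot JVM does not perform whole-program optimization, which is quite hard even in theory because of Java's dynamic nature and open-world runtime environment.
When the JVM compiles the test1 method, the buffer instantiation appears in the same compilation scope as the filling: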

Bigish tmp = new Bigish();
tmp.setup(blackhole);
tmp.fill((byte)random.nextInt(255));

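The JVM knows everything about the created buffer: its exact size and its backing array. It also knows that the buffer has not been published yet and that no other thread can see it. So the JVM can optimize the fill loop aggressively: it vectorizes the loop using AVX instructions and unrolls it to set 512 bytes at a time: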

  0x000001cdf60c9ae0:   mov    %r9d,%r8d
  0x000001cdf60c9ae3:   movslq %r8d,%r9
  0x000001cdf60c9ae6:   add    %r11,%r9
  0x000001cdf60c9ae9:   vmovdqu %ymm0,0x10(%rcx,%r9,1)
  0x000001cdf60c9af0:   vmovdqu %ymm0,0x30(%rcx,%r9,1)
  0x000001cdf60c9af7:   vmovdqu %ymm0,0x50(%rcx,%r9,1)
  0x000001cdf60c9afe:   vmovdqu %ymm0,0x70(%rcx,%r9,1)
  0x000001cdf60c9b05:   vmovdqu %ymm0,0x90(%rcx,%r9,1)
  0x000001cdf60c9b0f:   vmovdqu %ymm0,0xb0(%rcx,%r9,1)
  0x000001cdf60c9b19:   vmovdqu %ymm0,0xd0(%rcx,%r9,1)
  0x000001cdf60c9b23:   vmovdqu %ymm0,0xf0(%rcx,%r9,1)
  0x000001cdf60c9b2d:   vmovdqu %ymm0,0x110(%rcx,%r9,1)
  0x000001cdf60c9b37:   vmovdqu %ymm0,0x130(%rcx,%r9,1)
  0x000001cdf60c9b41:   vmovdqu %ymm0,0x150(%rcx,%r9,1)
  0x000001cdf60c9b4b:   vmovdqu %ymm0,0x170(%rcx,%r9,1)
  0x000001cdf60c9b55:   vmovdqu %ymm0,0x190(%rcx,%r9,1)
  0x000001cdf60c9b5f:   vmovdqu %ymm0,0x1b0(%rcx,%r9,1)
  0x000001cdf60c9b69:   vmovdqu %ymm0,0x1d0(%rcx,%r9,1)
  0x000001cdf60c9b73:   vmovdqu %ymm0,0x1f0(%rcx,%r9,1)
  0x000001cdf60c9b7d:   mov    %r8d,%r9d
  0x000001cdf60c9b80:   add    $0x200,%r9d
  0x000001cdf60c9b87:   cmp    %r10d,%r9d
  0x000001cdf60c9b8a:   jl     0x000001cdf60c9ae0

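You use the relative put method. It not only sets a byte in the ByteBuffer, but also updates the position field. Note that the vectorized loop above does not update position in memory. The JVM sets it just once after the loop, which it is allowed to do as long as nobody can observe an inconsistent state of the buffer.
Now try publishing the ByteBuffer before filling it: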

Bigish tmp = new Bigish();
volatileField = tmp;  // publish
tmp.setup(blackhole);
tmp.fill((byte)random.nextInt(255));

The loop optimization breaks: now the array bytes are filled one by one, and the position field is incremented accordingly.

  0x000001829b18ca5c:   nopl   0x0(%rax)
  0x000001829b18ca60:   cmp    %r11d,%esi
  0x000001829b18ca63:   jge    0x000001829b18ce34           ;*if_icmplt {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.nio.Buffer::nextPutIndex@10 (line 721)
                                                            ; - java.nio.HeapByteBuffer::put@6 (line 209)
                                                            ; - bench.MemPressureTest$Bigish::fill@22 (line 33)
                                                            ; - bench.MemPressureTest::test1@28 (line 47)
  0x000001829b18ca69:   mov    %esi,%ecx
  0x000001829b18ca6b:   add    %edx,%ecx                    ;*getfield position {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.nio.Buffer::nextPutIndex@1 (line 720)
                                                            ; - java.nio.HeapByteBuffer::put@6 (line 209)
                                                            ; - bench.MemPressureTest$Bigish::fill@22 (line 33)
                                                            ; - bench.MemPressureTest::test1@28 (line 47)
  0x000001829b18ca6d:   mov    %esi,%eax
  0x000001829b18ca6f:   inc    %eax                         ;*iinc {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - bench.MemPressureTest$Bigish::fill@26 (line 32)
                                                            ; - bench.MemPressureTest::test1@28 (line 47)
  0x000001829b18ca71:   mov    %eax,0x18(%r10)              ;*putfield position {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.nio.Buffer::nextPutIndex@25 (line 723)
                                                            ; - java.nio.HeapByteBuffer::put@6 (line 209)
                                                            ; - bench.MemPressureTest$Bigish::fill@22 (line 33)
                                                            ; - bench.MemPressureTest::test1@28 (line 47)
  0x000001829b18ca75:   cmp    %r8d,%ecx
  0x000001829b18ca78:   jae    0x000001829b18ce14
  0x000001829b18ca7e:   movslq %esi,%r9
  0x000001829b18ca81:   add    %r14,%r9
  0x000001829b18ca84:   mov    %bl,0x10(%rdi,%r9,1)         ;*bastore {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - java.nio.HeapByteBuffer::put@13 (line 209)
                                                            ; - bench.MemPressureTest$Bigish::fill@22 (line 33)
                                                            ; - bench.MemPressureTest::test1@28 (line 47)
  0x000001829b18ca89:   cmp    $0x1000,%eax
  0x000001829b18ca8f:   jge    0x000001829b18ca95           ;*if_icmpge {reexecute=0 rethrow=0 return_oop=0}
                                                            ; - bench.MemPressureTest$Bigish::fill@14 (line 32)
                                                            ; - bench.MemPressureTest::test1@28 (line 47)
  0x000001829b18ca91:   mov    %eax,%esi
  0x000001829b18ca93:   jmp    0x000001829b18ca5c

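This is exactly what happens in test2. Since the ByteBuffer object is external to the compilation scope, the JIT cannot optimize it as freely as a local, not-yet-published object.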
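For reference, here is a self-contained sketch of what "publishing before filling" looks like in plain Java. The class name PublishDemo and the declaration of volatileField are assumptions made for this illustration (the snippet earlier does not show where volatileField is declared), and it publishes the ByteBuffer directly rather than a Bigish wrapper. It only shows the shape of the code; it does not measure anything.

```java
import java.nio.ByteBuffer;

public class PublishDemo {
    static final int SIZE = 4096;

    // Assumed declaration: any field that another thread could read.
    static volatile ByteBuffer volatileField;

    static ByteBuffer createPublishAndFill(byte bt) {
        ByteBuffer b = ByteBuffer.allocate(SIZE);
        volatileField = b;   // publish: b is now reachable by other threads
        for (int i = 0; i < SIZE; ++i) {
            b.put(bt);       // relative put: stores a byte and advances position
        }
        return b;
    }

    public static void main(String[] args) {
        ByteBuffer b = createPublishAndFill((byte) 7);
        System.out.println(b.position()); // prints 4096
    }
}
```

Once the buffer is reachable through the field, every intermediate value of position is potentially observable, which is the reason given above for the byte-by-byte code.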

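Is it possible at all to optimize the fill loop when an external buffer is used?

The good news is that it is. Just use the absolute put method instead of the relative one. In this case the position field remains unchanged, and the JIT can easily vectorize the loop without the risk of breaking ByteBuffer invariants.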

for (int i = 0; i < SIZE; ++i) {
    b.put(i, bt);
}

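With this change, the loop is vectorized in both cases. Even better, test2 now becomes a lot faster than test1, proving that object creation does indeed have a performance overhead.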

Benchmark               Mode  Cnt      Score     Error   Units
MemPressureTest.test1  thrpt   10   2447,370 ± 146,804  ops/ms
MemPressureTest.test2  thrpt   10  15677,575 ± 136,075  ops/ms
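The semantic difference between the two put variants can be checked with a small sketch. The class and method names (PutModesDemo, fillRelative, fillAbsolute) are made up for this illustration; it demonstrates behavior only and says nothing about speed.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

public class PutModesDemo {
    static final int SIZE = 4096;

    // Relative put: each call stores one byte and increments position.
    static void fillRelative(ByteBuffer b, byte bt) {
        b.clear();
        for (int i = 0; i < SIZE; ++i) {
            b.put(bt);
        }
    }

    // Absolute put: stores at index i and leaves position untouched.
    static void fillAbsolute(ByteBuffer b, byte bt) {
        b.clear();
        for (int i = 0; i < SIZE; ++i) {
            b.put(i, bt);
        }
    }

    public static void main(String[] args) {
        ByteBuffer rel = ByteBuffer.allocate(SIZE);
        ByteBuffer abs = ByteBuffer.allocate(SIZE);
        fillRelative(rel, (byte) 1);
        fillAbsolute(abs, (byte) 1);
        System.out.println(rel.position()); // prints 4096
        System.out.println(abs.position()); // prints 0
        // The backing arrays end up with identical contents.
        System.out.println(Arrays.equals(rel.array(), abs.array())); // prints true
    }
}
```

One thing to keep in mind when switching: code that relies on position after the fill (for example a later flip()) needs position set explicitly, e.g. with b.position(SIZE), because absolute put leaves it at 0.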

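Conclusion

The counterintuitive performance difference was caused by the JVM's inability to vectorize the fill loop when the ByteBuffer is created outside the compilation scope.
Prefer the absolute get/put methods to the relative ones wherever possible. Absolute methods are usually faster: since they do not update the internal state of the ByteBuffer, the JIT can apply more aggressive optimizations.
As the modified benchmark shows, object creation does indeed have an overhead.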
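The same relative/absolute distinction applies on the read side. A minimal sketch, with hypothetical names (GetModesDemo, sumRelative, sumAbsolute), showing only how the two get variants treat position:

```java
import java.nio.ByteBuffer;

public class GetModesDemo {
    // Relative get: reads at position and advances it on every call.
    static long sumRelative(ByteBuffer b) {
        long sum = 0;
        b.rewind();
        while (b.hasRemaining()) {
            sum += b.get();
        }
        return sum;
    }

    // Absolute get: reads by index; position stays where it was.
    static long sumAbsolute(ByteBuffer b) {
        long sum = 0;
        for (int i = 0, n = b.limit(); i < n; ++i) {
            sum += b.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        ByteBuffer b = ByteBuffer.allocate(16);
        for (int i = 0; i < 16; ++i) {
            b.put(i, (byte) i);
        }
        System.out.println(sumAbsolute(b) + " " + b.position()); // prints 120 0
        System.out.println(sumRelative(b) + " " + b.position()); // prints 120 16
    }
}
```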