Comparing get/put operations on direct and non-direct ByteBuffers

Question

java memory nio bytebuffer

12

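Is get/put on a non-direct ByteBuffer faster than get/put on a direct ByteBuffer?

If I have to read from or write to a direct ByteBuffer, is it better to first read/write into a thread-local byte array and then update the direct ByteBuffer in full from that byte array (for writes)?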

- user882659

2 Answers

2

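Direct buffers hold their data in JNI land, so get() and put() have to cross the JNI boundary. Non-direct buffers hold their data in JVM land.

So:

1. If you aren't touching the data in Java at all (for example, just copying one channel to another), direct buffers are faster, because the data never has to cross the JNI boundary.
2. Conversely, if you are working with the data in Java, a non-direct buffer will be faster. Whether the difference is significant depends on how much data has to cross the JNI boundary and on the size of each transfer. For example, getting or putting a single byte at a time from/to a direct buffer could get very expensive, whereas getting/putting 16384 bytes at a time would amortize the JNI boundary cost considerably.

As to your second paragraph, I would use a local byte[] array rather than a thread-local one. But if I were working with the data in Java, I wouldn't use a direct byte buffer at all. As the Javadoc says, direct byte buffers should only be used where they deliver a measurable performance benefit.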
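The point about transfer size can be illustrated with a minimal sketch, assuming a 256-byte message built in a plain local byte[] and then moved into a direct buffer either one byte at a time or with a single bulk put(byte[]). The class name BulkPutSketch and the encodeMessage helper are hypothetical and only serve the illustration; the sketch shows the two call patterns, it does not measure them.

```java
import java.nio.ByteBuffer;

class BulkPutSketch {
    // Hypothetical encoder: fills a plain local array with a dummy message.
    static byte[] encodeMessage(int size) {
        byte[] msg = new byte[size];
        for (int i = 0; i < size; i++) {
            msg[i] = (byte) i;
        }
        return msg;
    }

    // Byte-at-a-time copy: one put() call per byte.
    static void copyPerByte(byte[] src, ByteBuffer dst) {
        dst.clear();
        for (byte b : src) {
            dst.put(b);
        }
        dst.flip();
    }

    // Bulk copy: a single put(byte[]) call moves the whole array.
    static void copyBulk(byte[] src, ByteBuffer dst) {
        dst.clear();
        dst.put(src);
        dst.flip();
    }

    public static void main(String[] args) {
        byte[] msg = encodeMessage(256);
        ByteBuffer a = ByteBuffer.allocateDirect(256);
        ByteBuffer b = ByteBuffer.allocateDirect(256);
        copyPerByte(msg, a);
        copyBulk(msg, b);
        // Both buffers end up with the same readable contents.
        System.out.println("same contents: " + a.equals(b));
    }
}
```

Either way the buffer is flipped afterwards, so it is ready to be handed to a channel write.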

- user207421

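Thanks. My messages are typically 256 bytes and I want to write them to a socket. I was thinking of encoding the bytes into a thread-local byte[] array, copying that array into a direct ByteBuffer, and then handing the direct ByteBuffer to the socket channel for writing. - user882659

The direct byte buffers are pooled. Is that the better approach, or would you suggest encoding the message straight into the direct byte buffer instead of going through a temporary byte array? - user882659

@user882659 See edit. There is no benefit to using a direct buffer in this case. - user207421

Please refer to my comment of "Jun 24 at 6:09". Obviously JNI is involved at some point to perform the actual I/O. But there is no evidence that get or put makes a JNI call to read/write the data in the buffer. There are other ways to do it. Neither statement contradicts this. - Stephen C

get/put doesn't cross any boundary. It is compiled to native code, similar to memcpy(arr, &value, sizeof(value)); once JIT-compiled it is as fast as assembly. In 2015 there is no JNI needed. - Kr0e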

- Peter Lawrey · Accepted Answer

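If you compare a non-direct (heap) byte buffer with a direct byte buffer and neither uses native byte order (most systems are little endian, while the default order of a direct ByteBuffer is big endian), the performance is very similar. If you use native-ordered byte buffers, performance can be significantly better for multi-byte values. For byte, it makes little difference whatever you do. In HotSpot/OpenJDK, ByteBuffer uses the Unsafe class, and many of its native methods are treated as intrinsics. This is JVM dependent; AFAIK the Android VM treats them as intrinsics in recent versions. If you dump the generated assembly, you can see that the intrinsics in Unsafe are turned into a single machine code instruction, i.e. they don't have the overhead of a JNI call.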
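As a small illustration of the byte-order point, here is a sketch that checks the default order of a freshly allocated direct buffer and shows how the chosen order changes the byte layout of an int. The class name ByteOrderSketch and the firstByte helper are made up for this example.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

class ByteOrderSketch {
    // Returns the byte stored at offset 0 after writing 0x01020304 with the given order.
    static byte firstByte(ByteOrder order) {
        ByteBuffer bb = ByteBuffer.allocateDirect(4).order(order);
        bb.putInt(0, 0x01020304);
        return bb.get(0);
    }

    public static void main(String[] args) {
        ByteBuffer fresh = ByteBuffer.allocateDirect(4);
        // A new ByteBuffer starts out big endian regardless of the platform.
        System.out.println("default order: " + fresh.order());
        // The platform's own order, which is what nativeOrder() in the benchmark selects.
        System.out.println("native order:  " + ByteOrder.nativeOrder());
        System.out.println("big-endian first byte:    " + firstByte(ByteOrder.BIG_ENDIAN));
        System.out.println("little-endian first byte: " + firstByte(ByteOrder.LITTLE_ENDIAN));
    }
}
```

When the buffer's order differs from the native order, each multi-byte access involves a byte swap, which is why order(ByteOrder.nativeOrder()) matters for int and long values but not for single bytes.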

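In fact, if you are into micro-tuning, you may find that most of the time in a ByteBuffer getXxxx or putXxxx call is spent on bounds checking rather than on the actual memory access. For this reason I still use Unsafe directly when I need maximum performance (note: this is discouraged by Oracle).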
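To make the bounds check visible, here is a minimal sketch: an absolute getInt on an 8-byte direct buffer succeeds while four bytes are available at the index and throws IndexOutOfBoundsException otherwise. The class name BoundsCheckSketch and the isRejected helper are hypothetical; the sketch only demonstrates that the check exists, not what it costs.

```java
import java.nio.ByteBuffer;

class BoundsCheckSketch {
    // Returns true if an absolute getInt at the given index is rejected by the bounds check.
    static boolean isRejected(ByteBuffer bb, int index) {
        try {
            bb.getInt(index);
            return false;
        } catch (IndexOutOfBoundsException e) {
            return true;
        }
    }

    public static void main(String[] args) {
        ByteBuffer bb = ByteBuffer.allocateDirect(8);
        System.out.println("index 4 rejected:  " + isRejected(bb, 4));  // 4 bytes fit before the limit
        System.out.println("index 5 rejected:  " + isRejected(bb, 5));  // would read past the limit
        System.out.println("index -1 rejected: " + isRejected(bb, -1)); // negative index
    }
}
```

Raw Unsafe access performs no such check, so a bad address there can corrupt memory or crash the JVM instead of throwing.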

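"If I have to read from or write to a direct ByteBuffer, is it better to first read/write into a thread-local byte array and then update the direct ByteBuffer in full from that byte array (for writes)?"

I would hate to see anything that is better than that. ;) It sounds very complicated.

Often the simplest solution is better and faster.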
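One way to read "the simplest solution" for the 256-byte message case is to encode straight into the direct buffer and hand it to the channel, with no intermediate byte array. The following is only a sketch under assumptions: the message layout (an int id, a long timestamp, a payload), the class name DirectEncodeSketch, and both helpers are hypothetical, and a blocking channel backed by a ByteArrayOutputStream stands in for the socket channel so the example runs without a network.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.nio.ByteBuffer;
import java.nio.channels.Channels;
import java.nio.channels.WritableByteChannel;

class DirectEncodeSketch {
    // Hypothetical message layout: an int id, a long timestamp, then the payload bytes.
    static void encode(ByteBuffer out, int id, long timestamp, byte[] payload) {
        out.clear();
        out.putInt(id);
        out.putLong(timestamp);
        out.put(payload);
        out.flip(); // switch the buffer from filling to draining
    }

    // Drains the buffer into a blocking channel; one write() may accept fewer bytes than offered.
    static void writeFully(WritableByteChannel ch, ByteBuffer buf) throws IOException {
        while (buf.hasRemaining()) {
            ch.write(buf);
        }
    }

    public static void main(String[] args) throws IOException {
        ByteBuffer buf = ByteBuffer.allocateDirect(256); // would come from the pool in the asker's setup
        ByteArrayOutputStream sink = new ByteArrayOutputStream();
        WritableByteChannel ch = Channels.newChannel(sink); // stand-in for a SocketChannel
        encode(buf, 42, 1234L, new byte[] {1, 2, 3});
        writeFully(ch, buf);
        System.out.println("bytes written: " + sink.size());
    }
}
```

With a non-blocking SocketChannel the write loop would normally be driven by a Selector rather than spinning.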

You can test this yourself with the following code.

import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Wrapper class added so the snippet compiles on its own; the name is arbitrary.
public class ByteBufferCopyTest {
    public static void main(String... args) {
        ByteBuffer bb1 = ByteBuffer.allocateDirect(256 * 1024).order(ByteOrder.nativeOrder());
        ByteBuffer bb2 = ByteBuffer.allocateDirect(256 * 1024).order(ByteOrder.nativeOrder());
        // Repeat the test so the later runs show JIT-compiled performance.
        for (int i = 0; i < 10; i++)
            runTest(bb1, bb2);
    }

    private static void runTest(ByteBuffer bb1, ByteBuffer bb2) {
        bb1.clear();
        bb2.clear();
        long start = System.nanoTime();
        // Copy bb1 into bb2 one int at a time.
        while (bb2.remaining() > 0)
            bb2.putInt(bb1.getInt());
        long time = System.nanoTime() - start;
        // One getInt plus one putInt per four bytes.
        int operations = bb1.capacity() / 4 * 2;
        System.out.printf("Each putInt/getInt took an average of %.1f ns%n", (double) time / operations);
    }
}

prints

Each putInt/getInt took an average of 83.9 ns
Each putInt/getInt took an average of 1.4 ns
Each putInt/getInt took an average of 34.7 ns
Each putInt/getInt took an average of 1.3 ns
Each putInt/getInt took an average of 1.2 ns
Each putInt/getInt took an average of 1.3 ns
Each putInt/getInt took an average of 1.2 ns
Each putInt/getInt took an average of 1.2 ns
Each putInt/getInt took an average of 1.2 ns
Each putInt/getInt took an average of 1.2 ns

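I am pretty sure a JNI call takes longer than 1.2 ns.

To demonstrate that the delay comes not from the "JNI" call but from the code around it, you can write the same loop using Unsafe directly.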

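import java.lang.reflect.Field;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import sun.misc.Unsafe;
import sun.nio.ch.DirectBuffer;

// Wrapper class added so the snippet compiles on its own; the name is arbitrary.
// Note: on JDK 9 and later, sun.nio.ch is not exported by default, so this likely
// needs --add-exports java.base/sun.nio.ch=ALL-UNNAMED to compile and run.
public class UnsafeCopyTest {
    public static void main(String... args) {
        ByteBuffer bb1 = ByteBuffer.allocateDirect(256 * 1024).order(ByteOrder.nativeOrder());
        ByteBuffer bb2 = ByteBuffer.allocateDirect(256 * 1024).order(ByteOrder.nativeOrder());
        for (int i = 0; i < 10; i++)
            runTest(bb1, bb2);
    }

    private static void runTest(ByteBuffer bb1, ByteBuffer bb2) {
        Unsafe unsafe = getTheUnsafe();
        long start = System.nanoTime();
        // Raw native addresses of the two direct buffers.
        long addr1 = ((DirectBuffer) bb1).address();
        long addr2 = ((DirectBuffer) bb2).address();
        // Copy bb2 into bb1 one int at a time, with no bounds checks.
        for (int i = 0, len = Math.min(bb1.capacity(), bb2.capacity()); i < len; i += 4)
            unsafe.putInt(addr1 + i, unsafe.getInt(addr2 + i));
        long time = System.nanoTime() - start;
        int operations = bb1.capacity() / 4 * 2;
        System.out.printf("Each putInt/getInt took an average of %.1f ns%n", (double) time / operations);
    }

    // Obtains the Unsafe singleton reflectively, since Unsafe.getUnsafe() rejects application code.
    public static Unsafe getTheUnsafe() {
        try {
            Field theUnsafe = Unsafe.class.getDeclaredField("theUnsafe");
            theUnsafe.setAccessible(true);
            return (Unsafe) theUnsafe.get(null);
        } catch (Exception e) {
            throw new AssertionError(e);
        }
    }
}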

prints

Each putInt/getInt took an average of 40.4 ns
Each putInt/getInt took an average of 44.4 ns
Each putInt/getInt took an average of 0.4 ns
Each putInt/getInt took an average of 0.3 ns
Each putInt/getInt took an average of 0.3 ns
Each putInt/getInt took an average of 0.3 ns
Each putInt/getInt took an average of 0.3 ns
Each putInt/getInt took an average of 0.3 ns
Each putInt/getInt took an average of 0.3 ns
Each putInt/getInt took an average of 0.3 ns

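So you can see that the native call is much faster than you would expect of a JNI call. The main reason for this delay could be the L2 cache speed. ;)

All of this was run on a 3.3 GHz i3.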