Finding the best buffer size for BufferedInputStream in Java

7

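I am profiling code that loads a binary file. The load takes about 15 seconds.

Almost all of that time is spent in the method that reads the binary data.

I create my DataInputStream with the following code: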

is = new DataInputStream(
     new GZIPInputStream(
     new FileInputStream("file.bin")));

I changed it to this:

is = new DataInputStream(
     new BufferedInputStream(
     new GZIPInputStream(
     new FileInputStream("file.bin"))));

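With this small change, the load time dropped from 15 seconds to 4 seconds.

However, I noticed that BufferedInputStream has two constructors. The other one lets you specify the buffer size explicitly.

I have two questions:

  1. What buffer size does BufferedInputStream pick by default? Is it ideal? If not, how do I find the best buffer size? Should I write a small piece of code that does a binary search?
  2. Is this the best way to use BufferedInputStream? Originally I had it inside the GZIPInputStream, but that gave almost no benefit. I assume that what the code does now is this: whenever the buffer needs to be filled, the GZIP input stream decodes x bytes (where x is the buffer size). Is it worth dropping GZIPInputStream entirely? It is certainly not required, but my file is much smaller when I use it.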
2 Answers

9

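Both GZIPInputStream and BufferedInputStream use an internal buffer. That is why putting a BufferedInputStream inside the GZIPInputStream gives no benefit. The catch with GZIPInputStream is that it does not buffer the output it produces, which is why your current version is faster.

The default buffer size of BufferedInputStream is 8 KB, so you can try increasing or decreasing it to see whether that helps. I don't think the exact number matters much, so you can simply multiply or divide by two.

If the file is small, you can also try buffering it completely. In theory, that should give you the best performance. You can also try increasing the buffer size of GZIPInputStream (512 bytes by default), since that may speed up reading from disk.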
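
To make the two suggestions above concrete, here is a minimal sketch that passes explicit sizes to both constructors and, as an alternative, reads a small file completely into memory before decompressing. The class name, method names and the two size constants are made up for illustration and are not tuned values, so measure before adopting them.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.zip.GZIPInputStream;

class StreamSetup {
    // Hypothetical sizes chosen for illustration only; measure on your own data.
    static final int GZIP_INPUT_BUFFER = 64 * 1024; // compressed bytes read per refill
    static final int DECODED_BUFFER = 16 * 1024;    // decompressed bytes held for small reads

    // Same stream order as in the question, but with both buffer sizes given explicitly.
    static DataInputStream openWithExplicitBuffers(String path) throws IOException {
        return new DataInputStream(
            new BufferedInputStream(
                new GZIPInputStream(new FileInputStream(path), GZIP_INPUT_BUFFER),
                DECODED_BUFFER));
    }

    // For small files: load the compressed file in one go, then decompress from memory.
    static DataInputStream openFullyBuffered(String path) throws IOException {
        byte[] compressed = Files.readAllBytes(Paths.get(path));
        return new DataInputStream(
            new BufferedInputStream(
                new GZIPInputStream(new ByteArrayInputStream(compressed), GZIP_INPUT_BUFFER),
                DECODED_BUFFER));
    }
}
```

The second method only makes sense when the compressed file comfortably fits in the heap.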


I suggest trying a 64K buffer for the GZIPInputStream when reading from disk. I use 1 MB, which is probably more than is really needed. ;) - Peter Lawrey

4
  1. Don't bother with a coded binary search. Just try some values by hand and compare the timings (you can do a manual binary search if you like). You'll most likely find that a very wide range of buffer sizes will give you close-to-optimal performance, so pick the smallest that does the trick.

  2. What you have is the correct order:

    is = new DataInputStream(
         new BufferedInputStream(
         new GZIPInputStream(
         new FileInputStream("file.bin"))));
    

    There is little point in putting a BufferedInputStream inside the GZIPInputStream since the latter already buffers its input (but not the output.)

    Removing GZIPInputStream might be a win, but will most likely be detrimental to performance if the data has to be read from disk and is not resident in the filesystem cache. The reason is that reading from disk is very slow and decompressing gzip is very fast. Therefore it is generally cheaper to read less data from disk and decompress it in memory than it is to read more data from disk.
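
If you want something slightly more systematic than editing the constant by hand, a rough timing loop along these lines might do. It is only a sketch: the class and method names are made up, the single-byte reads merely stand in for whatever small reads your real loader does, and the numbers will be skewed by JIT warm-up and by whether the file is already in the filesystem cache.

```java
import java.io.BufferedInputStream;
import java.io.DataInputStream;
import java.io.FileInputStream;
import java.io.IOException;
import java.util.zip.GZIPInputStream;

class BufferSizeTiming {
    // Reads the whole decompressed stream one byte at a time and returns the byte count.
    static long readAll(String path, int bufferSize) throws IOException {
        try (DataInputStream in = new DataInputStream(
                new BufferedInputStream(
                    new GZIPInputStream(new FileInputStream(path)),
                    bufferSize))) {
            long count = 0;
            while (in.read() != -1) {
                count++;
            }
            return count;
        }
    }

    public static void main(String[] args) throws IOException {
        // Falls back to the file name used in the question when no argument is given.
        String path = args.length > 0 ? args[0] : "file.bin";
        readAll(path, 8 * 1024); // warm-up pass, result discarded
        // Double the buffer size from 1 KB up to 1 MB and print the time for each.
        for (int size = 1024; size <= 1024 * 1024; size *= 2) {
            long start = System.nanoTime();
            long bytes = readAll(path, size);
            long millis = (System.nanoTime() - start) / 1_000_000;
            System.out.printf("%8d-byte buffer: %d bytes in %d ms%n", size, bytes, millis);
        }
    }
}
```

Run it a few times and look for the point where the times flatten out, then pick the smallest size in that flat region.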

