Fastest way to read and write large files line by line in Java

Question

java performance file-io bufferedreader

22

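I have been looking for the fastest way to read and write large files (0.5 - 1 GB) in Java with limited memory (about 64 MB). Each line in the file represents one record, so I need to get them line by line. The file is a plain text file.

I tried BufferedReader and BufferedWriter, but they do not seem to be the best option. Reading and writing a 0.5 GB file takes about 35 seconds, with no processing in between. I think the bottleneck is the writing, because reading alone takes about 10 seconds.

I also tried reading into byte arrays, but searching for the lines inside each array takes even more time.

Any suggestions? Thanks.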

- user1785771

1

See "Best way to read a text file". - DNA

Possible duplicate: https://dev59.com/iHNA5IYBdhLWcg3wL6oc - aphex

What encoding do these files use? What is your system's default charset? - Donal Fellows

6 Answers

9

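I would first try increasing the buffer size of the BufferedReader and BufferedWriter. The default buffer sizes are not documented, but at least in the Oracle VM they are 8192 characters, which does not give much of a performance advantage.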
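A minimal sketch of what a larger buffer could look like for a line-by-line copy. The class name `LargeBufferLineCopy`, the method `copyLines`, and the 1 Mi-character buffer size are assumptions made for illustration, not measured recommendations; both `BufferedReader` and `BufferedWriter` accept the buffer size as a second constructor argument.

```java
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.OutputStreamWriter;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

class LargeBufferLineCopy {
    // Assumed buffer size in chars (about 2 MB of heap per buffer); tune by measuring.
    static final int BUFFER_CHARS = 1 << 20;

    /** Copies src to dst line by line and returns the number of lines copied. */
    static long copyLines(Path src, Path dst, Charset cs) throws IOException {
        long lines = 0;
        try (BufferedReader in = new BufferedReader(
                 new InputStreamReader(Files.newInputStream(src), cs), BUFFER_CHARS);
             BufferedWriter out = new BufferedWriter(
                 new OutputStreamWriter(Files.newOutputStream(dst), cs), BUFFER_CHARS)) {
            String line;
            while ((line = in.readLine()) != null) {
                // Per-record processing would go here.
                out.write(line);
                out.newLine(); // writes the platform line separator
                lines++;
            }
        }
        return lines;
    }

    public static void main(String[] args) throws IOException {
        // Small temporary files stand in for the real 0.5 - 1 GB input.
        Path src = Files.createTempFile("in", ".txt");
        Path dst = Files.createTempFile("out", ".txt");
        Files.write(src, Arrays.asList("a", "b", "c"), StandardCharsets.UTF_8);
        System.out.println(copyLines(src, dst, StandardCharsets.UTF_8) + " lines copied");
    }
}
```

Only one line plus the two buffers is held in memory at a time, so this shape should fit within a heap of about 64 MB as long as individual lines are not huge.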

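If you only need to copy the file (and do not need actual access to the data), I would drop the Reader / Writer approach and work directly with InputStream and OutputStream, using a byte array as the buffer: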

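int bufferSize = 64 * 1024; // assumed value; the original leaves bufferSize undefined
// try-with-resources closes both streams even if the copy fails
try (FileInputStream fis = new FileInputStream("d:/test.txt");
     FileOutputStream fos = new FileOutputStream("d:/test2.txt")) {
    byte[] b = new byte[bufferSize];
    int r;
    // read() returns -1 at end of file
    while ((r = fis.read(b)) >= 0) {
        fos.write(b, 0, r);
    }
}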

Or actually use NIO:

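try (FileChannel in = new RandomAccessFile("d:/test.txt", "r").getChannel();
     FileChannel out = new RandomAccessFile("d:/test2.txt", "rw").getChannel()) {
    // transferFrom may copy fewer bytes than requested, so loop until done
    long size = in.size();
    long pos = 0;
    while (pos < size) {
        pos += out.transferFrom(in, pos, size - pos);
    }
}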

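When benchmarking the different copy methods, however, I see much larger differences (in duration) between individual runs of the benchmark than between the different implementations. I/O caching (both at the OS level and in the hard disk cache) plays a big role here, and it is very hard to say which approach is faster. On my hardware, copying a 1 GB text file line by line with BufferedReader and BufferedWriter takes less than 5 seconds in some runs and more than 30 seconds in others.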
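Because run-to-run variation can dominate, it may help to time each candidate several times and look at the spread rather than a single number. The following is a rough sketch under that assumption; the class `RepeatedTiming`, the method `timeRuns`, the run count of 7, and the 8 MB placeholder workload are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;
import java.util.Arrays;

class RepeatedTiming {
    /** Runs the task `runs` times and returns the elapsed nanoseconds per run, sorted ascending. */
    static long[] timeRuns(Runnable task, int runs) {
        long[] nanos = new long[runs];
        for (int i = 0; i < runs; i++) {
            long start = System.nanoTime();
            task.run();
            nanos[i] = System.nanoTime() - start;
        }
        Arrays.sort(nanos);
        return nanos;
    }

    public static void main(String[] args) throws IOException {
        // Placeholder workload: copy an 8 MB temp file. Substitute the routine under test.
        Path src = Files.createTempFile("bench", ".in");
        Path dst = Files.createTempFile("bench", ".out");
        Files.write(src, new byte[8 << 20]);
        Runnable copy = () -> {
            try {
                Files.copy(src, dst, StandardCopyOption.REPLACE_EXISTING);
            } catch (IOException e) {
                throw new UncheckedIOException(e);
            }
        };
        long[] t = timeRuns(copy, 7);
        System.out.printf("min %.3f s, median %.3f s, max %.3f s%n",
                t[0] / 1e9, t[t.length / 2] / 1e9, t[t.length - 1] / 1e9);
        Files.delete(src);
        Files.delete(dst);
    }
}
```

If the gap between the minimum and maximum is larger than the gap between two implementations, the measurement probably says more about caching than about the code.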

- jarnbjo

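Thanks for the suggestions, but my memory is limited, so I cannot use the FileChannel approach. - user1785771

Why not? What does the available memory have to do with using a FileChannel? - jarnbjo

Actually, I need to process the file before copying it. - user1785771

Then why did you write that you do not process the file (read/write only, no processing)? - jarnbjo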

4

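I have written an extensive article about the many ways of reading files in Java, testing them against each other with sample files from 1 KB to 1 GB. I found that the following 3 methods were the fastest for reading a 1 GB file:

1) java.nio.file.Files.readAllBytes() - took less than 1 second to read a 1 GB test file.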

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;

public class ReadFile_Files_ReadAllBytes {
  public static void main(String [] pArgs) throws IOException {
    String fileName = "c:\\temp\\sample-10KB.txt";
    File file = new File(fileName);

    byte [] fileBytes = Files.readAllBytes(file.toPath());
    char singleChar;
    for(byte b : fileBytes) {
      singleChar = (char) b;
      System.out.print(singleChar);
    }
  }
}

2) java.nio.file.Files.lines() - took about 3.5 seconds to read a 1 GB test file.

import java.io.File;
import java.io.IOException;
import java.nio.file.Files;
import java.util.stream.Stream;

public class ReadFile_Files_Lines {
  public static void main(String[] pArgs) throws IOException {
    String fileName = "c:\\temp\\sample-10KB.txt";
    File file = new File(fileName);

    try (Stream<String> linesStream = Files.lines(file.toPath())) {
      linesStream.forEach(line -> {
        System.out.println(line);
      });
    }
  }
}

3) java.io.BufferedReader - took about 4.5 seconds to read a 1 GB test file.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

public class ReadFile_BufferedReader_ReadLine {
  public static void main(String [] args) throws IOException {
    String fileName = "c:\\temp\\sample-10KB.txt";
    FileReader fileReader = new FileReader(fileName);

    try (BufferedReader bufferedReader = new BufferedReader(fileReader)) {
      String line;
      while((line = bufferedReader.readLine()) != null) {
        System.out.println(line);
      }
    }
  }
}

- gomisha

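That article is still great. Have you looked at how those top three compare in memory consumption? Is there a significant difference, and which one should be used when only a small amount of memory is available? - Jokkeri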

4

In Java 7 you can use the Files.readAllLines() and Files.write() methods. Here is an example:

List<String> readTextFile(String fileName) throws IOException {
    Path path = Paths.get(fileName);
    return Files.readAllLines(path, StandardCharsets.UTF_8);
}

void writeTextFile(List<String> strLines, String fileName) throws IOException {
    Path path = Paths.get(fileName);
    Files.write(path, strLines, StandardCharsets.UTF_8);
}
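Files.readAllLines() returns every line in a single List, so the whole file is held on the heap at once. With only about 64 MB available, a streaming variant may be a better fit. The sketch below is an assumption-laden illustration rather than part of the methods above: it relies on Files.lines() from Java 8 (not Java 7), and the names `StreamingLineCopy`, `transform`, and `perLine` are hypothetical.

```java
import java.io.IOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;
import java.util.function.Function;
import java.util.stream.Stream;

class StreamingLineCopy {
    /**
     * Reads src lazily, applies perLine to each line, and writes the result to dst.
     * dst must be a different file from src, since src is still being read while dst is written.
     */
    static void transform(Path src, Path dst, Charset cs,
                          Function<String, String> perLine) throws IOException {
        try (Stream<String> lines = Files.lines(src, cs)) {
            Stream<String> processed = lines.map(perLine);
            // Files.write accepts any Iterable and walks it once, so lines are pulled on demand.
            Iterable<String> iterable = processed::iterator;
            Files.write(dst, iterable, cs);
        }
    }

    public static void main(String[] args) throws IOException {
        // Small temporary files stand in for the real input and output.
        Path src = Files.createTempFile("records", ".txt");
        Path dst = Files.createTempFile("records", ".out");
        Files.write(src, Arrays.asList("first", "second"), StandardCharsets.UTF_8);
        transform(src, dst, StandardCharsets.UTF_8, String::trim);
        System.out.println(Files.readAllLines(dst, StandardCharsets.UTF_8));
    }
}
```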

- Oleg K

1

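This is mainly about avoiding an OutOfMemoryError, which iterating with the Scanner class handles well: it reads the file line by line instead of loading the whole file at once.

The following code addresses the problem: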

try(FileInputStream inputStream =new FileInputStream("D:\\File\\test.txt");
  Scanner sc= new Scanner(inputStream, "UTF-8")) {
  while (sc.hasNextLine()) {
    String line = sc.nextLine();
    System.out.println(line);
  }
} catch (IOException e) {
  e.printStackTrace();
}

- vkstream

1

I suggest looking at the classes in the java.nio package. Non-blocking IO may be faster for sockets:

http://docs.oracle.com/javase/6/docs/api/java/nio/package-summary.html

This article has benchmark data to back that up:

http://vanillajava.blogspot.com/2010/07/java-nio-is-faster-than-java-io-for.html

- duffymo

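I looked into nio, but it only lets me read arrays or buffers from the file. Processing such an array to extract the lines takes longer. - user1785771

The article has a chart, but I cannot find what was actually measured. The only case where I have seen a performance benefit from NIO is when using direct byte buffers to copy data between NIO channels. In that case, accessing the data from Java code is much slower. - jarnbjo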
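To make the extra work of extracting lines from raw byte chunks concrete, here is a rough sketch of the bookkeeping involved. It makes no claim about speed. It assumes an encoding in which the byte 0x0A only ever means a line feed (ASCII, ISO-8859-1, UTF-8, but not UTF-16), and the names `ChunkedLineReader`, `forEachLine`, and `emit` are hypothetical.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.function.Consumer;

class ChunkedLineReader {
    /** Reads the stream in fixed-size chunks and passes each complete line to action. */
    static void forEachLine(InputStream in, int chunkSize, Charset cs,
                            Consumer<String> action) throws IOException {
        byte[] chunk = new byte[chunkSize];
        // Holds the bytes of a line that has not been terminated yet (may span chunks).
        ByteArrayOutputStream pending = new ByteArrayOutputStream();
        int n;
        while ((n = in.read(chunk)) != -1) {
            int start = 0;
            for (int i = 0; i < n; i++) {
                if (chunk[i] == '\n') {
                    pending.write(chunk, start, i - start);
                    emit(pending, cs, action);
                    start = i + 1;
                }
            }
            pending.write(chunk, start, n - start); // keep the unfinished tail
        }
        if (pending.size() > 0) {
            emit(pending, cs, action); // last line without a trailing newline
        }
    }

    private static void emit(ByteArrayOutputStream buf, Charset cs, Consumer<String> action) {
        byte[] bytes = buf.toByteArray();
        int len = bytes.length;
        if (len > 0 && bytes[len - 1] == '\r') {
            len--; // tolerate CRLF line endings
        }
        action.accept(new String(bytes, 0, len, cs));
        buf.reset();
    }

    public static void main(String[] args) throws IOException {
        byte[] data = "one\ntwo\r\nthree".getBytes(StandardCharsets.UTF_8);
        // A tiny chunk size forces lines to span chunk boundaries.
        forEachLine(new ByteArrayInputStream(data), 4, StandardCharsets.UTF_8, System.out::println);
    }
}
```

Each line is decoded only after all of its bytes have been collected, so a multi-byte character split across two chunks is still decoded correctly.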

- Peter Lawrey · Accepted Answer

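I suspect your real problem is limited hardware, so changes in software will not make much difference. If you have plenty of memory and CPU, some advanced tricks can help, but if you are simply waiting on the hard drive because the file is not cached, changing the software will not do much.

By the way, 500 MB in 10 seconds, or 50 MB per second, is a typical read speed for an HDD.

Try running the following to see at what point your system can no longer cache the file efficiently.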

public static void main(String... args) throws IOException {
    for (int mb : new int[]{50, 100, 250, 500, 1000, 2000})
        testFileSize(mb);
}

private static void testFileSize(int mb) throws IOException {
    File file = File.createTempFile("test", ".txt");
    file.deleteOnExit();
    char[] chars = new char[1024];
    Arrays.fill(chars, 'A');
    String longLine = new String(chars);
    long start1 = System.nanoTime();
    PrintWriter pw = new PrintWriter(new FileWriter(file));
    for (int i = 0; i < mb * 1024; i++)
        pw.println(longLine);
    pw.close();
    long time1 = System.nanoTime() - start1;
    System.out.printf("Took %.3f seconds to write to a %d MB, file rate: %.1f MB/s%n",
            time1 / 1e9, file.length() >> 20, file.length() * 1000.0 / time1);

    long start2 = System.nanoTime();
    BufferedReader br = new BufferedReader(new FileReader(file));
    for (String line; (line = br.readLine()) != null; ) {
    }
    br.close();
    long time2 = System.nanoTime() - start2;
    System.out.printf("Took %.3f seconds to read to a %d MB file, rate: %.1f MB/s%n",
            time2 / 1e9, file.length() >> 20, file.length() * 1000.0 / time2);
    file.delete();
}

On a Linux machine with lots of memory:

Took 0.395 seconds to write to a 50 MB, file rate: 133.0 MB/s
Took 0.375 seconds to read to a 50 MB file, rate: 140.0 MB/s
Took 0.669 seconds to write to a 100 MB, file rate: 156.9 MB/s
Took 0.569 seconds to read to a 100 MB file, rate: 184.6 MB/s
Took 1.585 seconds to write to a 250 MB, file rate: 165.5 MB/s
Took 1.274 seconds to read to a 250 MB file, rate: 206.0 MB/s
Took 2.513 seconds to write to a 500 MB, file rate: 208.8 MB/s
Took 2.332 seconds to read to a 500 MB file, rate: 225.1 MB/s
Took 5.094 seconds to write to a 1000 MB, file rate: 206.0 MB/s
Took 5.041 seconds to read to a 1000 MB file, rate: 208.2 MB/s
Took 11.509 seconds to write to a 2001 MB, file rate: 182.4 MB/s
Took 9.681 seconds to read to a 2001 MB file, rate: 216.8 MB/s

On a Windows machine with lots of memory:

Took 0.376 seconds to write to a 50 MB, file rate: 139.7 MB/s
Took 0.401 seconds to read to a 50 MB file, rate: 131.1 MB/s
Took 0.517 seconds to write to a 100 MB, file rate: 203.1 MB/s
Took 0.520 seconds to read to a 100 MB file, rate: 201.9 MB/s
Took 1.344 seconds to write to a 250 MB, file rate: 195.4 MB/s
Took 1.387 seconds to read to a 250 MB file, rate: 189.4 MB/s
Took 2.368 seconds to write to a 500 MB, file rate: 221.8 MB/s
Took 2.454 seconds to read to a 500 MB file, rate: 214.1 MB/s
Took 4.985 seconds to write to a 1001 MB, file rate: 210.7 MB/s
Took 5.132 seconds to read to a 1001 MB file, rate: 204.7 MB/s
Took 10.276 seconds to write to a 2003 MB, file rate: 204.5 MB/s
Took 9.964 seconds to read to a 2003 MB file, rate: 210.9 MB/s