Fastest way to write huge data in a text file in Java

74

I need to write huge amounts of data to a text [csv] file. I used BufferedWriter to write the data, and it took around 40 seconds to write 174 MB of data. Is this the fastest speed Java can offer?

bufferedWriter = new BufferedWriter ( new FileWriter ( "fileName.csv" ) );

Note: these 40 seconds include the time for iterating over and fetching the records from the result set as well. :) 174 MB corresponds to 400,000 rows in the result set.


6
You don't happen to have antivirus software active on the machine where you run this code, do you? - Thorbjørn Ravn Andersen
7 Answers

109

You might try removing the BufferedWriter and just using the FileWriter directly. On a modern system there's a good chance you're just writing to the drive's cache memory anyway.

On my machine (a dual-core 2.4 GHz Dell running Windows XP with an 80 GB, 7200 RPM Hitachi disk), it takes about 4-5 seconds to write 175 MB (4 million strings).

Can you isolate how much of the time is record retrieval and how much is file writing?

import java.io.BufferedWriter;
import java.io.File;
import java.io.FileWriter;
import java.io.IOException;
import java.io.Writer;
import java.util.ArrayList;
import java.util.List;

public class FileWritingPerfTest {

private static final int ITERATIONS = 5;
private static final double MEG = (Math.pow(1024, 2));
private static final int RECORD_COUNT = 4000000;
private static final String RECORD = "Help I am trapped in a fortune cookie factory\n";
private static final int RECSIZE = RECORD.getBytes().length;

public static void main(String[] args) throws Exception {
    List<String> records = new ArrayList<String>(RECORD_COUNT);
    int size = 0;
    for (int i = 0; i < RECORD_COUNT; i++) {
        records.add(RECORD);
        size += RECSIZE;
    }
    System.out.println(records.size() + " 'records'");
    System.out.println(size / MEG + " MB");
    
    for (int i = 0; i < ITERATIONS; i++) {
        System.out.println("\nIteration " + i);
        
        writeRaw(records);
        writeBuffered(records, 8192);
        writeBuffered(records, (int) MEG);
        writeBuffered(records, 4 * (int) MEG);
    }
}

private static void writeRaw(List<String> records) throws IOException {
    File file = File.createTempFile("foo", ".txt");
    try {
        FileWriter writer = new FileWriter(file);
        System.out.print("Writing raw... ");
        write(records, writer);
    } finally {
        // comment this out if you want to inspect the files afterward
        file.delete();
    }
}

private static void writeBuffered(List<String> records, int bufSize) throws IOException {
    File file = File.createTempFile("foo", ".txt");
    try {
        FileWriter writer = new FileWriter(file);
        BufferedWriter bufferedWriter = new BufferedWriter(writer, bufSize);
    
        System.out.print("Writing buffered (buffer size: " + bufSize + ")... ");
        write(records, bufferedWriter);
    } finally {
        // comment this out if you want to inspect the files afterward
        file.delete();
    }
}

private static void write(List<String> records, Writer writer) throws IOException {
    long start = System.currentTimeMillis();
    for (String record: records) {
        writer.write(record);
    }
    // writer.flush(); // close() should take care of this
    writer.close(); 
    long end = System.currentTimeMillis();
    System.out.println((end - start) / 1000f + " seconds");
}
}
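As an aside, on Java 7+ the same thing can be written with try-with-resources, which closes (and therefore flushes) the writer automatically. Here is a minimal sketch of such a variant of the writeBuffered method above, reusing the same imports; the Java 7 requirement is an assumption beyond the original Java 5 code:

private static void writeBufferedWithResources(List<String> records, int bufSize) throws IOException {
    File file = File.createTempFile("foo", ".txt");
    // the writer is closed automatically at the end of the try block,
    // which also flushes any buffered data
    try (BufferedWriter bufferedWriter = new BufferedWriter(new FileWriter(file), bufSize)) {
        for (String record : records) {
            bufferedWriter.write(record);
        }
    } finally {
        file.delete();
    }
}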

2
@rozario each write call should only produce ~175MB and then delete itself; otherwise you'd end up with 3.5GB of data (175MB x 4 different write calls x 5 iterations). You could check the return value of file.delete() and throw an exception if it's false. - David Moles
Note that writer.flush() is not necessary in this case, because writer.close() flushes implicitly. BTW: best practice recommends using try-with-resources instead of calling close() explicitly. - patryk.beza
3
Note, FWIW, that this was written for Java 5, which at the time wasn't documented to flush on close, and which didn't have try-with-resources. It could probably use updating. - David Moles
2
I just checked the Java 1.1 documentation of Writer.flush(), and it already says "Flush the stream before closing it." So calling flush() before close() has never been necessary. By the way, one reason the BufferedWriter may be useless is that FileWriter, a specialization of OutputStreamWriter, has to have its own buffer anyway to perform the conversion from char sequences to byte sequences in the target encoding. Having more buffering at the front end doesn't help the charset encoder, which still has to flush its smaller byte buffer at its own rate. - Holger
1
True, but the practical impact of the extra buffer, and how to decide whether to use it, has never been well addressed in the documentation or tutorials (as far as I know). Note that the NIO API doesn't even have "Buffered…" counterparts for all channel types. - Holger

43

Try memory-mapped files (it takes 300 ms to write 174 MB on my machine, Core 2 Duo, 2.5GB RAM):

byte[] buffer = "Help I am trapped in a fortune cookie factory\n".getBytes();
int number_of_lines = 400000;

FileChannel rwChannel = new RandomAccessFile("textfile.txt", "rw").getChannel();
ByteBuffer wrBuf = rwChannel.map(FileChannel.MapMode.READ_WRITE, 0, buffer.length * number_of_lines);
for (int i = 0; i < number_of_lines; i++)
{
    wrBuf.put(buffer);
}
rwChannel.close();

What does aMessage.length() represent when you instantiate the ByteBuffer? - Hotel
3
Just to note, this ran in about 140 ms on a MacBook Pro (late 2013), 2.6 GHz Core i7 with an Apple 1TB SSD, for a 185 MB file of 4 million lines. - Egwor
@JerylCook Memory-mapping is useful when you know the exact size. Here we reserve buffer-size * number-of-lines of space up front. - Deepak Agarwal
Thanks! Can I use it for files over 2GB? MappedByteBuffer map(MapMode var1, long var2, long var4) throws an IllegalArgumentException if var4 is greater than 2147483647L: "Size exceeds Integer.MAX_VALUE" - Mikhail Ionkin
What a magic method; it took only 105 ms on a Dell Core i5 (1.6, 2.3) GHz. - FSm
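On the 2GB question above: a single map() call is indeed capped at Integer.MAX_VALUE bytes, but a larger file can be mapped in consecutive chunks. A hedged sketch of that idea (the file name, record count, and 1 GB chunk size are made-up values for illustration):

import java.io.RandomAccessFile;
import java.nio.MappedByteBuffer;
import java.nio.channels.FileChannel;

public class ChunkedMapWrite {
    public static void main(String[] args) throws Exception {
        byte[] record = "Help I am trapped in a fortune cookie factory\n".getBytes();
        long totalLines = 100_000_000L; // hypothetical: ~4.6 GB in total
        long totalBytes = (long) record.length * totalLines;
        long chunkSize = 1L << 30;      // map at most 1 GB at a time

        try (RandomAccessFile raf = new RandomAccessFile("bigfile.txt", "rw");
             FileChannel channel = raf.getChannel()) {
            long written = 0;
            while (written < totalBytes) {
                int mapSize = (int) Math.min(chunkSize, totalBytes - written);
                mapSize -= mapSize % record.length; // split only on record boundaries
                MappedByteBuffer buf = channel.map(FileChannel.MapMode.READ_WRITE, written, mapSize);
                while (buf.remaining() >= record.length) {
                    buf.put(record);
                }
                written += mapSize;
            }
        }
    }
}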

20

Just for the sake of statistics:

The machine is an old Dell, but with a new SSD:

CPU: Intel Pentium D 2.8 GHz

SSD: Patriot Inferno 120GB SSD

4000000 'records'
175.47607421875 MB

Iteration 0
Writing raw... 3.547 seconds
Writing buffered (buffer size: 8192)... 2.625 seconds
Writing buffered (buffer size: 1048576)... 2.203 seconds
Writing buffered (buffer size: 4194304)... 2.312 seconds

Iteration 1
Writing raw... 2.922 seconds
Writing buffered (buffer size: 8192)... 2.406 seconds
Writing buffered (buffer size: 1048576)... 2.015 seconds
Writing buffered (buffer size: 4194304)... 2.282 seconds

Iteration 2
Writing raw... 2.828 seconds
Writing buffered (buffer size: 8192)... 2.109 seconds
Writing buffered (buffer size: 1048576)... 2.078 seconds
Writing buffered (buffer size: 4194304)... 2.015 seconds

Iteration 3
Writing raw... 3.187 seconds
Writing buffered (buffer size: 8192)... 2.109 seconds
Writing buffered (buffer size: 1048576)... 2.094 seconds
Writing buffered (buffer size: 4194304)... 2.031 seconds

Iteration 4
Writing raw... 3.093 seconds
Writing buffered (buffer size: 8192)... 2.141 seconds
Writing buffered (buffer size: 1048576)... 2.063 seconds
Writing buffered (buffer size: 4194304)... 2.016 seconds

As we can see, the raw (unbuffered) method is consistently slower than the buffered ones.


2
However, the buffered method tends to get slower as the size of the text grows. - FSm

5
Your transfer speed is likely not to be limited by Java. Instead, I would suspect (in no particular order):

  1. the speed of transfer from the database
  2. the speed of transfer to the disk

If you read the complete dataset and then write it out to disk, that will take longer, since the JVM will have to allocate the memory, and the db read and disk write will happen sequentially. Instead, I would write out to the buffered writer for every read that you make from the db, so the operation will be closer to a concurrent one (I don't know if you're doing that). A sketch of that streaming approach follows.
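A minimal sketch of that row-by-row streaming, assuming plain JDBC; the connection URL, table, and column names are hypothetical placeholders:

import java.io.BufferedWriter;
import java.io.FileWriter;
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class StreamResultSetToCsv {
    public static void main(String[] args) throws Exception {
        String jdbcUrl = "jdbc:postgresql://localhost/mydb"; // hypothetical URL
        try (Connection conn = DriverManager.getConnection(jdbcUrl);
             Statement stmt = conn.createStatement();
             ResultSet rs = stmt.executeQuery("SELECT col1, col2 FROM big_table");
             BufferedWriter out = new BufferedWriter(new FileWriter("fileName.csv"))) {
            while (rs.next()) {
                // write each record as soon as it is fetched, so the db read
                // and the disk write overlap instead of running strictly in sequence
                out.write(rs.getString("col1"));
                out.write(',');
                out.write(rs.getString("col2"));
                out.write('\n');
            }
        }
    }
}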

4

3

package all.is.well;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import junit.framework.TestCase;

/**
 * @author Naresh Bhabat
 *
 * The following implementation helps to deal with extra-large files in Java.
 * This program has been tested with a 2GB input file.
 * There are some points where extra logic can be added in the future.
 *
 * Please note: to deal with a binary input file, read bytes from the file
 * object instead of reading lines.
 *
 * It uses RandomAccessFile, which is almost like a streaming API.
 *
 * ****************************************
 * Notes regarding the executor framework:
 * ExecutorService executor = Executors.newFixedThreadPool(10);
 *
 *     for 10 threads:    total time required for reading and writing the text: 349.317 seconds
 *     for 100 threads:   total time required for reading and writing the text: 464.042 seconds
 *     for 1000 threads:  total time required for reading and writing the text: 466.538 seconds
 *     for 10000 threads: total time required for reading and writing the text: 479.701 seconds
 */
public class DealWithHugeRecordsinFile extends TestCase {

 static final String FILEPATH = "C:\\springbatch\\bigfile1.txt.txt";
 static final String FILEPATH_WRITE = "C:\\springbatch\\writinghere.txt";
 static volatile RandomAccessFile fileToWrite;
 static volatile RandomAccessFile file;
 static volatile String fileContentsIter;
 static volatile int position = 0;

 public static void main(String[] args) throws IOException, InterruptedException {
  long currentTimeMillis = System.currentTimeMillis();

  try {
   fileToWrite = new RandomAccessFile(FILEPATH_WRITE, "rw");//for random write,independent of thread obstacles 
   file = new RandomAccessFile(FILEPATH, "r");//for random read,independent of thread obstacles 
   seriouslyReadProcessAndWriteAsynch();

  } catch (IOException e) {
   // TODO Auto-generated catch block
   e.printStackTrace();
  }
  Thread currentThread = Thread.currentThread();
  System.out.println(currentThread.getName());
  long currentTimeMillis2 = System.currentTimeMillis();
  double time_seconds = (currentTimeMillis2 - currentTimeMillis) / 1000.0;
  System.out.println("Total time required for reading the text in seconds " + time_seconds);

 }

 /**
  * @throws IOException
  * @throws InterruptedException
  * Something asynchronously serious
  */
 public static void seriouslyReadProcessAndWriteAsynch() throws IOException, InterruptedException {
  ExecutorService executor = Executors.newFixedThreadPool(10); // see the thread-count timings in the class comment
  while (true) {
   String readLine = file.readLine();
   if (readLine == null) {
    break;
   }
   Runnable genuineWorker = new Runnable() {
    @Override
    public void run() {
     // do hard processing here in this thread,i have consumed
     // some time and eat some exception in write method.
     writeToFile(FILEPATH_WRITE, readLine);
     // System.out.println(" :" +
     // Thread.currentThread().getName());

    }
   };
   executor.execute(genuineWorker);
  }
  executor.shutdown();
  // wait for all submitted tasks to finish instead of busy-waiting on isTerminated()
  executor.awaitTermination(Long.MAX_VALUE, TimeUnit.SECONDS);
  System.out.println("Finished all threads");
  file.close();
  fileToWrite.close();
 }

 /**
  * @param filePath
  * @param data
  */
 private static void writeToFile(String filePath, String data) {
  try {
   // fileToWrite.seek(position);
   data = "\n" + data;
   if (!data.contains("Randomization")) {
    return;
   }
   System.out.println("Let us do something time consuming to make this thread busy"+(position++) + "   :" + data);
   System.out.println("Lets consume through this loop");
   int i=1000;
   while(i>0){
   
    i--;
   }
   fileToWrite.write(data.getBytes());
   throw new Exception(); // thrown deliberately to exercise the failure-marking path below
  } catch (Exception exception) {
   System.out.println("exception was thrown but still we are able to proceeed further"
     + " \n This can be used for marking failure of the records");
   //exception.printStackTrace();

  }

 }
}


1
Please add some text explaining why this answer is better than the other answers. Having comments in the code is not enough. - Benjamin Lowry
The reasons this solution is better: it is a real-world scenario and it is in a working state. Its other benefits are that it reads, processes and writes asynchronously; it uses an efficient Java API (i.e. RandomAccessFile); it is thread safe, so multiple threads can read and write on it simultaneously; it causes no memory overhead at runtime and does not crash the system; and it is a versatile solution for dealing with record-processing failures, which can be tracked in the respective threads. Let me know if I can help more. - RAM
2
Thank you, that's the information your post needed. Perhaps consider adding it to the post body :) - Benjamin Lowry
3
If it takes 349.317 seconds to write 2GB of data with 10 threads, that may qualify as one of the slowest ways to write huge amounts of data (unless you meant milliseconds). - Deepak Agarwal

0

For those who want to improve the time for retrieving records and dumping them into a file (i.e., with no processing on the records): instead of putting them into an ArrayList, append those records to a StringBuffer. Apply the toString() method to get a single String and write it to the file at once.

For me, the retrieval time reduced from 22 seconds to 17 seconds.
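A minimal sketch of that idea, assuming the records come from a hypothetical ResultSet rs (StringBuilder, the unsynchronized sibling of StringBuffer, is usually preferable in single-threaded code):

StringBuilder sb = new StringBuilder();
while (rs.next()) {                          // rs is a hypothetical ResultSet
    sb.append(rs.getString(1)).append('\n'); // accumulate everything in memory
}
try (FileWriter writer = new FileWriter("fileName.csv")) {
    writer.write(sb.toString());             // one write call for the whole payload
}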


1
That was just an example creating some dummy "records" - I'd assume that in the real world they come from somewhere else (a database, in the OP's case). But if you do need to read everything into memory first, a StringBuffer may well be faster. A raw String array (String[]) would probably also be faster. - David Moles
Using StringBuffer wastes lots of resources. Most standard Java writers use a StreamEncoder internally, and it has its own 8192-byte buffer. Even if you create one String with all the data, it is transferred as chunks and encoded from char to byte[]. The best solution would be to implement your own Writer that directly uses the write(byte[]) method of FileOutputStream, which uses the underlying native writeBytes method. - krishna T
As @DavidMoles said, the source format of the data also matters a lot in this scenario. If the data is already available as bytes, write it directly to a FileOutputStream. - krishna T
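A minimal sketch of that last suggestion, reusing the sb from the sketch above: encode the characters once and hand the bytes straight to a FileOutputStream, bypassing the Writer/StreamEncoder layer (the UTF-8 choice is an assumption):

import java.io.FileOutputStream;
import java.nio.charset.StandardCharsets;

// ...
byte[] payload = sb.toString().getBytes(StandardCharsets.UTF_8); // encode once
try (FileOutputStream out = new FileOutputStream("fileName.csv")) {
    out.write(payload); // single bulk write, no per-chunk char-to-byte encoding
}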
