I have a wrapper around a BufferedReader that reads files in sequence, to create an uninterrupted stream across multiple files:
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.Reader;
import java.util.ArrayList;
import java.util.zip.GZIPInputStream;
/**
 * Reads in a whole bunch of files such that when one ends it moves to the
 * next file.
 *
 * @author isaak
 */
class LogFileStream implements FileStreamInterface {

    private ArrayList<String> fileNames;
    private BufferedReader br;
    private boolean done = false;

    /**
     * @param files an array list of files to read from, order matters.
     * @throws IOException
     */
    public LogFileStream(ArrayList<String> files) throws IOException {
        fileNames = new ArrayList<String>();
        for (int i = 0; i < files.size(); i++) {
            fileNames.add(files.get(i));
        }
        setFile();
    }

    /**
     * Advances the file that this class is reading from.
     *
     * @throws IOException
     */
    private void setFile() throws IOException {
        if (fileNames.size() == 0) {
            this.done = true;
            return;
        }
        if (br != null) {
            br.close();
        }
        // If the file is a .gz file do a little extra work;
        // otherwise read it in with a standard FileReader.
        // In either case, set the buffer size to 128kb.
        if (fileNames.get(0).endsWith(".gz")) {
            InputStream fileStream = new FileInputStream(fileNames.get(0));
            InputStream gzipStream = new GZIPInputStream(fileStream);
            // TODO this probably needs to be modified to work well on any
            // platform, UTF-8 is standard for debian/novastar though.
            Reader decoder = new InputStreamReader(gzipStream, "UTF-8");
            // Note that the buffer size is set to 128kb instead of the
            // standard 8kb.
            br = new BufferedReader(decoder, 131072);
            fileNames.remove(0);
        } else {
            FileReader fileReader = new FileReader(fileNames.get(0));
            br = new BufferedReader(fileReader, 131072);
            fileNames.remove(0);
        }
    }

    /**
     * Returns true if there are more lines available to read.
     *
     * @return true if there are more lines available to read.
     */
    public boolean hasMore() {
        return !done;
    }

    /**
     * Gets the next line from the correct file.
     *
     * @return the next line from the files; if there isn't one it returns null
     * @throws IOException
     */
    public String nextLine() throws IOException {
        if (done == true) {
            return null;
        }
        String line = br.readLine();
        if (line == null) {
            setFile();
            return nextLine();
        }
        return line;
    }
}
If I build this object over a large list of files (300MB worth of files) and then print nextLine() over and over in a while loop, performance steadily degrades until there is no more RAM available. This happens even when the files I read in are around 500kb and I run in a VM with 32MB of memory. I want this code to be able to run over extremely large data sets (hundreds of GB worth of files), and it is one component of a program that needs to run in 32MB of memory or less.
The files being used are mostly labeled as CSV files and are gzip-compressed on disk, which is why the reader needs to handle both gzipped and uncompressed files.
If I understand correctly, once a file has been read through and its lines emitted, the objects associated with that file and everything else tied to it should be eligible for garbage collection, right?
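One way I've thought about checking that assumption is with a `WeakReference`, which gets cleared once its referent has no strong references left (a sketch; `System.gc()` is only a hint to the JVM, so the loop retries a few times):

```java
import java.io.BufferedReader;
import java.io.StringReader;
import java.lang.ref.WeakReference;

public class GcCheck {
    public static void main(String[] args) throws Exception {
        // A reader with the same oversized 128kb buffer as in setFile().
        BufferedReader old = new BufferedReader(new StringReader("some data"), 131072);
        WeakReference<BufferedReader> ref = new WeakReference<BufferedReader>(old);

        old.close();
        old = null; // drop the only strong reference, as setFile() effectively does when it reassigns br

        // System.gc() is only a hint, so retry until the reference clears.
        for (int i = 0; i < 50 && ref.get() != null; i++) {
            System.gc();
            Thread.sleep(10);
        }
        System.out.println(ref.get() == null ? "collected" : "still reachable");
    }
}
```

If the old reader and its buffer really do become unreachable after reassignment, this should report that they were collected.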
How is C++ relevant here? – Galik

You can replace that constructor loop with fileNames.addAll(files);. – Kayaman