Reading a long text file in Java is very slow

Question

10

I have a text file (XML created with XStream) that is 63,000 lines (3.5 MB) long. I'm trying to read it with a BufferedReader:

BufferedReader br = new BufferedReader(new FileReader(file));
try {
    String s = "";
    String tempString;
    int i = 0;
    while ((tempString = br.readLine()) != null) {
        s = s.concat(tempString);
        // s = s + tempString;
        i = i + 1;
        if (i % 1000 == 0) {
            System.out.println(Integer.toString(i));
        }
    }
} finally {
    br.close();
}

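As you can see, I tried to measure the reading speed, and it is very low. After the first 1,000 lines, reading another 10,000 lines takes several seconds. I'm clearly doing something wrong, but I can't work out what. Thanks in advance for any help.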

- lozga

Do you intend to parse this file? Why not load it directly with Xerces/SAX/some other parser? - Visionary Software Solutions

10

Using + and concat on a String is very inefficient when the string is large. Use a StringBuilder, or pass the InputStream/Reader directly to the XML parser. - Paul Grime

Or, if you really need the lines, you can use something like this - http://commons.apache.org/proper/commons-io/javadocs/api-2.4/org/apache/commons/io/IOUtils.html#readLines%28java.io.Reader%29. - Paul Grime

1

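If you need it for XStream, why not pass the reader straight to XStream instead of reading the file yourself and passing the string? - Mark Rotteveel

Oh! I missed the fact that I can pass the file to XStream's fromXML method. Thanks a lot. - lozga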

4 Answers

4

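Here are some things you can improve right away:

Use a StringBuilder instead of concat and +. Using + and concat in a loop can seriously hurt performance.
Reduce the number of disk accesses. You can do that by using a large buffer: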

BufferedReader br = new BufferedReader(new FileReader("someFile.txt"), SIZE);
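Putting both suggestions together might look like the sketch below. The class name BufferedRead, the method readAll, and the 64K value for SIZE are assumptions made for illustration, not part of the original answer; the best buffer size depends on the system. A '\n' is appended after every line because readLine() drops the line terminator.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class BufferedRead {
    // Hypothetical buffer size in chars; BufferedReader's default is 8192.
    static final int SIZE = 64 * 1024;

    /** Reads everything from the source, writing one '\n' after each line. */
    static String readAll(Reader source) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(source, SIZE)) {
            String line;
            while ((line = br.readLine()) != null) {
                // readLine() strips the terminator, so put a newline back.
                sb.append(line).append('\n');
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A StringReader stands in for new FileReader("someFile.txt").
        String text = readAll(new StringReader("<root>\n  <a/>\n</root>"));
        System.out.print(text);
    }
}
```

To read a real file, pass a FileReader (or an InputStreamReader with an explicit charset) instead of the StringReader.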

- Maroun

3

String concatenation in a loop is very slow, even when the individual strings are small, so you should use a StringBuilder.
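A rough way to see why: every concat allocates a new String and copies the whole accumulated text plus the new piece, so the total number of characters copied grows quadratically with the number of lines. The sketch below only counts those copies; the class and method names are hypothetical, and the average line length of 56 characters is an assumption derived from the question's figures (3.5 MB over 63,000 lines).

```java
class ConcatCost {
    /**
     * Counts the characters copied when `lines` lines of `lineLength` chars
     * are appended one by one with s = s.concat(line).
     */
    static long charsCopiedByConcat(int lines, int lineLength) {
        long copied = 0;
        long currentLength = 0;
        for (int i = 0; i < lines; i++) {
            currentLength += lineLength;
            // Each concat copies the old content and the new line into a fresh array.
            copied += currentLength;
        }
        return copied;
    }

    public static void main(String[] args) {
        // 63,000 lines as in the question; 56 chars per line is an assumed average.
        System.out.println(charsCopiedByConcat(63_000, 56));
    }
}
```

For those inputs the count is on the order of 10^11 character copies, whereas a StringBuilder copies each character only a small constant number of times on average.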

Also, try using NIO instead of a BufferedReader.

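import java.io.File;
import java.io.IOException;
import java.io.RandomAccessFile;
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;

public static void main(String[] args) throws IOException {
    final File file = new File(args[0]); // some file; taking the path from args is an assumption
    try (final FileChannel fileChannel = new RandomAccessFile(file, "r").getChannel()) {
        final StringBuilder stringBuilder = new StringBuilder();
        final ByteBuffer byteBuffer = ByteBuffer.allocate(1024);
        final CharBuffer charBuffer = CharBuffer.allocate(1024);
        final CharsetDecoder charsetDecoder = Charset.forName("UTF-8").newDecoder();
        while (fileChannel.read(byteBuffer) > 0) {
            byteBuffer.flip();
            charsetDecoder.decode(byteBuffer, charBuffer, false); // streaming decode
            charBuffer.flip();
            stringBuilder.append(charBuffer);
            charBuffer.clear();
            byteBuffer.compact(); // keep bytes of a character split across reads
        }
    }
}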

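If that is still too slow, you can tune the buffer size; the best value depends heavily on the system. For me it makes almost no difference whether the buffer is 1K or 4K, but on other systems I have seen a change of buffer size improve speed by an order of magnitude.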
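One way to check this on a given machine is to time the same read with several buffer sizes. The following is only a sketch: BufferSizeTiming, read, and tempFileWith are hypothetical names, the generated test file merely imitates the size of the file in the question, and single-run timings like these are noisy (OS file caching and JIT warm-up both affect them). It collects the raw bytes first and decodes once at the end, which sidesteps characters split across chunks, and it assumes the file is smaller than 2 GB.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.ByteBuffer;
import java.nio.channels.FileChannel;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;

class BufferSizeTiming {
    /** Reads the whole file through a chunk buffer of the given size, then decodes it as UTF-8. */
    static String read(Path path, int bufferSize) {
        try (FileChannel channel = FileChannel.open(path, StandardOpenOption.READ)) {
            ByteBuffer all = ByteBuffer.allocate((int) channel.size()); // assumes < 2 GB
            ByteBuffer chunk = ByteBuffer.allocate(bufferSize);
            while (channel.read(chunk) > 0) {
                chunk.flip();
                all.put(chunk);
                chunk.clear();
            }
            all.flip();
            return StandardCharsets.UTF_8.decode(all).toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Writes the content to a temporary UTF-8 file that is deleted when the JVM exits. */
    static Path tempFileWith(String content) {
        try {
            Path path = Files.createTempFile("buffer-size-timing", ".xml");
            path.toFile().deleteOnExit();
            Files.write(path, content.getBytes(StandardCharsets.UTF_8));
            return path;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 63_000; i++) {
            sb.append("<line number=\"").append(i).append("\"/>\n");
        }
        Path path = tempFileWith(sb.toString());
        for (int size : new int[] {1024, 4096, 65536}) {
            long start = System.nanoTime();
            String text = read(path, size);
            long micros = (System.nanoTime() - start) / 1_000;
            System.out.println(size + " bytes: " + micros + " us, " + text.length() + " chars");
        }
    }
}
```

For anything beyond a quick look, a benchmarking harness with warm-up runs would give more trustworthy numbers.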

- Boris the Spider

1

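In addition to what has already been said, and depending on how you use the XML, your code may be incorrect because it discards the line endings. For example, take this code: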

package temp.stackoverflow.q15849706;

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.net.URL;

import com.thoughtworks.xstream.XStream;

public class ReadXmlLines {
    public String read1(BufferedReader br) throws IOException {
        try {
            String s = "";
            String tempString;
            int i = 0;
            while ((tempString = br.readLine()) != null) {
                s = s.concat(tempString);
                // s=s+tempString;
                i = i + 1;
                if (i % 1000 == 0) {
                    System.out.println(Integer.toString(i));
                }
            }
            return s;
        } finally {
            br.close();
        }
    }

    public static void main(String[] args) throws IOException {
        ReadXmlLines r = new ReadXmlLines();

        URL url = ReadXmlLines.class.getResource("xml.xml");
        String xmlStr = r.read1(new BufferedReader(new InputStreamReader(url
                .openStream())));

        Object ob = null;

        XStream xs = new XStream();
        xs.alias("root", Root.class);

        // This is incorrectly read/parsed, as the line endings are not
        // preserved.
        System.out.println("----------1");
        System.out.println(xmlStr);
        ob = xs.fromXML(xmlStr);
        System.out.println(ob);

        // This is correctly read/parsed, when passing in the URL directly
        ob = xs.fromXML(url);
        System.out.println("----------2");
        System.out.println(ob);

        // This is correctly read/parsed, when passing in the InputStream
        // directly
        ob = xs.fromXML(url.openStream());
        System.out.println("----------3");
        System.out.println(ob);
    }

    public static class Root {
        public String script;

        public String toString() {
            return script;
        }
    }
}

It also needs a file named xml.xml on the classpath (in the same package as the class file):

<root>
    <script>
<![CDATA[
// taken from http://www.w3schools.com/xml/xml_cdata.asp
function matchwo(a,b)
{
if (a < b && a < 0) then
  {
  return 1;
  }
else
  {
  return 0;
  }
}
]]>
    </script>
</root>

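Here is the output. The first two lines show that the line endings have been removed, which makes the JavaScript in the CDATA section invalid: with the JS lines merged, the first JS comment now comments out the entire script.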

----------1
<root>    <script><![CDATA[// taken from http://www.w3schools.com/xml/xml_cdata.aspfunction matchwo(a,b){if (a < b && a < 0) then  {  return 1;  }else  {  return 0;  }}]]>    </script></root>
// taken from http://www.w3schools.com/xml/xml_cdata.aspfunction matchwo(a,b){if (a < b && a < 0) then  {  return 1;  }else  {  return 0;  }}    
----------2


// taken from http://www.w3schools.com/xml/xml_cdata.asp
function matchwo(a,b)
{
if (a < b && a < 0) then
  {
  return 1;
  }
else
  {
  return 0;
  }
}
...
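The same effect can be reproduced without XStream. The sketch below uses only the standard library; LineEndingDemo and joinWithoutNewlines are hypothetical names, and the two-line script is a made-up stand-in for the CDATA content above.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;

class LineEndingDemo {
    /** Joins lines the way the question's loop does: readLine() results with nothing in between. */
    static String joinWithoutNewlines(String text) {
        StringBuilder sb = new StringBuilder();
        try (BufferedReader br = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = br.readLine()) != null) {
                sb.append(line); // the line terminator is already gone here
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String script = "// a comment\nreturn 1;\n";
        // prints "// a commentreturn 1;" so the statement is now part of the comment
        System.out.println(joinWithoutNewlines(script));
    }
}
```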

- Paul Grime

- Tom Carchrae · Accepted Answer

@PaulGrime is right. Every time the loop reads a line, the whole string is copied. Once the string gets large (say 10,000 lines), that is a lot of copying.

Try this:

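StringBuilder sb = new StringBuilder();
String tempString;
while ((tempString = br.readLine()) != null) {
    // ... rest of the loop body ...
    sb.append(tempString).append('\n'); // readLine() strips the line terminator, so add it back
}

String s = sb.toString();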

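Note: see Paul's answer for why stripping the line endings makes this a bad way to read the file. Also, as mentioned in the comments on the question, XStream provides a way to read the file itself, and even without that, IOUtils.toString(reader) is a safer way to read the file.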
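If Commons IO is not available, a standard-library sketch in the same spirit copies the reader in fixed-size chunks instead of line by line, so the line terminators pass through untouched. ReaderToString and readFully are hypothetical names, and unlike this sketch, a library helper may leave closing the reader to the caller.

```java
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;
import java.io.UncheckedIOException;

class ReaderToString {
    /** Reads everything from the reader in chunks and closes it; line endings are kept as-is. */
    static String readFully(Reader reader) {
        StringBuilder sb = new StringBuilder();
        char[] chunk = new char[8192];
        try (Reader in = reader) {
            int n;
            while ((n = in.read(chunk)) != -1) {
                sb.append(chunk, 0, n);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // A StringReader stands in for a reader opened on the XML file.
        String xml = readFully(new StringReader("<root>\r\n  <script/>\r\n</root>\n"));
        System.out.print(xml);
    }
}
```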