Reading a large amount of data from a file in Java

Question

19

I have a text file that contains 1 000 002 numbers in the following format:

123 456
1 2 3 4 5 6 .... 999999 100000

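Now I need to read this data and assign the first two numbers to int variables, and all the remaining numbers (1,000,000 of them) to an int[] array.

It is not a hard task, but it is horribly slow.

My first attempt was java.util.Scanner: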

 Scanner stdin = new Scanner(new File("./path"));
 int n = stdin.nextInt();
 int t = stdin.nextInt();
 int array[] = new int[n];

 for (int i = 0; i < n; i++) {
     array[i] = stdin.nextInt();
 }

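It works as expected, but it takes about 7500 ms to execute. I need to fetch this data in at most a few hundred milliseconds.

Then I tried `java.io.BufferedReader`:

Using BufferedReader.readLine() and String.split(), I got the same result in about 1700 ms, but that is still too long.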
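The question does not show the code for this attempt. A minimal sketch of what a readLine()/split() reader might look like is below; the class name `SplitReadSketch`, the `Data` holder, and the in-memory sample input are assumptions made for illustration, not the asker's actual code.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical reconstruction of the readLine()/split() approach.
final class SplitReadSketch {

    /** Simple holder for the two header values and the parsed array. */
    static final class Data {
        final int n;
        final int t;
        final int[] array;

        Data(int n, int t, int[] array) {
            this.n = n;
            this.t = t;
            this.array = array;
        }
    }

    static Data read(Reader source) throws IOException {
        try (BufferedReader br = new BufferedReader(source)) {
            // First line: "n t"
            String[] header = br.readLine().trim().split(" ");
            int n = Integer.parseInt(header[0]);
            int t = Integer.parseInt(header[1]);

            // Second line: n integers separated by single spaces.
            // split() allocates one String per number before any parsing happens.
            String[] tokens = br.readLine().trim().split(" ");
            int[] array = new int[n];
            for (int i = 0; i < n; i++) {
                array[i] = Integer.parseInt(tokens[i]);
            }
            return new Data(n, t, array);
        }
    }

    public static void main(String[] args) throws IOException {
        // For a real file, pass new java.io.FileReader("./path") instead.
        Data d = read(new StringReader("3 456\n10 20 30\n"));
        System.out.println(d.n + " " + d.t + " " + d.array[2]); // prints "3 456 30"
    }
}
```

With a million numbers on one line, this approach has to hold the whole line plus a million small String objects in memory at once, which is one plausible reason it stays well above the target time.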

How can I read this amount of data in less than 1 second? The final result should be equivalent to:

int n = 123;
int t = 456;
int array[] = { 1, 2, 3, 4, ..., 999999, 100000 };

Following trashgod's answer:

The StreamTokenizer solution is faster (it takes about 1400 ms), but it is still too slow:

StreamTokenizer st = new StreamTokenizer(new FileReader("./test_grz"));
st.nextToken();
int n = (int) st.nval;

st.nextToken();
int t = (int) st.nval;

int array[] = new int[n];

for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
    array[i] = (int) st.nval;
}

Note: no validation is needed. I am 100% sure that the data in the ./test_grz file is correct.

- Crozin

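If you are going to move them around in a list or sort them, why not store them in a LinkedList? If you need random access, you could use an ArrayList (depending on how you use them). That is a lot of data, and I assume you will use it later. - Humphrey Bogart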

1

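I have changed my question - it is only about reading from the file. ;) I don't need any collection - a plain array is all I really need, but the problem is how to fill this array with data from the file as fast as possible. - Crozin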

2

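How much of the time is spent allocating the large array (nearly 4 MB) versus parsing? Can you allocate the array outside of that call? Assuming you are using Integer.parseInt, have you looked for other libraries that might have integer parsing optimized for base 10? This one claims to: http://www.cs.ou.edu/~weaver/improvise/downloads/javadoc/oblivion/oblivion/util/NumberUtilities.html#parseInt%28java.lang.String%29 - NG.

Just thinking out loud - could you create a separate program that splits the file into several files, then read the data using separate threads and merge the results at the end? It would make the whole design more complex, though, and I am not sure it would be any faster. - JRL

@JRL: That's not possible. Everything has to happen inside this program, in the simplest, most primitive, but fast way. - Crozin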

7 Answers

2

Using a BufferedReader can cut down the time of the StreamTokenizer result:

Reader r = null;
try {
    r = new BufferedReader(new FileReader(file));
    final StreamTokenizer st = new StreamTokenizer(r);
    ...
} finally {
    if (r != null)
        r.close();
}

Also, don't forget to close your files, as I have shown here.

You can save even more time by using a custom tokenizer written just for your purposes:

import java.io.EOFException;
import java.io.IOException;
import java.io.Reader;

public class CustomTokenizer {

    private final Reader r;

    public CustomTokenizer(final Reader r) {
        this.r = r;
    }

    public int nextInt() throws IOException {
        int i = r.read();
        if (i == -1)
            throw new EOFException();

        char c = (char) i;

        // Skip any whitespace
        while (c == ' ' || c == '\n' || c == '\r') {
            i = r.read();
            if (i == -1)
                throw new EOFException();
            c = (char) i;
        }

        int result = (c - '0');
        while ((i = r.read()) >= 0) {
            c = (char) i;
            if (c == ' ' || c == '\n' || c == '\r')
                break;
            result = result * 10 + (c - '0');
        }

        return result;
    }

}

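Remember to use a BufferedReader with this. This custom tokenizer assumes the input data is always completely valid and contains only spaces, newlines, and digits.

If you read these results a lot and they do not change often, you should probably save the array and keep track of the file's last-modified time. Then, if the file has not changed, just use the cached copy of the array, which will speed up the results significantly. For example: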

public class ArrayRetriever {

    private File inputFile;
    private long lastModified;
    private int[] lastResult;

    public ArrayRetriever(File file) {
        this.inputFile = file;
    }

    public int[] getResult() {
        if (lastResult != null && inputFile.lastModified() == lastModified)
            return lastResult;

        lastModified = inputFile.lastModified();

        // do logic to actually read the file here

        lastResult = array; // the array variable from your examples
        return lastResult;
    }

}

- Kevin Brock

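Thanks for your answer - I'll check it tomorrow - I hope this is what I'm looking for. - Crozin

It may be worthwhile to specify the buffer size when constructing the BufferedReader. +1 - trashgod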
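A rough sketch of that suggestion, using the two-argument BufferedReader constructor under a StreamTokenizer. The class name `BufferSizeSketch` and the 64 KiB figure are assumptions chosen for illustration; the thread reports no measurements for any particular buffer size.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StreamTokenizer;
import java.io.StringReader;

// Hypothetical example: StreamTokenizer over a BufferedReader with an explicit buffer size.
final class BufferSizeSketch {

    // Arbitrary illustrative value, not a measured optimum.
    static final int BUFFER_SIZE = 64 * 1024;

    static int[] readAll(Reader source, int count) throws IOException {
        try (BufferedReader br = new BufferedReader(source, BUFFER_SIZE)) {
            StreamTokenizer st = new StreamTokenizer(br);
            int[] out = new int[count];
            // By default StreamTokenizer parses numbers into the double field nval.
            for (int i = 0; i < count && st.nextToken() != StreamTokenizer.TT_EOF; i++) {
                out[i] = (int) st.nval;
            }
            return out;
        }
    }

    public static void main(String[] args) throws IOException {
        // For a real file, pass new java.io.FileReader(file) instead.
        int[] values = readAll(new StringReader("123 456\n1 2 3"), 5);
        System.out.println(values[0] + " " + values[4]); // prints "123 3"
    }
}
```

Whether a larger buffer helps depends on the machine and the file system, so it is worth timing a few sizes rather than assuming a gain.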

2

StreamTokenizer may be faster, as suggested here.

- trashgod

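Indeed, StreamTokenizer looks like the fastest solution so far (see the update to my question). But it still takes about 1400 ms to read the necessary data. - Crozin

Excellent. See also @Kevin Brock's informative answer: https://dev59.com/YnE85IYBdhLWcg3wnU0d#2694507 - trashgod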

1

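How much memory does your computer have? You could be running into GC issues.

If possible, the best thing to do is process the data one line at a time. Don't load it into an array. Load only what you need, process it, write it out, and continue.

This will reduce your memory footprint while still using the same amount of file IO.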

- Pyrolistical

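It looks like his second line is one very long line containing a million numbers. - NG.

If my calculation is correct, 1 million ints need only 7 MB of memory - that is not much. I just need to load this data from the file into memory - I need to do calculations on the whole data set. - Crozin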

1

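If it is possible to reformat the input so that each integer is on a separate line (instead of one long line with a million integers), then Integer.parseInt(BufferedReader.readLine()) should perform better, because buffering works naturally by line and there is no need to split one long string into a separate array of strings.

Edit: I tested this and managed to read the output of seq 1 1000000 into an int array in well under half a second, but of course this depends on the machine.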
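A minimal sketch of the one-integer-per-line variant described above. It assumes the reformatted input (one number per line, no header line), which is not the asker's format; the class name `LinePerIntSketch` and the `count` parameter are illustrative.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical example: one integer per line, parsed with Integer.parseInt.
final class LinePerIntSketch {

    static int[] read(Reader source, int count) throws IOException {
        try (BufferedReader br = new BufferedReader(source)) {
            int[] out = new int[count];
            for (int i = 0; i < count; i++) {
                // Each readLine() returns one short string, so no large
                // intermediate String[] is ever built.
                out[i] = Integer.parseInt(br.readLine());
            }
            return out;
        }
    }

    public static void main(String[] args) throws IOException {
        // For a real file, pass new java.io.FileReader(path) instead.
        int[] values = read(new StringReader("1\n2\n3\n"), 3);
        System.out.println(values[0] + values[1] + values[2]); // prints "6"
    }
}
```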

- Arkku

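Unfortunately, I cannot change the file format. The first line has to be two integers separated by a single space, and the second line has to be 1 million integers (also separated by single spaces). - Crozin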

0

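Using a StreamTokenizer on top of a BufferedReader gets you quite good performance. You shouldn't need to write your own readInt() function.

Here is the code I used for some local performance testing: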

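import java.io.BufferedInputStream;
import java.io.BufferedReader;
import java.io.File;
import java.io.FileInputStream;
import java.io.FileReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.StreamTokenizer;
import java.util.Scanner;

/**
 * Created by zhenhua.xu on 11/27/16.
 */
public class MyReader {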

private static final String FILE_NAME = "./1m_numbers.txt";
private static final int n = 1000000;

public static void main(String[] args) {
    try {
        readByScanner();
        readByStreamTokenizer();
        readByStreamTokenizerOnBufferedReader();
        readByBufferedInputStream();
    } catch (Exception e) {
        e.printStackTrace();
    }
}

public static void readByScanner() throws Exception {
    long startTime = System.currentTimeMillis();

    Scanner stdin = new Scanner(new File(FILE_NAME));
    int array[] = new int[n];
    for (int i = 0; i < n; i++) {
        array[i] = stdin.nextInt();
    }

    long endTime = System.currentTimeMillis();
    System.out.println(String.format("Total time by Scanner: %d ms", endTime - startTime));
}

public static void readByStreamTokenizer() throws Exception {
    long startTime = System.currentTimeMillis();

    StreamTokenizer st = new StreamTokenizer(new FileReader(FILE_NAME));
    int array[] = new int[n];

    for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
        array[i] = (int) st.nval;
    }

    long endTime = System.currentTimeMillis();
    System.out.println(String.format("Total time by StreamTokenizer: %d ms", endTime - startTime));
}

public static void readByStreamTokenizerOnBufferedReader() throws Exception {
    long startTime = System.currentTimeMillis();

    StreamTokenizer st = new StreamTokenizer(new BufferedReader(new FileReader(FILE_NAME)));
    int array[] = new int[n];

    for (int i = 0; st.nextToken() != StreamTokenizer.TT_EOF; i++) {
        array[i] = (int) st.nval;
    }

    long endTime = System.currentTimeMillis();
    System.out.println(String.format("Total time by StreamTokenizer with BufferedReader: %d ms", endTime - startTime));
}

public static void readByBufferedInputStream() throws Exception {
    long startTime = System.currentTimeMillis();

    BufferedInputStream bis = new BufferedInputStream(new FileInputStream(FILE_NAME));
    int array[] = new int[n];
    for (int i = 0; i < n; i++) {
        array[i] = readInt(bis);
    }

    long endTime = System.currentTimeMillis();
    System.out.println(String.format("Total time with BufferedInputStream: %d ms", endTime - startTime));
}

private static int readInt(InputStream in) throws IOException {
    int ret = 0;
    boolean dig = false;

    for (int c = 0; (c = in.read()) != -1; ) {
        if (c >= '0' && c <= '9') {
            dig = true;
            ret = ret * 10 + c - '0';
        } else if (dig) break;
    }

    return ret;
}

}

Results I got:

Total time by Scanner: 789 ms
Total time by StreamTokenizer: 226 ms
Total time by StreamTokenizer with BufferedReader: 80 ms
Total time with BufferedInputStream: 95 ms

- Zhenhua Xu

0

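I would extend FilterReader and parse the string as it is read in the read() method. Have a getNextNumber method return the numbers. Code left as an exercise for the reader.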
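One possible reading of that exercise, as a hedged sketch: a FilterReader subclass whose getNextNumber() pulls characters through the inherited read() and accumulates digits. The class name `NumberReader` and the choice to parse inside getNextNumber() rather than by overriding read() are assumptions; like the other hand-written parsers in this thread, it handles only non-negative integers and valid input.

```java
import java.io.BufferedReader;
import java.io.EOFException;
import java.io.FilterReader;
import java.io.IOException;
import java.io.Reader;
import java.io.StringReader;

// Hypothetical FilterReader subclass that returns integers one at a time.
final class NumberReader extends FilterReader {

    NumberReader(Reader in) {
        super(in);
    }

    /** Returns the next non-negative integer, or throws EOFException if none is left. */
    int getNextNumber() throws IOException {
        int c = read();
        // Skip anything that is not a digit (spaces, newlines).
        while (c != -1 && (c < '0' || c > '9')) {
            c = read();
        }
        if (c == -1) {
            throw new EOFException("no more numbers");
        }
        int result = 0;
        // Accumulate digits; the loop also stops at end of stream (-1).
        while (c >= '0' && c <= '9') {
            result = result * 10 + (c - '0');
            c = read();
        }
        return result;
    }

    public static void main(String[] args) throws IOException {
        // Wrap a BufferedReader so that each read() is served from memory.
        try (NumberReader nr = new NumberReader(new BufferedReader(new StringReader("123 456\n1 2 3")))) {
            int n = nr.getNextNumber();
            int t = nr.getNextNumber();
            System.out.println(n + " " + t + " " + nr.getNextNumber()); // prints "123 456 1"
        }
    }
}
```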

- Skip Head

- Crozin · Accepted Answer

Thanks for every answer, but I have already found a method that meets my criteria:

BufferedInputStream bis = new BufferedInputStream(new FileInputStream("./path"));
int n = readInt(bis);
int t = readInt(bis);
int array[] = new int[n];
for (int i = 0; i < n; i++) {
    array[i] = readInt(bis);
}

private static int readInt(InputStream in) throws IOException {
    int ret = 0;
    boolean dig = false;

    for (int c = 0; (c = in.read()) != -1; ) {
        if (c >= '0' && c <= '9') {
            dig = true;
            ret = ret * 10 + c - '0';
        } else if (dig) break;
    }

    return ret;
}

It takes only about 300 ms to read 1 million integers!
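The fragment above mixes statements and a method, so it does not compile on its own. A sketch of how it might be wrapped into a complete class follows; the class name `FastIntReader`, the `Result` holder, and the in-memory sample input are assumptions added for illustration, while readInt() is unchanged from the answer. The stream is also closed with try-with-resources, which the original fragment omits.

```java
import java.io.BufferedInputStream;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

// Hypothetical wrapper around the readInt() approach from the answer.
final class FastIntReader {

    /** Holder for the two header values and the array. */
    static final class Result {
        final int n;
        final int t;
        final int[] array;

        Result(int n, int t, int[] array) {
            this.n = n;
            this.t = t;
            this.array = array;
        }
    }

    static Result load(InputStream raw) throws IOException {
        try (InputStream in = new BufferedInputStream(raw)) {
            int n = readInt(in);
            int t = readInt(in);
            int[] array = new int[n];
            for (int i = 0; i < n; i++) {
                array[i] = readInt(in);
            }
            return new Result(n, t, array);
        }
    }

    // Reads bytes until a run of ASCII digits ends; assumes valid, non-negative input.
    private static int readInt(InputStream in) throws IOException {
        int ret = 0;
        boolean dig = false;

        for (int c = 0; (c = in.read()) != -1; ) {
            if (c >= '0' && c <= '9') {
                dig = true;
                ret = ret * 10 + c - '0';
            } else if (dig) break;
        }

        return ret;
    }

    public static void main(String[] args) throws IOException {
        // For a real file, pass new java.io.FileInputStream("./path") instead.
        byte[] sample = "3 456\n10 20 30".getBytes(StandardCharsets.US_ASCII);
        Result r = load(new ByteArrayInputStream(sample));
        System.out.println(r.n + " " + r.t + " " + r.array[2]); // prints "3 456 30"
    }
}
```

Working on raw bytes skips character decoding entirely, which is safe here only because the file is known to contain ASCII digits and whitespace.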
