How can I use Files.walk to read all files in subdirectories only once?

Question

4

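I am trying to read the files in all subdirectories of a directory. I have written the logic, but for some reason it reads every file twice.

To test my implementation, I created a directory containing three subdirectories, each holding 10 documents, so there should be 30 documents in total.

Here is the code I use to test that the documents are read in correctly: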

String basePath = "src/test/resources/20NG";
Driver driver = new Driver();
List<Document> documents = driver.readInCorpus(basePath);
assertEquals(3 * 10, documents.size());

The code for Driver#readInCorpus is as follows:

public List<Document> readInCorpus(String directory)
{
    try (Stream<Path> paths = Files.walk(Paths.get(directory)))
    {
        return paths
                .filter(Files::isDirectory)
                .map(this::readAllDocumentsInDirectory)
                .flatMap(Collection::stream)
                .collect(Collectors.toList());
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    return Collections.emptyList();
}

private List<Document> readAllDocumentsInDirectory(Path path)
{
    try (Stream<Path> paths = Files.walk(path))
    {
        return paths
                .filter(Files::isRegularFile)
                .map(this::readInDocumentFromFile)
                .collect(Collectors.toList());
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    return Collections.emptyList();
}

private Document readInDocumentFromFile(Path path)
{
    String fileName = path.getFileName().toString();
    String outputClass = path.getParent().getFileName().toString();
    List<String> words = EmailProcessor.readEmail(path);
    return new Document(fileName, outputClass, words);
}

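When I run the test case, assertEquals fails because 60 documents are retrieved instead of 30. While debugging, I saw that every document is inserted into the list once and then inserted again in exactly the same order.

Where in my code am I reading the documents twice?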

- Cache Staheli

2 Answers

3

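It looks like this was a misunderstanding of how Paths and Files.walk work. In Driver#readInCorpus, you have the following stream operation:

return paths
        .filter(Files::isDirectory)
        .map(this::readAllDocumentsInDirectory)
        .flatMap(Collection::stream)
        .collect(Collectors.toList());

Your mapping function (this::readAllDocumentsInDirectory) is applied to every directory in the Files.walk stream, the top-level directory as well as each subdirectory, and it reads all the documents below whichever directory it is given.

This means every file below the starting directory is read once when the starting directory itself is processed, and then read again when the subdirectory that contains it is processed.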
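To make the double counting concrete, here is a minimal sketch that reproduces it with plain file counts instead of Document objects. The class name WalkCountDemo, its helper methods, and the temporary directory layout are assumptions made for illustration; they are not part of the original Driver class.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

class WalkCountDemo {

    // Hypothetical fixture: a temp directory with 3 subdirectories of 10 empty files each.
    static Path createCorpus() {
        try {
            Path root = Files.createTempDirectory("corpus");
            for (int d = 0; d < 3; d++) {
                Path sub = Files.createDirectory(root.resolve("class" + d));
                for (int f = 0; f < 10; f++) {
                    Files.createFile(sub.resolve("doc" + f + ".txt"));
                }
            }
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Counts regular files anywhere below dir (Files.walk is recursive by default).
    static long countFilesBelow(Path dir) {
        try (Stream<Path> paths = Files.walk(dir)) {
            return paths.filter(Files::isRegularFile).count();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Mirrors the question's structure: walk all directories, then walk each one again.
    static long countNested(Path root) {
        try (Stream<Path> paths = Files.walk(root)) {
            return paths.filter(Files::isDirectory)
                        .mapToLong(WalkCountDemo::countFilesBelow)
                        .sum();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Path root = createCorpus();
        System.out.println("nested walk: " + countNested(root));     // 30 + 10 + 10 + 10 = 60
        System.out.println("single walk: " + countFilesBelow(root)); // 30
    }
}
```

The nested version counts 30 files for the root and then 10 more for each of the three subdirectories, which matches the 60 documents seen in the failing test.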

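This is not obvious from looking at the stream, but take a look at "How to debug stream().map(...) with lambda expressions?" for better ways to debug streams and avoid this kind of problem in the future.

This means you can skip the intermediate call to Driver#readAllDocumentsInDirectory and simply do this in Driver#readInCorpus: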

public List<Document> readInCorpus(String directory)
{
    try (Stream<Path> paths = Files.walk(Paths.get(directory)))
    {
        return paths
                .filter(Files::isRegularFile)
                .map(this::readInDocumentFromFile)
                .collect(Collectors.toList());
    }
    catch (IOException e)
    {
        e.printStackTrace();
    }
    return Collections.emptyList();
}

- Cache Staheli

- Orest · Accepted Answer

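The problem lies in how you use Files.walk(path). It traverses your file system like a tree. For example, suppose you have 3 folders: /parent and its 2 subfolders, /parent/first and /parent/second. Files.walk(Paths.get("/parent")) yields a path for each of these folders, the parent and both subfolders, and that is exactly what happens in your readInCorpus method.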
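A small sketch of that point: the stream returned by Files.walk includes the starting directory itself, not just its children. The class name WalkedDirectoriesDemo and the temporary folder layout are assumptions for illustration. The results are sorted because the order in which siblings are visited is not something to rely on.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class WalkedDirectoriesDemo {

    // Hypothetical fixture: a temp "parent" directory containing "first" and "second".
    static Path createTree() {
        try {
            Path parent = Files.createTempDirectory("parent");
            Files.createDirectory(parent.resolve("first"));
            Files.createDirectory(parent.resolve("second"));
            return parent;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Every directory emitted by Files.walk, relative to start ("" is the start itself).
    static List<String> walkedDirectories(Path start) {
        try (Stream<Path> paths = Files.walk(start)) {
            return paths.filter(Files::isDirectory)
                        .map(p -> start.relativize(p).toString())
                        .sorted()
                        .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Prints [, first, second]: three directories, the first being the start directory.
        System.out.println(walkedDirectories(createTree()));
    }
}
```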

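Then, for each Path, you call the second method, readAllDocumentsInDirectory, and the same thing happens there: it also traverses the folder like a tree.

When readAllDocumentsInDirectory is called with /parent, it returns all the documents from both subfolders, /parent/first and /parent/second. Then you have 2 more calls to readAllDocumentsInDirectory, one for /parent/first and one for /parent/second, and each returns the documents from its own folder again.

That is why your documents are duplicated. To fix it, call readAllDocumentsInDirectory only once, with Paths.get(basePath) as the argument, and remove the readInCorpus method.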
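If the two-level structure (one pass over directories, one pass per directory) is worth keeping, one alternative that neither answer mentions is to limit the inner traversal with the maxDepth overload, Files.walk(path, 1), so each directory contributes only the files it directly contains. The sketch below assumes a hypothetical class PerDirectoryWalk and returns Path objects rather than Document instances.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.Stream;

class PerDirectoryWalk {

    // Hypothetical fixture: a temp directory with 3 subdirectories of 10 empty files each.
    static Path createCorpus() {
        try {
            Path root = Files.createTempDirectory("corpus");
            for (int d = 0; d < 3; d++) {
                Path sub = Files.createDirectory(root.resolve("class" + d));
                for (int f = 0; f < 10; f++) {
                    Files.createFile(sub.resolve("doc" + f + ".txt"));
                }
            }
            return root;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Regular files directly inside dir; maxDepth 1 keeps the walk from descending further.
    static List<Path> filesDirectlyIn(Path dir) {
        try (Stream<Path> paths = Files.walk(dir, 1)) {
            return paths.filter(Files::isRegularFile).collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Visits each directory once and collects only the files it directly contains.
    static List<Path> allFiles(Path root) {
        try (Stream<Path> dirs = Files.walk(root)) {
            return dirs.filter(Files::isDirectory)
                       .map(PerDirectoryWalk::filesDirectlyIn)
                       .flatMap(List::stream)
                       .collect(Collectors.toList());
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(allFiles(createCorpus()).size()); // 30
    }
}
```

For simply collecting every file, a single recursive Files.walk filtered with Files::isRegularFile is shorter. The depth-limited form mainly helps when some per-directory processing has to happen.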