Quickly reading and merging many files using data.table (with fread)

Question

42

I have several txt files with the same structure. I want to read them into R using fread and then combine them into one bigger dataset.

## First put all file names into a list 
library(data.table)
all.files <- list.files(path = "C:/Users",pattern = ".txt")

## Read data using fread
readdata <- function(fn){
    dt_temp <- fread(fn, sep=",")
    keycols <- c("ID", "date")
    setkeyv(dt_temp,keycols)  # Notice there's a "v" after setkey with multiple keys
    return(dt_temp)

}
# then using 
mylist <- lapply(all.files, readdata)
mydata <- do.call('rbind',mylist)

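The code works, but the speed is not satisfactory. Each txt file has 1M observations and 12 fields.

If I use fread to read a single file, it is fast. But with lapply it becomes extremely slow, and clearly takes more time than reading the files one by one. I wonder what went wrong here; is there any way to improve the speed?

I tried llply from the plyr package, but there was not much of a speed gain.

Also, is there any syntax in data.table for a vertical join, like rbind or SQL's UNION?

Thanks.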

- Bigchao

2 Answers

2

I have rewritten the code for this task many times. I finally wrapped it up into a handy function, shown below.

data.table_fread_mult <- function(filepaths = NULL, dir = NULL, recursive = FALSE, pattern = NULL, fileCol = FALSE, ...){
  # fread multiple filepaths and then combine the results into a single data.table
  # This function has two interfaces: either
  # 1) provide `filepaths` as a character vector of filepaths to read or 
  # 2) provide `dir` (and optionally `pattern` and `recursive`) to identify the directory to read from
  # If fileCol = TRUE, result will include a column called File with the full source file path of each record
  # ... should be arguments to pass on to fread()
  # `pattern` is an optional regular expression to match files (e.g. pattern='csv$' matches files ending with 'csv')
  
  # Scalar conditions, so use the short-circuit operators && and ||
  if(!is.null(filepaths) && (!is.null(dir) || !is.null(pattern))){
    stop("If `filepaths` is given, `dir` and `pattern` should be NULL")
  } else if(is.null(filepaths) && is.null(dir)){
    stop("If `filepaths` is not given, `dir` should be given")
  }
  
  # If filepaths isn't given, build it from dir, recursive, pattern
  if(is.null(filepaths)){
    filepaths <- list.files(
      path = dir, 
      full.names = TRUE, 
      recursive = recursive, 
      pattern = pattern
    )
  }
  
  # Read and combine files
  if(fileCol){
    return(rbindlist(lapply(filepaths, function(x) fread(x, ...)[, File := x]), use.names = TRUE))
  } else{
    return(rbindlist(lapply(filepaths, fread, ...), use.names = TRUE))
  }
}

- Ben

1

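How can I use this function to add a column containing the file name? For example, if my directory contains sample1.txt, sample2.txt, and sample3.txt, I want to read them and merge them into one data.table with V2 holding the file name (e.g., sample1), so that my data looks like: sample1's content, sample1. - Isin Altinkaya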
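One possible way to get such a file-name column, sketched independently of the function above: name each path by its base name without the extension, then let `rbindlist()` copy those names into a column through its `idcol` argument. The temporary directory, the sample files, and the column name `file` are made up for illustration.

```r
library(data.table)

# Create three small example files in a temporary directory (made-up data).
tmp_dir <- file.path(tempdir(), "fread_idcol_demo")
dir.create(tmp_dir, showWarnings = FALSE)
for (s in c("sample1", "sample2", "sample3")) {
  fwrite(data.table(ID = 1:2, value = c(0.1, 0.2)),
         file.path(tmp_dir, paste0(s, ".txt")))
}

paths <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)

# Name each path by its file name without directory or extension.
names(paths) <- tools::file_path_sans_ext(basename(paths))

# lapply() keeps the names, and rbindlist() copies them into the idcol column.
dt <- rbindlist(lapply(paths, fread), idcol = "file")

# Move the file-name column to the end so each row reads: content, file name.
setcolorder(dt, c(setdiff(names(dt), "file"), "file"))
print(dt)
```

With `fileCol = TRUE` the function above stores the full path instead; `basename()` and `tools::file_path_sans_ext()` could be applied to that column afterwards to the same effect.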

1

This is often still too slow. Here are some methods that are 25-50x faster: https://dev59.com/Smgu5IYBdhLWcg3wTFSi#58131427 - webb

- Simon O'Hanlon · Accepted Answer

61

Use rbindlist(), which is designed specifically to rbind a list of data.tables together...

mylist <- lapply(all.files, readdata)
mydata <- rbindlist( mylist )
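On the question's point about SQL-style vertical joins, here is a small sketch on made-up tables `a`, `b`, and `d`. It assumes that `rbindlist()` plays the role of UNION ALL (duplicates kept) and that data.table's `funion()` plays the role of UNION (duplicates dropped); the `fill = TRUE` call illustrates stacking tables whose columns differ.

```r
library(data.table)

# Two small tables with the same structure; one row appears in both.
a <- data.table(ID = 1:2, date = c("2013-01-01", "2013-01-02"), x = c(1.5, 2.5))
b <- data.table(ID = 2:3, date = c("2013-01-02", "2013-01-03"), x = c(2.5, 3.5))

# Vertical concatenation that keeps duplicates, comparable to UNION ALL.
stacked <- rbindlist(list(a, b))

# The do.call() form from the question gives the same rows.
stacked_rbind <- do.call("rbind", list(a, b))

# Duplicate-removing union, comparable to UNION.
distinct_rows <- funion(a, b)

# Tables whose columns differ: match by name and pad missing columns with NA.
d <- data.table(ID = 4L, extra = "note")
padded <- rbindlist(list(a, d), use.names = TRUE, fill = TRUE)

print(stacked)
print(distinct_rows)
print(padded)
```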

As @Roland said, do not set the key on each iteration of your function!

So, in summary, the best approach is:

l <- lapply(all.files, fread, sep=",")
dt <- rbindlist( l )
setkey( dt , ID, date )
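To see why keying each piece is wasted effort, the following sketch writes two tiny comma-separated files (made-up data in a temporary directory), keys every piece before binding, and inspects the key of the combined table. The file names `part1.txt` and `part2.txt` are hypothetical.

```r
library(data.table)

# Write two small comma-separated files (made-up data) to a temporary directory.
tmp_dir <- file.path(tempdir(), "fread_key_demo")
dir.create(tmp_dir, showWarnings = FALSE)
fwrite(data.table(ID = c(2L, 1L), date = c("2013-01-02", "2013-01-01"), x = c(0.2, 0.1)),
       file.path(tmp_dir, "part1.txt"))
fwrite(data.table(ID = c(1L, 3L), date = c("2013-01-03", "2013-01-01"), x = c(0.3, 0.4)),
       file.path(tmp_dir, "part2.txt"))

all.files <- list.files(tmp_dir, pattern = "\\.txt$", full.names = TRUE)

# Key every piece first, then bind: the combined table carries no key,
# so the per-file sorting has to be redone anyway.
keyed_pieces <- lapply(all.files, function(fn) {
  setkeyv(fread(fn, sep = ","), c("ID", "date"))
})
key_after_bind <- key(rbindlist(keyed_pieces))
print(key_after_bind)

# Read, bind, then key once on the full table.
dt <- rbindlist(lapply(all.files, fread, sep = ","))
setkey(dt, ID, date)
print(key(dt))
```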

- Simon O'Hanlon

4

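It is better to set the key only once, at the end. - Roland

@SimonO'Hanlon, thanks a lot. Is a for loop faster than lapply? - Bigchao

@Bigchao Not sure. But if you think about it, what do you expect 99.999% of your processing time to be? Is it the computational overhead of for or lapply, or is it reading 1e6 observations of data? In this case it is completely arbitrary. I think a for loop might manage memory better, and it is certainly no worse than lapply. There is no speed difference between the two. - Simon O'Hanlon

@SimonO'Hanlon Thanks a lot :) - Bigchao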
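For comparison, a minimal sketch of the two looping styles side by side. It uses in-memory text through `fread(text = ...)` as a stand-in for real files, and `csv_chunks` is a made-up input; with actual files the paths would be passed to `fread()` instead.

```r
library(data.table)

# Stand-ins for file contents: each element holds the lines of one "file".
csv_chunks <- list(
  c("ID,x", "1,0.1", "2,0.2"),
  c("ID,x", "3,0.3", "4,0.4")
)

# lapply version.
via_lapply <- rbindlist(lapply(csv_chunks, function(lines) fread(text = lines)))

# for-loop version: pre-allocate the list, fill it, bind once at the end.
pieces <- vector("list", length(csv_chunks))
for (i in seq_along(csv_chunks)) {
  pieces[[i]] <- fread(text = csv_chunks[[i]])
}
via_for <- rbindlist(pieces)

# Both styles produce the same table.
print(all.equal(via_lapply, via_for))
```

Either way, the parsing inside `fread()` dominates the run time, so the choice between the two is mostly a matter of style.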

1

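If you are calling files outside your working directory, make sure to add full.names = TRUE to list.files(), e.g. list.files(path = "C:/Users",pattern = ".txt",full.names=TRUE). This prepends the full file path to each file name, so that lapply can locate and operate on each file. - TheSciGuy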
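A small sketch of the difference, using a throwaway directory with made-up file names. It also illustrates a related point that is easy to miss: `pattern` is a regular expression, so `".txt"` matches more than the file extension.

```r
# Create a throwaway directory containing one .txt file and one decoy file.
tmp_dir <- file.path(tempdir(), "list_files_demo")
dir.create(tmp_dir, showWarnings = FALSE)
writeLines(c("ID,date", "1,2013-01-01"), file.path(tmp_dir, "data1.txt"))
writeLines("not a data file", file.path(tmp_dir, "notes_txt.md"))

# Without full.names, only bare file names come back; they resolve
# only if tmp_dir happens to be the working directory.
bare <- list.files(path = tmp_dir, pattern = ".txt")
print(bare)

# The unescaped dot matches any character, so "_txt" in the decoy matched too.
# Escaping the dot and anchoring the end restricts the match to the extension,
# and full.names = TRUE returns paths that can be read from anywhere.
full <- list.files(path = tmp_dir, pattern = "\\.txt$", full.names = TRUE)
print(full)
print(file.exists(full))
```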

This is often still too slow. Here are some methods that are 25-50x faster: https://dev59.com/Smgu5IYBdhLWcg3wTFSi#58131427 - webb