How can I use fread() like readLines(), without automatic column detection?

12

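I have a 5 GB .dat file (> 10 million lines). Each line has a format like aaaa bb cccc0123 xxx kkkkkkkkkkkkkk or aaaaabbbcccc01234xxxkkkkkkkkkkkkkk. Because readLines performs poorly on large files, I chose fread() to read it instead, but I got an error: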

library("data.table")
x <- fread("test.DAT")
Error in fread("test.DAT") : 
  Expecting 5 cols, but line 5 contains text after processing all cols. It is very likely that this is due to one or more fields having embedded sep=' ' and/or (unescaped) '\n' characters within unbalanced unescaped quotes. fread cannot handle such ambiguous cases and those lines may not have been read in as expected. Please read the section on quotes in ?fread.
In addition: Warning message:
In fread("test.DAT") :
  Unable to find 5 lines with expected number of columns (+ middle)

How can I use fread() like readLines(), without automatic column detection? Or is there another way to solve this problem?

1 Answer

27

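Here's a trick. You can use a sep value that you know does not occur in the file. This forces fread() to read each whole line as a single column. We can then convert that column to an atomic vector (that is what [[1L]] does below). Here is an example on a CSV file, using ? as the sep. It works like readLines(), only much faster.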

f <- fread("Batting.csv", sep= "?", header = FALSE)[[1L]]
head(f)
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"       
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"  
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,," 
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,,"

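Other uncommon characters you can try for sep include \ ^ @ # = and so on. As we can see, this produces the same output as readLines(). You just need to find a sep value that does not occur in the file.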

head(readLines("Batting.csv"))
# [1] "playerID,yearID,stint,teamID,lgID,G,AB,R,H,2B,3B,HR,RBI,SB,CS,BB,SO,IBB,HBP,SH,SF,GIDP"
# [2] "abercda01,1871,1,TRO,NA,1,4,0,0,0,0,0,0,0,0,0,0,,,,,"                                  
# [3] "addybo01,1871,1,RC1,NA,25,118,30,32,6,0,0,13,8,1,4,0,,,,,"                             
# [4] "allisar01,1871,1,CL1,NA,29,137,28,40,4,5,0,19,3,1,2,5,,,,,"                            
# [5] "allisdo01,1871,1,WS3,NA,27,133,28,44,10,2,2,27,1,1,0,2,,,,,"                           
# [6] "ansonca01,1871,1,RC1,NA,25,120,29,39,11,3,0,16,6,2,2,1,,,,," 
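
One way to pick such a sep value is to test a few candidates against the start of the file. The following is only a rough sketch: the demo file, the candidate list and the 1000-line sample size are assumptions made for illustration, and a character missing from the sample may still appear further down a large file.

```r
# Hypothetical demo file standing in for the real data.
path <- tempfile(fileext = ".csv")
writeLines(c("id,name,score", "1,a?b,3.5", "2,c#d,4.0"), path)

# Candidate separators, all single characters.
candidates <- c("?", "~", "^", "@", "#", "=", "\\")

# Inspect only the first lines, so this is a heuristic, not a guarantee.
sample_lines <- readLines(path, n = 1000L)

# fixed = TRUE makes grepl() match each character literally, not as a regex.
is_used <- vapply(
  candidates,
  function(ch) any(grepl(ch, sample_lines, fixed = TRUE)),
  logical(1)
)

# Candidates that never occur in the sampled lines.
unused <- candidates[!is_used]
print(unused)
```

Any element of unused can then be tried as the sep argument of fread().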

Note: As @Cath mentioned in the comments, you can also simply use the newline character \n as the sep value.
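
A minimal sketch of the newline variant, using a small temporary file in place of the 5 GB test.DAT from the question. The file contents are made up, and colClasses and quote are optional extras added here on the assumption that you want plain strings with no type guessing or quote handling.

```r
library(data.table)

# Hypothetical demo file with lines shaped like those in the question.
path <- tempfile(fileext = ".DAT")
writeLines(
  c("aaaa bb cccc0123 xxx kkkkkkkkkkkkkk",
    "aaaaabbbcccc01234xxxkkkkkkkkkkkkkk",
    "dddd ee ffff4567 yyy mmmmmmmmmmmmmm"),
  path
)

# sep = "\n" leaves nothing to split a line on, so each line is one field.
# header = FALSE keeps the first line as data instead of a column name.
x <- fread(
  path,
  sep = "\n",
  header = FALSE,
  colClasses = "character",
  quote = ""
)[[1L]]

str(x)
```

Results can still differ from readLines() in edge cases such as blank lines or leading and trailing whitespace, so it is worth comparing both on a sample of the real file first.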


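2
This should get a lot of upvotes. Nice trick; in my case sep='~' actually worked. - marbel
1
Why not use sep="\n"? - Cath
@Cath - Yes, I guess that would work too. - Rich Scriven
I know this thread is old, but what do I do with these lines now? Is there an efficient way to convert them to a data.table? - lukehawk
@lukehawk - If you have a character vector like the one above, you can use fread(paste(f, collapse = "\n")). Otherwise, I would just read directly from the file with fread. - Rich Scriven
@lukehawk, fread(paste(f, collapse = "\n")) takes a long time to run, even with fewer than 100,000 objects. - Lazarus Thurston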
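
On the last two comments: a possible alternative to pasting the lines into one string is to pass the character vector through fread()'s text argument. This is a sketch only. It assumes a data.table version that provides text=, the vector f below is made-up sample data, and whether it is actually faster on large inputs is something to time yourself.

```r
library(data.table)

# Hypothetical character vector, one CSV line per element.
f <- c(
  "playerID,yearID,stint,teamID",
  "abercda01,1871,1,TRO",
  "addybo01,1871,1,RC1"
)

# text= takes the lines directly, so no paste(..., collapse = "\n") is needed.
dt <- fread(text = f)
print(dt)
```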
