fread() errors and strange behaviour when reading a CSV


I am trying to read a 540 MB csv file with fread() from the data.table package, but I get the following error message:

' ends field 36 on line 4 when detecting types: 20.00,8/25/2006 0:00:00,"07:05:00 PM","CST",143.00,"OTTAWA","KS","HAIL",1.00,"S","MINNEAPOLIS",8/25/2006 0:00:00,"07:05:00 PM",0.00,,1.00,"S","MINNEAPOLIS",0.00,0.00,,88.00,0.00,0.00,0.00,,0.00,,"TOP","KANSAS, East",,3907.00,9743.00,3907.00,9743.00,"Dime to nickel sized hail.

I don't know what is causing the error, and I would like to find out whether it is a bug or a data formatting issue that I can handle by tweaking fread().

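I managed to read the csv file with read.csv(), and decided to track down the line that triggers the error above (it is at line 617174, not line 4 as the error message says). I then extracted that line together with the one before and the one after it, and wrote them to testout.csv with write.csv().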

I can read testout.csv back with read.csv(), which creates a data frame of 3 observations, as expected. Reading testout.csv with fread(), however, gives a data table containing only the last row.
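The round trip described above can be sketched in base R roughly as follows. This is only an illustration under stated assumptions: the three-column file and its temporary path are hypothetical stand-ins for the real 540 MB file, and the record is located by REFNUM rather than by line number.

```r
# Hypothetical stand-in for the large file: only three of the 37 columns,
# with a line break inside the quoted REMARKS field of the middle record.
src <- tempfile(fileext = ".csv")
writeLines(c(
  '"EVTYPE","REMARKS","REFNUM"',
  '"TSTM WIND",".",617129',
  '"HAIL","Dime to nickel sized hail.',
  '.",617130',
  '"TSTM WIND","",617131'
), src)

# read.csv() copes with the quoted field that spans two physical lines.
full <- read.csv(src, stringsAsFactors = FALSE)

# Keep the record of interest plus one neighbour on each side.
i <- which(full$REFNUM == 617130)
subset_rows <- full[(i - 1):(i + 1), ]

# Write the three records out and read them back in.
out <- file.path(tempdir(), "testout.csv")
write.csv(subset_rows, out, row.names = FALSE)
back <- read.csv(out, stringsAsFactors = FALSE)
print(nrow(back))
```

Checking `grepl("\n", back$REMARKS, fixed = TRUE)` afterwards is one way to see which record carries the embedded line break.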

The four lines in testout.csv are shown below (for readability, I have put each record on its own line).

"STATE__","BGN_DATE","BGN_TIME","TIME_ZONE","COUNTY","COUNTYNAME","STATE","EVTYPE","BGN_RANGE","BGN_AZI","BGN_LOCATI","END_DATE","END_TIME","COUNTY_END","COUNTYENDN","END_RANGE","END_AZI","END_LOCATI","LENGTH","WIDTH","F","MAG","FATALITIES","INJURIES","PROPDMG","PROPDMGEXP","CROPDMG","CROPDMGEXP","WFO","STATEOFFIC","ZONENAMES","LATITUDE","LONGITUDE","LATITUDE_E","LONGITUDE_","REMARKS","REFNUM"
20,"8/25/2006 0:00:00","07:01:00 PM","CST",139,"OSAGE","KS","TSTM WIND",5,"WNW","OSAGE CITY","8/25/2006 0:00:00","07:01:00 PM",0,NA,5,"WNW","OSAGE CITY",0,0,NA,52,0,0,0,"",0,"","TOP","KANSAS, East","",3840,9554,3840,9554,".",617129
20,"8/25/2006 0:00:00","07:05:00 PM","CST",143,"OTTAWA","KS","HAIL",1,"S","MINNEAPOLIS","8/25/2006 0:00:00","07:05:00 PM",0,NA,1,"S","MINNEAPOLIS",0,0,NA,88,0,0,0,"",0,"","TOP","KANSAS, East","",3907,9743,3907,9743,"Dime to nickel sized hail. .",617130
20,"8/25/2006 0:00:00","07:07:00 PM","CST",125,"MONTGOMERY","KS","TSTM WIND",3,"N","COFFEYVILLE","8/25/2006 0:00:00","07:07:00 PM",0,NA,3,"N","COFFEYVILLE",0,0,NA,61,0,0,0,"",0,"","ICT","KANSAS, Southeast","",3705,9538,3705,9538,"",617131"
Input contains no \n. Taking this to be a filename to open
File opened, filesize is  1.05E-06B
File is opened and mapped ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 5 (the last non blank line in the first 'autostart') ... found ok
Found 37 columns
First row with 37 fields occurs on line 5 (either column names or first row of data)
Some fields on line 5 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol after first data row: 2
Subtracted 1 for last eol and any trailing empty lines, leaving 1 data rows
Type codes: 1444144414444111441111111414444111141 (first 5 rows)
Type codes: 1444144414444111441111111414444111141 (after applying colClasses and integer64)
Type codes: 1444144414444111441111111414444111141 (after applying drop or select (if supplied)

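Any ideas what is causing the unexpected result and the original error? Is there a workaround? Just to be clear, my goal is to be able to read the main file with fread(), even though read.csv() has worked so far.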


What happens if you use read.table from the base package (calling it directly offers more options than read.csv)? - Carl Witthoft
dfrt <- read.table("testout.csv", header=TRUE, sep=",") gives me the same result as read.csv, as expected. - Ricky
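Is there a newline in the file, perhaps inside a quoted field? If so, fread can't currently handle that. It could be on the line after the line 4 you showed, in the penultimate field, which is "" on line 4. - Matt Dowle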
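Or in the "Dime to nickel" field. See those two periods? Did read.csv convert the newline into a period? I think fread should retain the newline when it handles this case. What do you think? - Matt Dowle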
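Thanks Matt, you're right. I just checked the data again, and there is a newline between the two periods after "Dime to nickel" that does not show up in the text above. I think fread should retain the newline when sep is explicitly set to some other character. With automatic detection, I guess a newline could cause confusion. But with a well-defined separator, as in the csv file above, I think a newline should be treated like any other character. That seems to be how read.csv handles it, which gives a reassuring consistency. - Ricky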
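One way to check for the quoted-field newlines discussed in these comments is base R's count.fields(), which reports NA for a physical line that ends inside an open quote. The sketch below is only an illustration: the sample file is hypothetical and cut down to three columns, and on a real file the same call would be pointed at the csv in question.

```r
# Hypothetical sample file: the second record has a line break inside REMARKS.
path <- tempfile(fileext = ".csv")
writeLines(c(
  '"COUNTYNAME","REMARKS","REFNUM"',
  '"OSAGE",".",617129',
  '"OTTAWA","Dime to nickel sized hail.',
  '.",617130',
  '"MONTGOMERY","",617131'
), path)

# One count per physical line; NA marks a line that ends inside an open quote.
# quote is set explicitly so that apostrophes in free text are not treated as quotes.
counts <- count.fields(path, sep = ",", quote = "\"")
broken_lines <- which(is.na(counts))
print(broken_lines)

# Show the offending physical line and the line that completes the record.
print(readLines(path)[c(broken_lines, broken_lines + 1)])
```

The line numbers returned are physical line numbers in the file, so they can be compared directly with the line reported in a parser error.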
1 Answer


Update: this is now fixed in v1.9.3 on GitHub:

Windows users are reporting success here with the latest version from GitHub.
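For anyone unsure whether their installed version includes the fix, a defensive pattern is to try fread() only when data.table is at least v1.9.3 and to fall back to read.csv() if it still fails. The sketch below is an assumption-laden illustration: `read_with_fallback` is a hypothetical helper, not part of data.table, and the sample file is made up.

```r
# Hypothetical helper (not a data.table function): prefer fread() when the
# installed version is at least 1.9.3, otherwise or on error use read.csv().
read_with_fallback <- function(path) {
  has_fix <- requireNamespace("data.table", quietly = TRUE) &&
    utils::packageVersion("data.table") >= "1.9.3"
  if (has_fix) {
    result <- tryCatch(
      as.data.frame(data.table::fread(path, sep = ",")),
      error = function(e) NULL  # fall through to read.csv() below
    )
    if (!is.null(result)) return(result)
  }
  read.csv(path, stringsAsFactors = FALSE)
}

# Made-up sample with a line break inside a quoted field.
path <- tempfile(fileext = ".csv")
writeLines(c(
  '"STATE","REMARKS","REFNUM"',
  '"KS",".",617129',
  '"KS","Dime to nickel sized hail.',
  '.",617130',
  '"KS","",617131'
), path)

dat <- read_with_fallback(path)
print(nrow(dat))
```

The fallback only triggers on an error; comparing the row count against read.csv() is still worth doing, since the question shows fread() returning too few rows without failing.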


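I still get the same error as Ricky after installing v1.9.3 from GitHub. However, I am on a Mac (x86_64-apple-darwin13.1.0). I must be taking the same class as Ricky, because my assignment is exactly the same. - C8H10N4O2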
