I tried to read a 540 MB csv file with the fread() function from the data.table library, but got the following error message:
' ends field 36 on line 4 when detecting types: 20.00,8/25/2006 0:00:00,"07:05:00 PM","CST",143.00,"OTTAWA","KS","HAIL",1.00,"S","MINNEAPOLIS",8/25/2006 0:00:00,"07:05:00 PM",0.00,,1.00,"S","MINNEAPOLIS",0.00,0.00,,88.00,0.00,0.00,0.00,,0.00,,"TOP","KANSAS, East",,3907.00,9743.00,3907.00,9743.00,"Dime to nickel sized hail.
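I don't know what caused the error. I would like to find out whether it is a bug or a data formatting issue that can be handled by adjusting the fread() call.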
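I was able to read the csv file with read.csv(), so I tracked down the row that triggers the error above (it is row 617174, not line 4 as the error message says). I then wrote that row, together with the row before and the row after it, to testout.csv using write.csv().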
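A minimal sketch of that extraction step. The data frame `storms`, its columns, and the row index 5 are hypothetical stand-ins (the real data has 37 columns and the offending row is 617174); only read.csv() and write.csv() come from the post.

```r
# Hypothetical stand-in for the data frame returned by read.csv() on the main file
storms <- data.frame(
  id      = 1:10,
  remarks = c(rep("", 4), "Dime to nickel sized hail.", rep("", 5)),
  stringsAsFactors = FALSE
)

bad_row <- 5L  # assumed index; in the post this would be 617174

# Keep the offending row plus one neighbour on each side, clamped to the table
keep <- seq(max(1L, bad_row - 1L), min(nrow(storms), bad_row + 1L))

# row.names = FALSE avoids writing an extra leading column of row names
out_file <- file.path(tempdir(), "testout.csv")
write.csv(storms[keep, ], out_file, row.names = FALSE)

# Read the small file back to confirm the three rows survived the round trip
check <- read.csv(out_file, stringsAsFactors = FALSE)
nrow(check)
```

Writing to tempdir() is only a choice made for this sketch so that it leaves no files behind.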
I can read testout.csv back with read.csv(), which gives a data frame with 3 observations, as expected. However, reading testout.csv with fread() gives a data table containing only the last row.
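The comparison can be sketched on a tiny synthetic file. This assumes the problem field is a quoted string that contains a line break, which is what the truncated error message suggests; the file name and the two columns are made up for illustration. The fread() result is left unstated because it may depend on the installed data.table version.

```r
library(data.table)

# Synthetic file: record 2 has a quoted field spanning two physical lines (assumption)
tmp <- file.path(tempdir(), "embedded_newline.csv")
writeLines(c(
  'id,remarks',
  '1,"."',
  '2,"Dime to nickel sized hail.',
  '."',
  '3,""'
), tmp)

df <- read.csv(tmp, stringsAsFactors = FALSE)  # base R reader
dt <- fread(tmp)                               # data.table reader

# Compare how many records each reader sees
nrow(df)
nrow(dt)
```

If the two counts differ, the quoted line break is a likely suspect.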
The four lines in testout.csv are as follows (for readability, I started a new line for each entry).
20,"8/25/2006 0:00:00","07:01:00 PM","CST",139,"OSAGE","KS","TSTM WIND",5,"WNW","OSAGE CITY","8/25/2006 0:00:00","07:01:00 PM",0,NA,5,"WNW","OSAGE CITY",0,0,NA,52,0,0,0,"",0,"","TOP","KANSAS, East","",3840,9554,3840,9554,".",617129
20,"8/25/2006 0:00:00","07:05:00 PM","CST",143,"OTTAWA","KS","HAIL",1,"S","MINNEAPOLIS","8/25/2006 0:00:00","07:05:00 PM",0,NA,1,"S","MINNEAPOLIS",0,0,NA,88,0,0,0,"",0,"","TOP","KANSAS, East","",3907,9743,3907,9743,"Dime to nickel sized hail. .",617130
20,"8/25/2006 0:00:00","07:07:00 PM","CST",125,"MONTGOMERY","KS","TSTM WIND",3,"N","COFFEYVILLE","8/25/2006 0:00:00","07:07:00 PM",0,NA,3,"N","COFFEYVILLE",0,0,NA,61,0,0,0,"",0,"","ICT","KANSAS, Southeast","",3705,9538,3705,9538,"",617131"
Input contains no \n. Taking this to be a filename to open
File opened, filesize is 1.05E-06B
File is opened and mapped ok
Detected eol as \r\n (CRLF) in that order, the Windows standard.
Looking for supplied sep ',' on line 5 (the last non blank line in the first 'autostart') ... found ok
Found 37 columns
First row with 37 fields occurs on line 5 (either column names or first row of data)
Some fields on line 5 are not type character (or are empty). Treating as a data row and using default column names.
Count of eol after first data row: 2
Subtracted 1 for last eol and any trailing empty lines, leaving 1 data rows
Type codes: 1444144414444111441111111414444111141 (first 5 rows)
Type codes: 1444144414444111441111111414444111141 (after applying colClasses and integer64)
Type codes: 1444144414444111441111111414444111141 (after applying drop or select (if supplied)
Any idea what causes the unexpected result and the original error? Is there a workaround? To be clear, my goal is to read the main file with fread(), even though read.csv() already works.
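Until fread() can read the file directly, one possible stopgap (a sketch, not a fix for the fread() behaviour itself) is to keep reading with read.csv() and convert the result to a data.table with setDT(). The file and column names below are hypothetical, and this will be slower than fread() on a 540 MB file.

```r
library(data.table)

# Hypothetical small file standing in for the main csv
tmp <- file.path(tempdir(), "fallback.csv")
writeLines(c(
  'id,remarks',
  '1,"Dime to nickel sized hail.',
  '."',
  '2,""'
), tmp)

# read.csv() parses the quoted multi-line field; setDT() converts by reference
dt <- setDT(read.csv(tmp, stringsAsFactors = FALSE))

is.data.table(dt)
nrow(dt)
```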
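What happens with read.table (which offers more options than calling read.csv)? - Carl Witthoft
Same result as read.csv, as expected. - Ricky
With the fread function, when sep is explicitly set to some other character, line breaks should be preserved. With the automatic behaviour, I guess line breaks could cause confusion. But for a csv file like the one above, with explicit delimiters, I think a line break should be treated like any other character. That seems to be how read.csv handles it, which gives a reassuring consistency. - Ricky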