I am reading a data file that looks like this:
userId, fullName,email,password,activated,registrationDate,locale,notifyOnUpdates,lastSyncTime,plan_id,plan_period_months,plan_price,plan_exp_date,plan_is_trial,plan_is_trial_used,q_hear,q_occupation,pp_subid,pp_payments,pp_since,pp_cancelled,apikey
"2","John Smith,"john.smith@gmail.com","a","1","2004-07-23 14:19:32","en_US","1","2011-04-07 07:29:17","3",\N,\N,\N,"0","1",\N,\N,\N,\N,\N,\N,"d7734dce-4ae2-102a-8951-0040ca38ff83"
The actual file, however, has about 20,000 records. I read it in with the following R code:
user = read.csv("~/Desktop/dbdump/users.txt", na.strings = "\\N", quote="")
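I use quote="" because without it the import stops prematurely and I end up with a total of 9569 observations. I don't understand exactly why quote="" overcomes this problem, but it seems to do so.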
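However, it introduces other problems that I then have to "fix". The first one I noticed is that the dates end up as strings that include the quote characters, and they do not convert to actual dates when I call as.Date() on them.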
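A minimal sketch of the "fix the strings" route, assuming the date columns arrive as character values with literal double quotes around them. The vector `raw` is made-up sample data, not taken from the real file:

```r
# Hypothetical values as they might look after reading with quote = ""
raw <- c("\"2004-07-23 14:19:32\"", "\"2008-01-19 08:47:45\"")

# Strip the literal double quotes left in the strings
clean <- gsub("\"", "", raw, fixed = TRUE)

# Date only: as.Date ignores trailing characters beyond the given format
dates <- as.Date(clean, format = "%Y-%m-%d")

# Date and time: the UTC time zone is an assumption, the dump does not state one
stamps <- as.POSIXct(clean, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

print(dates)
print(stamps)
```

Values that were read as NA (the \N entries) stay NA through both steps.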
Now, I could fix up the strings and hack my way through it, but I would rather learn more about what I am doing. Can someone explain:
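- Why does quote="" fix the "bad data"?
- Are there any best-practice techniques for figuring out what is causing read.csv to stop prematurely? (If I only look at the input data +/- a few lines around the indicated row, I don't see anything wrong.)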
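A small sketch of one possible mechanism, assuming the file contains a field with a missing closing quote (whether the real file does is an assumption). With the default quote="\"", the parser can treat everything up to the next quote character as part of one field, including line breaks, so rows get absorbed instead of imported. The three-column data below is invented for illustration:

```r
# Hypothetical miniature of the dump; row 2 lacks the closing quote after Bob
lines <- c(
  "id,name,city",
  "\"1\",\"Ann\",\"Oslo\"",
  "\"2\",\"Bob,\"Rome\"",
  "\"3\",\"Cy\",\"Lima\"",
  "\"4\",\"Di\",\"Bern\""
)
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

# Default quoting: the stray quote lets a field run across line breaks
with_quotes <- suppressWarnings(read.csv(tmp))

# quote = "": quote characters are ordinary text, so each line is one record
without_quotes <- read.csv(tmp, quote = "")

cat("rows with default quoting:", nrow(with_quotes), "\n")
cat("rows with quote = \"\":", nrow(without_quotes), "\n")
```

The trade-off is the one described above: with quote="" the quote characters stay in the values, and a comma inside a quoted field is counted as a separator.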
Here are the lines near the "problem". Do you see where the damage is?
"16888","user1","user1@gmail.com","TeilS12","1","2008-01-19 08:47:45","en_US","0","2008-02-23 16:51:53","1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"ad949a8e-17ed-102b-9237-0040ca390025"
"16889","user2","user2@gmail.com","Gaspar","1","2008-01-19 10:34:11","en_US","1",\N,"1",\N,\N,\N,"0","0","email","journalist",\N,\N,\N,\N,"8b90f63a-17fc-102b-9237-0040ca390025"
"16890","user3","user3@gmail.com","boomblaadje","1","2008-01-19 14:36:54","en_US","0",\N,"1",\N,\N,\N,"0","0","article","student",\N,\N,\N,\N,"73f31f4a-181e-102b-9237-0040ca390025"
"16891","user4","user4@gmail.com","mytyty","1","2008-01-19 15:10:45","en_US","1","2008-01-19 15:16:45","1",\N,\N,\N,"0","0","google-ad","student",\N,\N,\N,\N,"2e48e308-1823-102b-9237-0040ca390025"
"16892","user5","user5@gmail.com","08091969","1","2008-01-19 15:12:50","en_US","1",\N,"1",\N,\N,\N,"0","0","dont","dont",\N,\N,\N,\N,"79051bc8-1823-102b-9237-0040ca390025"
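* Update *
It is trickier than that. Although the total number of lines imported is 9569, the last few imported rows correspond to the last few rows of the data. I therefore surmise that something happened during the import that caused many lines to be skipped. In fact, 15914 - 9569 = 6345 records are missing. With quote="" I get 15914.
So my question can be revised to: is there a way to get read.csv to report the lines it decides not to import?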
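A minimal sketch of one way to approximate such a report, assuming a line counts as suspect when its comma-separated field count differs from the number of header columns. It uses count.fields and readLines from base R; the three-column data and the `expected` value are hypothetical stand-ins for the real 22-column file:

```r
# Hypothetical miniature of users.txt: 3 columns instead of 22
lines <- c(
  "id,name,city",
  "\"1\",\"Ann\",\"Oslo\"",
  "\"2\",\"Bob, Jr.\",\"Rome\"",
  "\"3\",\"Cy\",\"Lima\""
)
tmp <- tempfile(fileext = ".txt")
writeLines(lines, tmp)

expected <- 3  # would be 22 for the real file

# quote = "" makes every comma a separator, giving one count per physical line
n <- count.fields(tmp, sep = ",", quote = "")
suspect <- which(n != expected)

# Show each suspect line with its line number and field count
report <- data.frame(line = suspect, fields = n[suspect],
                     text = readLines(tmp)[suspect])
print(report)
```

Line numbers here include the header row. count.fields skips blank lines by default, so for a file with blank lines the positions would need blank.lines.skip = FALSE to line up with readLines.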
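* Update 2 *
@Dwin, I had to remove na.strings="\\N" because the count.fields function does not accept it. With that removed, I get this output, which looks interesting but which I don't understand.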
3 4 22 23 24
1 83 15466 178 4
Your second command produces a lot of data (and stops when max.print is reached), but the first row is this:
[1] 2 4 2 3 5 3 3 3 5 3 3 3 2 3 4 2 3 2 2 3 2 2 4 2 4 3 5 4 3 4 3 3 3 3 3 2 4
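I don't understand whether the output is supposed to show how many fields there are in each input record. Clearly the first lines all have more than 2, 4, 2, etc. fields... I feel like I am getting closer, but I am still confused!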
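[...] what the data looks like before and after quote = ""? Or could you post a few more rows of data, ideally some "healthy" ones and some bad ones? In short, can you provide a reproducible example? - Roman Luštrik
[...] readLines to collect them. - Carl Witthoft
[...] which(count.fields("~/Desktop/dbdump/users.txt", quote="", sep=",") != 22). - IRTFM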