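I have a very large dataset stored as a .csv file, and it does not fit into memory. However, I only need 3 of its columns, and those do fit into memory. How can I load just those columns?
Update: How can I select the columns by name rather than by index? I don't know their indices.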
If your table is very large, consider using the data.table package:
# create an example: 10,000 rows by 100 columns
df <- data.frame(matrix(rnorm(1e6), ncol = 100))
write.csv(df, "sample.csv", row.names = FALSE)
library(data.table)
dt <- fread("sample.csv", select = c(3, 8, 20))
head(dt)
# X3 X8 X20
# 1: 0.5537762 1.0271272 -0.14437400
# 2: -0.4111327 -0.2297311 -1.04998490
# 3: -1.2540440 0.6977565 -0.21514021
# 4: -1.1500974 -0.3181102 -0.07910133
# 5: -0.6549245 1.8385510 0.73741980
# 6: 0.8049360 0.4722533 -0.65750679
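This reads only columns 3, 8, and 20, and it is very fast. To select by name instead of by index, pass the column names to select, e.g. select=c("X3","X8","X20").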
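Could you preprocess with awk before loading the file into R? If so, and you want, say, columns 2, 3, and 5, you could do the following:
awk -F, -v OFS=, '{print $2,$3,$5}' yourfile.csv > cols23and5.csv
This breaks if your fields contain embedded commas, as in:
"Field 1","Field 2, with commas, in it","Field 3","Field 4, also with commas,,,"
"Field 1","Field 2, with commas, in it","Field 3","Field 4, also with commas,,,"
In that case, first use sed to turn the "," separators into colons and strip the remaining quotes:
sed -e 's/","/:/g' -e 's/"//g' yourfile.csv > ColonSeparated.csv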
Your file then becomes:
Field 1:Field 2, with commas, in it:Field 3:Field 4, also with commas,,,
Field 1:Field 2, with commas, in it:Field 3:Field 4, also with commas,,,
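One caveat: the colon trick only works if no field already contains a colon, since such a colon would later be mistaken for a separator. A minimal sketch of a guarded conversion follows; the file names and the helper name to_colon_separated are made up for illustration, and the sample data is invented.

```shell
# Made-up sample with quoted fields, some containing commas.
cat > quoted.csv <<'EOF'
"a1","b1, with comma","c1","d1, x,,"
"a2","b2, with comma","c2","d2, y,,"
EOF

# to_colon_separated FILE (hypothetical helper)
# Refuses to convert if a colon already occurs anywhere in the input,
# because it would later be read as a field separator.
to_colon_separated() {
  if grep -q ':' "$1"; then
    echo "input already contains ':'; pick another delimiter" >&2
    return 1
  fi
  # Same two substitutions as above: "," becomes :, then drop leftover quotes.
  sed -e 's/","/:/g' -e 's/"//g' "$1"
}

to_colon_separated quoted.csv > colon.csv
```

If the check fails, any other character that never appears in the data (a pipe or a tab, for instance) could be substituted in both the sed and awk steps.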
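You can then use the colon as the delimiter, without worrying about embedded commas, and process the file with awk:
awk -F: -v OFS=: '{print $2,$3,$4}' ColonSeparated.csv > SmallFileForR.csv
The result is colon-separated, so tell R to use sep=":" when reading it.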
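The update to the question asks about selecting columns by name rather than by position. With awk, one possible approach is to look the names up in the header row first. The sketch below assumes the first line is a header, that header names are unquoted, and that no field contains the separator; select_cols, the sample file, and its column names are all hypothetical.

```shell
# Made-up sample; the column names are assumptions for illustration.
printf 'id,name,score,city\n1,ann,90,Oslo\n2,bob,85,Rome\n' > sample_named.csv

# select_cols FILE NAMES [SEP]   (hypothetical helper)
#   FILE  : delimited text file whose first line is the header
#   NAMES : comma-separated list of column names to keep, in output order
#   SEP   : single-character field separator (default: comma)
select_cols() {
  awk -F"${3:-,}" -v OFS="${3:-,}" -v want="$2" '
    NR == 1 {
      n = split(want, names, ",")
      for (i = 1; i <= NF; i++) pos[$i] = i   # header name -> field number
      for (j = 1; j <= n; j++) {
        if (!(names[j] in pos)) {
          print "no such column: " names[j] > "/dev/stderr"
          exit 1
        }
        idx[j] = pos[names[j]]
      }
    }
    {
      # Runs for the header too, so the output keeps a header row.
      line = $(idx[1])
      for (j = 2; j <= n; j++) line = line OFS $(idx[j])
      print line
    }
  ' "$1"
}

select_cols sample_named.csv name,city > named_cols.csv
```

For a colon-separated file like the one produced by the sed step, the separator would be passed as the third argument, e.g. select_cols ColonSeparated.csv "Field 2,Field 3" : if those were the header names.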