I have 15 data files, each about 4.5 GB. Each file holds one month of data for roughly 17,000 customers; together they cover 15 months for the same 17,000 customers. I want to reformat the data: instead of 15 monthly files, I want 17,000 per-customer files, each containing that customer's complete data. I wrote a script to do this:
# The variable 'files' is a vector of paths to the 15 month files
exists = NULL  # keeps track of customers who already have a file created for them
for (w in 1:15) {  # for each of the 15 month files
  month = fread(files[w], select = c(2, 3, 6, 16))  # read in the columns I want
  custlist = unique(month$CustomerID)  # all customers in this month file
  for (i in 1:length(custlist)) {  # for each customer in this month file
    curcust = custlist[i]  # the current customer
    newchunk = subset(month, CustomerID == curcust)  # all of this customer's rows
    filename = sprintf("cust%s", curcust)  # the filename for this customer
    if (curcust %in% exists) {  # if a file already exists for this customer, read it, append, and write it back
      custfile = fread(strwrap(sprintf("C:/custFiles/%s.csv", filename)))  # read in the file
      custfile$V1 = NULL  # remove the extra row-name column fread picks up
      custfile = rbind(custfile, newchunk)  # combine the existing data with the new chunk
      write.csv(custfile, file = strwrap(sprintf("C:/custFiles/%s.csv", filename)))
    } else {  # if no file exists yet, write newchunk to a new csv
      write.csv(newchunk, file = strwrap(sprintf("C:/custFiles/%s.csv", filename)))
      exists = rbind(exists, curcust, deparse.level = 0)  # record that this customer now has a file
    }
  }
}
The script works (at least I'm fairly sure it does). The problem is that it is extremely slow: at the current rate it would take a week or more to finish, and I don't have that much time. Is there a better, faster way to do this in R? Or should I try SQL instead? I've never used SQL before, but could someone show me how it would be used for something like this? Any advice is appreciated.
Skip the rbind and its expensive in-memory copies: fread each month file, then write.table each customer's subset with append=TRUE. That will be much faster. – daroczig
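A minimal sketch of that append-based approach: read each month file once, then append each customer's rows to that customer's csv with `write.table(..., append = TRUE)`, so no per-customer file is ever read back in or rbind-ed. The column positions (2, 3, 6, 16) and the `CustomerID` column name are taken from the question; the function name and arguments here are illustrative, so adjust them to the real files.

```r
library(data.table)

# Split a set of month files into one csv per customer, appending as we go.
# month_files: character vector of input csv paths
# out_dir:     directory to write the per-customer files into
# cols:        column positions to keep (must include the CustomerID column)
split_by_customer <- function(month_files, out_dir, cols = c(2, 3, 6, 16)) {
  for (f in month_files) {
    month <- fread(f, select = cols)  # read each month file exactly once
    for (cust in unique(month$CustomerID)) {
      outfile <- file.path(out_dir, sprintf("cust%s.csv", cust))
      write.table(month[CustomerID == cust],
                  file = outfile, sep = ",", row.names = FALSE,
                  col.names = !file.exists(outfile),  # header only on first write
                  append = file.exists(outfile))      # append on later months
    }
  }
}
```

Each existing customer file is only ever appended to, never reread, which removes both the repeated `fread` calls and the growing `rbind` copies from the inner loop.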