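I am using dplyr to do some data manipulation on my huge data frame (b). The code already works on smaller subsets of the data, so I suspect the problem is the size of the data frame.
I have a data frame with 4 million rows and 34 columns.
My code is as follows: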
library(dplyr)

df <- b %>%
  group_by(Id) %>%
  mutate(numberoflead = n(),                                # number of leads
         lastcreateddateoflead = max(CreatedDate),          # last lead creation date
         firstcreateddateoflead = min(CreatedDate),         # first lead creation date
         lastcloseddate = max(Kapanma.tarihi....),          # last closing date (Kapanma tarihi)
         yas = as.Date(lastcloseddate) - as.Date(firstcreateddateoflead),  # age (yas)
         leadduration = as.Date(lastcreateddateoflead) - as.Date(firstcreateddateoflead)) %>%  # lead duration
  inner_join(b %>%
               select(Id, CreatedDate, lasttouch = Lead_DataSource__c),
             by = c("Id" = "Id", "lastcreateddateoflead" = "CreatedDate")) %>%  # lasttouch
  inner_join(b %>%
               select(Id, CreatedDate, firsttouch = Lead_DataSource__c),
             by = c("Id" = "Id", "firstcreateddateoflead" = "CreatedDate")) %>%  # firsttouch
  inner_join(b %>%
               select(Id, Kapanma.tarihi...., laststagestatus = StageName),  # laststagestatus
             by = c("Id" = "Id", "lastcloseddate" = "Kapanma.tarihi...."))
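It runs fine on smaller subsets of my data frame, but when I run the code above on the full data frame it takes a very long time and eventually crashes. I think the problem may be the 4 million rows.
Does anyone have suggestions for how to solve this? Thanks for your help!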
data.table, i.e. setDT(b)[, c('numberoflead', 'lastcreateddateoflead') := .(.N, max(CreatedDate)), Id]. - akrun
dtplyr (the data.table backend for dplyr) and dbplyr (the SQL database backend for dplyr). - Ben Bolker
dbplyr, or see https://cran.r-project.org/web/views/HighPerformanceComputing.html. - Ben Bolker
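To make the data.table suggestion above more concrete, here is a minimal sketch that extends akrun's one-liner to all of the columns in the question. It is an illustration under assumptions, not a drop-in replacement: the toy data frame is made up, the date columns are assumed to be Date (or POSIXct) with no missing values, and the three self-joins are swapped for which.max()/which.min() lookups inside each group. That swap changes behavior on ties: the joins return one row per matching record and can multiply rows, while which.max() keeps the first match only.

```r
library(data.table)

# Hypothetical toy data standing in for b; column names follow the question.
b <- data.frame(
  Id = c(1, 1, 2, 2, 2),
  CreatedDate = as.Date(c("2020-01-01", "2020-03-01",
                          "2020-02-01", "2020-02-15", "2020-04-01")),
  Kapanma.tarihi.... = as.Date(c("2020-02-01", "2020-05-01",
                                 "2020-06-01", "2020-03-01", "2020-05-15")),
  Lead_DataSource__c = c("web", "phone", "email", "web", "event"),
  StageName = c("Open", "Won", "Lost", "Open", "Won"),
  stringsAsFactors = FALSE
)

# as.data.table() copies b; setDT(b) would convert it in place instead,
# which avoids a second copy of a 4-million-row table.
dt <- as.data.table(b)

# Per-Id summaries, added by reference (no intermediate copies, no joins).
dt[, `:=`(
  numberoflead           = .N,
  lastcreateddateoflead  = max(CreatedDate),
  firstcreateddateoflead = min(CreatedDate),
  lastcloseddate         = max(Kapanma.tarihi....),
  # Value of the row holding the latest / earliest date within each Id.
  lasttouch              = Lead_DataSource__c[which.max(CreatedDate)],
  firsttouch             = Lead_DataSource__c[which.min(CreatedDate)],
  laststagestatus        = StageName[which.max(Kapanma.tarihi....)]
), by = Id]

# Row-wise date differences do not need grouping.
dt[, `:=`(
  yas          = as.Date(lastcloseddate) - as.Date(firstcreateddateoflead),
  leadduration = as.Date(lastcreateddateoflead) - as.Date(firstcreateddateoflead)
)]

print(dt)
```

Because := modifies the table by reference, the result keeps the original number of rows and row order. If the duplicated rows produced by the joins are actually wanted, this sketch would need a join step added back.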