Let's see how fast data.table is and compare it with dplyr. This is roughly how you would do it in dplyr.
data %>% group_by(PID, Time, Site, Rep) %>%
summarise(totalCount = sum(Count)) %>%
group_by(PID, Time, Site) %>%
summarise(mean(totalCount))
Or, depending on exactly how the question is interpreted, perhaps this:
data %>% group_by(PID, Time, Site) %>%
summarise(totalCount = sum(Count), meanCount = mean(Count))
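To make the difference between the two readings concrete, here is a minimal sketch on a made-up four-row data frame. The name `toy`, its values, and the column name `meanTotal` are illustrative assumptions, not part of the question, and the `.groups` argument assumes dplyr 1.0.0 or later.

```r
library(dplyr)

# Hypothetical toy data: a single PID/Time/Site group with two replicates
toy <- data.frame(PID = "a", Time = "t1", Site = "s1",
                  Rep = c(1, 1, 2, 2),
                  Count = c(1, 2, 3, 4))

# Interpretation 1: total per replicate, then the mean of those totals
by_rep <- toy %>%
  group_by(PID, Time, Site, Rep) %>%
  summarise(totalCount = sum(Count), .groups = "drop") %>%
  group_by(PID, Time, Site) %>%
  summarise(meanTotal = mean(totalCount), .groups = "drop")

# Interpretation 2: total and mean over all rows of the group
overall <- toy %>%
  group_by(PID, Time, Site) %>%
  summarise(totalCount = sum(Count), meanCount = mean(Count), .groups = "drop")

by_rep$meanTotal   # 5   (mean of the replicate totals 3 and 7)
overall$meanCount  # 2.5 (mean of 1, 2, 3, 4)
```

Which of the two is wanted depends on whether replicates should be summed before averaging.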
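Here is a complete example comparing these alternatives with the answer proposed by @Ramnath and the one @David Arenburg suggested in the comments, which I believe is equivalent to the second dplyr statement.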
nrow <- 510000
data <- data.frame(PID = sample(letters, nrow, replace = TRUE),
Time = sample(letters, nrow, replace = TRUE),
Site = sample(letters, nrow, replace = TRUE),
Rep = rnorm(nrow),
Count = rpois(nrow, 100))
library(dplyr)
library(data.table)
# 1. dplyr, first interpretation: sum per Rep, then mean of the sums
Rprof(tf1 <- tempfile())
ans <- data %>% group_by(PID, Time, Site, Rep) %>%
summarise(totalCount = sum(Count)) %>%
group_by(PID, Time, Site) %>%
summarise(mean(totalCount))
Rprof()
summaryRprof(tf1)
# 2. dplyr, second interpretation: sum and mean per PID/Time/Site
Rprof(tf2 <- tempfile())
ans <- data %>% group_by(PID, Time, Site) %>%
summarise(total = sum(Count), meanCount = mean(Count))
Rprof()
summaryRprof(tf2)
# 3. data.table, converting with data.table() (makes a copy of data)
Rprof(tf3 <- tempfile())
data_t = data.table(data)
ans = data_t[,list(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']
Rprof()
summaryRprof(tf3)
# 4. data.table, converting with setDT() (converts data in place)
Rprof(tf4 <- tempfile())
ans <- setDT(data)[,.(A = sum(Count), B = mean(Count)), by = 'PID,Time,Site']
Rprof()
summaryRprof(tf4)
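As a sanity check on the claim that the data.table call matches the second dplyr statement, the following sketch compares the two on a small simulated data set. The names `small`, `res_dplyr`, and `res_dt`, the seed, and the sizes are arbitrary choices for illustration; `.groups` assumes dplyr 1.0.0 or later.

```r
library(dplyr)
library(data.table)

set.seed(1)  # arbitrary seed, only so the check is reproducible
n <- 1000
small <- data.frame(PID = sample(letters[1:3], n, replace = TRUE),
                    Time = sample(letters[1:3], n, replace = TRUE),
                    Site = sample(letters[1:3], n, replace = TRUE),
                    Count = rpois(n, 100))

# Second dplyr statement, with the same output names as the data.table call
res_dplyr <- small %>%
  group_by(PID, Time, Site) %>%
  summarise(A = sum(Count), B = mean(Count), .groups = "drop")

# data.table version, run on a copy so `small` stays a data.frame
res_dt <- as.data.table(small)[, .(A = sum(Count), B = mean(Count)),
                               by = .(PID, Time, Site)]

# data.table returns groups in order of first appearance; sort to match dplyr
setorder(res_dt, PID, Time, Site)

# Compare the aggregated columns group by group
same_sum  <- isTRUE(all.equal(res_dplyr$A, res_dt$A))
same_mean <- isTRUE(all.equal(res_dplyr$B, res_dt$B))
c(same_sum, same_mean)
```

Both approaches should agree on every group; only the row order differs before sorting.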
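The data.table approach is very fast, and setDT is even faster, since it converts the data frame by reference instead of copying it!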
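The difference between the two conversions can be sketched on a tiny example. The objects `df1`, `df2`, and `dt_copy` are made up for illustration.

```r
library(data.table)

df1 <- data.frame(x = 1:3)
df2 <- data.frame(x = 1:3)

# data.table() builds a new object; df1 itself is left untouched
dt_copy <- data.table(df1)
is.data.table(df1)  # FALSE

# setDT() converts df2 in place, without copying its columns
setDT(df2)
is.data.table(df2)  # TRUE
```

One consequence for the timings above: after the setDT(data) call, data itself is a data.table for the rest of the session.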