以下是员工的开始和结束日期列表。如果有成千上万的员工,我们如何高效地计算每月的员工人数呢?
begin <- as.Date(c("2015-07-06","2015-07-06","2015-07-27","2015-07-06","2015-06-29",
"2015-07-06","2015-07-06","2015-07-06","2015-07-13","2015-06-29",
"2015-07-06","2015-07-06","2015-07-13","2015-07-01","2015-07-06",
"2015-07-06","2015-07-06","2015-07-09","2015-07-13"),format = "%Y-%m-%d")
end <- as.Date(c("2018-08-03","2016-01-11","2999-12-31","2017-03-13","2999-12-31",
"2015-10-20","2999-12-31","2018-09-24","2999-12-31","2015-09-25",
"2019-11-01","2999-12-31","2018-03-26","2018-08-08","2015-10-13",
"2999-12-31","2999-12-31","2021-02-11","2999-12-31"), format = "%Y-%m-%d")
smallEmp <- data.frame(begin,end)
method_currentsolution = function() {
date_df = tibble(Date = seq(from=as.Date('2015-01-01'), to=Sys.Date(), by = "month"))
queue_history = merge(smallEmp, date_df, all=TRUE) %>%
filter(Date >= begin, Date <= end) %>%
group_by(Date) %>%
summarise(cnt = n())
}
result <- method_currentsolution()
smallEmp
是包含开始日期和结束日期的数据集。
这个解决方案可以工作,但对于非常大的数据集来说速度很慢,因为它会为每个感兴趣的月份重复所有员工。如果有任何建议可以加快这个过程,将不胜感激。
sapply
运行时间为10.45毫秒,data.table
运行时间为27.05毫秒,OP
运行时间为3.91秒,fuzzy_inner_join
运行时间为17.11秒(此时我的系统并不完全空闲)。所以基本上没有变化。 - Benjamin Christoffersentransform(count = sapply(Date, \(x) sum(x >= begin & x <= end)))
替换为类似于transform(count = c(sapply(Depart, \(d){ b <- begin[id_depart == d]; e <- end[id_depart == d]; sapply(Date[Depart == d], \(x) sum(x >= b & x <= e)) })))
的内容,它应该可以工作。其中Depart
是在开始时添加到data.frame
中的唯一部门,而id_depart
则是每个人的部门。 - Benjamin Christoffersen