How do I count records per day between a start date and an end date in R?

8

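I have a data frame of items, and each item has a start date and an end date. I want to know how many items are active on each day within a given period.

Example dataset: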

library(lubridate)  # provides ymd()
ItemId <- c(1,2,3)
StartDate <- c(ymd("2014-01-01"),ymd("2014-02-01"),ymd("2014-03-01"))
EndDate <- c(ymd("2014-02-15"),ymd("2014-02-07"),ymd("2014-03-03"))
data.frame(ItemId,StartDate,EndDate)
  ItemId           StartDate             EndDate
1      1 2014-01-01 01:00:00 2014-02-15 01:00:00
2      2 2014-02-01 01:00:00 2014-02-07 01:00:00
3      3 2014-03-01 01:00:00 2014-03-03 01:00:00

The result should look something like this (one entry per day):
Date        ActiveCount
2014-01-01  1
2014-01-02  1
...
2014-02-01  2
...

I have a solution using sqldf, but I'm not sure how to implement it in R.

select d.date
,      ( select count(ItemID)
         from   items
         where  startdate <= d.date
         and    enddate >= d.date
       ) activecount
from   (select distinct startdate from items
        union
        select distinct enddate from items
       ) d
order by 1

My data has multiple entries per day, so this works in R with sqlite. In PostgreSQL I can generate a series of dates, which is somewhat better. Thanks.

Have a look at the foverlaps() function in the data.table package, or search for it here on SO. - Arun
Note that the SQL statement could use where d.date between startdate and enddate - G. Grothendieck
4 Answers

9

Calling your data df:

dates = seq(min(df$StartDate), max(df$EndDate), by = "day")

counts = data.frame(date = dates,
                    count = sapply(dates, function(x) sum(x <= df$EndDate & x >= df$StartDate)))
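
As a quick check, the same idea can be run against the example data from the question. This is only a sketch: it builds the dates with base R's as.Date() instead of ymd() so that it needs no packages, and it uses vapply() rather than sapply() purely to pin down the return type.

```r
# Example data from the question, built with base R dates
df <- data.frame(
  ItemId    = c(1, 2, 3),
  StartDate = as.Date(c("2014-01-01", "2014-02-01", "2014-03-01")),
  EndDate   = as.Date(c("2014-02-15", "2014-02-07", "2014-03-03"))
)

# Every calendar day from the earliest start to the latest end
dates <- seq(min(df$StartDate), max(df$EndDate), by = "day")

# For each day, count the items whose [StartDate, EndDate] range contains it
counts <- data.frame(
  date  = dates,
  count = vapply(dates,
                 function(x) sum(df$StartDate <= x & x <= df$EndDate),
                 integer(1))
)

head(counts)
```

Days on which no item is active (for example 2014-02-16 in this data) are kept with a count of 0, which the SQL in the question would not produce because it only looks at dates that appear as a start or end date.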

6

Whenever an R task resembles a SQL task, it is worth considering dplyr.

library(dplyr)
library(lubridate)  # provides ymd()
ItemId <- c(1,2,3)
StartDate <- c(ymd("2014-01-01"),ymd("2014-02-01"),ymd("2014-03-01"))
EndDate <- c(ymd("2014-02-15"),ymd("2014-02-07"),ymd("2014-03-03"))

jim <- data.frame(ItemId,StartDate,EndDate)

# One technique that's often useful, especially in R, is to take your
# iterator and define it as a variable. You can use that to implement
# a vectorised version of whatever you were thinking of doing.

ed <- data.frame(rng = seq(min(jim$StartDate), max(jim$EndDate), by = 'day'))
merge(jim, ed, all=TRUE) %>% 
     filter(rng >= StartDate, rng <= EndDate) %>%
     group_by(rng) %>% 
     summarise(n())

This gives you:
    rng         n()
1   2014-01-01  1 
2   2014-01-02  1
3   2014-01-03  1
...

5

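I have come back to this problem many times, always looking for the most efficient way to solve it.

I previously used a map-reduce approach, but found that it does not scale well to large data frames with wide date intervals. I recently tried the interval class from the lubridate package and found it to be the fastest implementation so far.

Here is the final code: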

library(tidyverse)
library(lubridate)

# Initialize a dataframe with start and end "active" dates per object
N = 1000
id_dates = tibble(id = 1 : N) %>%
  mutate(
    start = sample(seq(as.Date('2018-1-1'), as.Date('2019-1-1'), by = "day"), size = N, replace = TRUE),
    end   = start + sample(7 : 100, size = N, replace = TRUE),
    interval = interval(start, end))

# Use the %within% command to calculate the number of active items per date
queue_history = tibble(Date = seq(min(id_dates$start), max(id_dates$end), by = "1 day")) %>% 
  rowwise() %>% 
  mutate(numInWIP = sum(Date %within% id_dates$interval)) %>%
  ungroup()
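
For comparison, a different technique that avoids testing every date against every interval is to record +1 on each start day and -1 on the day after each end day, then take a running sum. The sketch below is not part of the original answer and has not been benchmarked against the methods discussed here; it uses base R only, and the names items, delta and active are illustrative.

```r
# Example intervals (same dates as the question's data)
items <- data.frame(
  start = as.Date(c("2014-01-01", "2014-02-01", "2014-03-01")),
  end   = as.Date(c("2014-02-15", "2014-02-07", "2014-03-03"))
)

first_day <- min(items$start)
n_days    <- as.integer(max(items$end) - first_day) + 1L

# 1-based day index of each start, and of the day after each end
start_idx <- as.integer(items$start - first_day) + 1L
stop_idx  <- as.integer(items$end - first_day) + 2L

# Net change in the number of active items on each day
delta <- tabulate(start_idx, nbins = n_days + 1L) -
         tabulate(stop_idx,  nbins = n_days + 1L)

# The running sum of the changes is the number of active items per day
active <- data.frame(
  Date        = first_day + 0:(n_days - 1L),
  ActiveCount = cumsum(delta)[seq_len(n_days)]
)
```

The work here grows with the number of items plus the number of days, rather than their product, which is why it may be worth trying on large inputs.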

The following benchmark indicates that the lubridate solution is much faster than both the current answer and the map-reduce approach:

library(tidyverse)
library(lubridate)

# Initialize a dataframe with start and end "active" dates per object
N = 1000
id_dates = tibble(id = 1 : N) %>%
  mutate(
    start = sample(seq(as.Date('2018-1-1'), as.Date('2019-1-1'), by = "day"), size = N, replace = TRUE),
    end   = start + sample(7 : 100, size = N, replace = TRUE),
    interval = interval(start, end))

# a map-reduce solution
method_mapreduce = function() {
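  # as.tibble() is deprecated; going through as.data.frame() keeps the Var1/Freq column names
  queue_history = as_tibble(as.data.frame(
    table(reduce(map2(id_dates$start, id_dates$end, seq, by = 1), c)),
    stringsAsFactors = FALSE))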
  queue_history = queue_history %>%
    rename(Date = Var1, numInWIP = Freq) %>%
    mutate(Date = as_date(Date))

  return (queue_history)
}

# a lubridate interval solution
method_intervals = function() {
  date_df = tibble(Date = seq(min(id_dates$start), max(id_dates$end), by = "1 day"))
  queue_history = date_df %>% 
    rowwise() %>% 
    mutate(numInWIP = sum(Date %within% id_dates$interval))

  return (queue_history)
}

# current best answer
method_currentsolution = function() {
  date_df = tibble(Date = seq(min(id_dates$start), max(id_dates$end), by = "1 day"))
  queue_history = merge(id_dates, date_df, all=TRUE) %>% 
    filter(Date >= start, Date <= end) %>%
    group_by(Date) %>% 
    summarise(n())

}

# Compare with benchmarks
tst = microbenchmark::microbenchmark(
  method_mapreduce(),
  method_intervals(),
  method_currentsolution(),
  times = 5)

microbenchmark::autoplot.microbenchmark(tst) +
  scale_y_log10(
    name   = sprintf("Time [%s]", attr(summary(tst), "unit")),
    breaks = scales::trans_breaks("log10", function(x) round(10^x)))


1
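Hi all, sorry to revive this old post, but I have a problem similar to the OP's and would like to use your code. However, I need the number of records per year (based on the start and end dates). How could I adapt your approach to do that? Many thanks. - daltoncito5034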
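
One possible way to adapt the idea to yearly counts is to compare years instead of days: an item is counted for a year if its interval overlaps that year at all. The sketch below is only an illustration with made-up data (items, start and end are hypothetical names and values), and it assumes that "active in a year" means active on at least one day of it.

```r
# Hypothetical items, some of which span a year boundary
items <- data.frame(
  id    = 1:4,
  start = as.Date(c("2017-11-15", "2018-03-01", "2018-12-20", "2019-06-01")),
  end   = as.Date(c("2018-01-10", "2018-04-15", "2019-02-01", "2019-06-30"))
)

# Calendar year of each start and end date
start_year <- as.integer(format(items$start, "%Y"))
end_year   <- as.integer(format(items$end, "%Y"))

# Every year from the earliest start to the latest end
years <- seq(min(start_year), max(end_year))

# An item is active in year y if it starts no later than y and ends no earlier than y
per_year <- data.frame(
  Year        = years,
  ActiveCount = vapply(years,
                       function(y) sum(start_year <= y & y <= end_year),
                       integer(1))
)

per_year
```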

2

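First you need to get all the dates on which at least one item is active, and then count the number of items active on each day. If your data is stored in itemDates, this should take care of it: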

dates <- min(itemDates$StartDate) + days(0:as.numeric(max(itemDates$EndDate) - min(itemDates$StartDate)))
dateCounts <- data.frame(
    row.names=dates,
    counts=sapply(dates, function(date)
        sum(date >= itemDates$StartDate & date <= itemDates$EndDate)))
