Split a sequence of dates into one chunk per month (with start and end dates)

Question

3

Suppose I have a data frame like this:

df <- data.frame(group = c("a", "a", "b"),
                 start = as.Date(c("2018-01-01", "2018-09-01", "2018-02-01")),
                 end = as.Date(c("2018-02-15", "2018-12-31", "2018-03-30")))

group      start        end
     a 2018-01-01 2018-02-15
     a 2018-09-01 2018-12-31
     b 2018-02-01 2018-03-30

I would like to get the following output:

output <- data.frame(group = c("a", "a", "a", "a", "a", "a", "b", "b"),
                  start = as.Date(c("2018-01-01", "2018-02-01", "2018-09-01",
                                    "2018-10-01", "2018-11-01", "2018-12-01",
                                    "2018-02-01", "2018-03-01")),
                  end = as.Date(c("2018-01-31", "2018-02-15", "2018-09-30",
                                  "2018-10-31", "2018-11-30", "2018-12-31",
                                  "2018-02-28", "2018-03-30")))

 group      start        end
     a 2018-01-01 2018-01-31
     a 2018-02-01 2018-02-15
     a 2018-09-01 2018-09-30
     a 2018-10-01 2018-10-31
     a 2018-11-01 2018-11-30
     a 2018-12-01 2018-12-31
     b 2018-02-01 2018-02-28
     b 2018-03-01 2018-03-30

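For each month covered by a sequence, I'd like a separate row bounded by 1) the start date of the sequence, or the start of the month if that is later, and 2) the end date of the sequence, or the end of the month if that is earlier. Any ideas on how to do this?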
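
Put differently, each output row runs from the later of the two start dates to the earlier of the two end dates. A minimal base R sketch of that rule follows. The helper name `split_by_month` is hypothetical (it does not come from any package), and the code assumes `start <= end` for every row:

```r
# Hypothetical helper: split one date range into per-month pieces (base R only).
split_by_month <- function(start, end) {
  # first day of the month that contains `start`
  first <- as.Date(format(start, "%Y-%m-01"))
  # first day of every month touched by the range
  som <- seq(first, end, by = "month")
  # last day of each month = first day of the following month minus one day
  eom <- seq(first, by = "month", length.out = length(som) + 1)[-1] - 1
  # clip each month to the original range
  data.frame(start = pmax(som, start), end = pmin(eom, end))
}

df <- data.frame(group = c("a", "a", "b"),
                 start = as.Date(c("2018-01-01", "2018-09-01", "2018-02-01")),
                 end = as.Date(c("2018-02-15", "2018-12-31", "2018-03-30")))

# apply the helper row by row and stack the pieces
pieces <- lapply(seq_len(nrow(df)), function(i) {
  data.frame(group = df$group[i], split_by_month(df$start[i], df$end[i]))
})
result <- do.call(rbind, pieces)
```

Because the month boundaries are derived from each row's own dates, this sketch should also handle ranges that begin mid-month or span several years.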

- arg0naut91

3 Answers

2

data.table solution

My favorite tool for this kind of problem is data.table's very fast foverlaps:

df <- data.frame(group = c("a", "a", "b"),
                 start = as.Date(c("2018-01-01", "2018-09-01", "2018-02-01")),
                 end = as.Date(c("2018-02-15", "2018-12-31", "2018-03-30")))

#create a lookup data frame with the first and last day of each month (hard-coded to 2018 here)
df2 <- data.frame( start = seq( as.Date("2018-01-01"), length = 12, by = "1 month" ),
                   end = seq( as.Date( "2018-02-01"), length = 12, by= "1 month" ) - 1,
                   stringsAsFactors = FALSE )

library(data.table)

#setDT on both data.frames... df2 needs to be keyed in order for foverlaps to work.
dt <- foverlaps( setDT( df ), setDT( df2, key = c("start", "end") ), type = "any", mult = "all" )[]
#clip each month to the original range: keep the later start and the earlier end
dt[ start < i.start, start := i.start ]
dt[ end > i.end, end := i.end ]
#cleaning
dt[, `:=`(i.start = NULL, i.end = NULL)][]

 #         start        end group
# 1: 2018-01-01 2018-01-31     a
# 2: 2018-02-01 2018-02-15     a
# 3: 2018-09-01 2018-09-30     a
# 4: 2018-10-01 2018-10-31     a
# 5: 2018-11-01 2018-11-30     a
# 6: 2018-12-01 2018-12-31     a
# 7: 2018-02-01 2018-02-28     b
# 8: 2018-03-01 2018-03-30     b
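
The lookup table `df2` above covers only 2018. As a hedged sketch (not part of the original answer), the same table could be derived from the range of the data itself, so that other years are handled too. The variable names `first_month`, `last_month`, `month_start` and `month_end` are illustrative:

```r
df <- data.frame(group = c("a", "a", "b"),
                 start = as.Date(c("2018-01-01", "2018-09-01", "2018-02-01")),
                 end = as.Date(c("2018-02-15", "2018-12-31", "2018-03-30")))

# first day of the earliest and of the latest month present in the data
first_month <- as.Date(format(min(df$start), "%Y-%m-01"))
last_month <- as.Date(format(max(df$end), "%Y-%m-01"))

# one row per month between those two bounds
month_start <- seq(first_month, last_month, by = "month")
# last day of a month = first day of the next month minus one day
month_end <- seq(first_month, by = "month",
                 length.out = length(month_start) + 1)[-1] - 1

df2 <- data.frame(start = month_start, end = month_end)
```

The resulting `df2` has the same `start` and `end` columns as the hard-coded version, so it should be usable in the foverlaps call unchanged.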

Benchmarks

Compared with @AntoniosK's tidyverse solution (which works just as well and is more readable ;-)), foverlaps runs in about half the time.

# Unit: milliseconds
# expr       min       lq      mean    median        uq       max neval
# tidyverse 10.418585 10.79064 12.531207 11.080309 11.753030 93.110804   100
# foverlaps  5.320911  5.59506  5.861865  5.846766  6.009146  9.606981   100

- Wimpel

Thanks, but at a quick glance the output doesn't seem to match the expected one? - arg0naut91

1

@arg0naut Oops... a typo on my part... fixed now. Please see the updated answer. - Wimpel

1

Here is another possible data.table approach:

library(data.table)
setDT(df)

#to create a data.table of monthly periods
earliest <- as.POSIXlt(df[,min(start)]) 
earliest$mday <- 1L
earliest <- as.Date(earliest)

latest <- as.POSIXlt(df[,max(end)])
latest$mday <- 1L
latest <- seq(as.Date(latest), by="1 month", length.out=2L)[2L]

startOfMonths <- seq(earliest, latest, by="1 month")
monthsDT <- data.table(
    som=startOfMonths[-length(startOfMonths)],
    eom=startOfMonths[-1L] - 1L)

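#perform non-equi join where som (start of month) falls within start and end
#note: this matches on som only, so it relies on every start being the first day of a month, as in the example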
ans <- monthsDT[df, .(group, start, som=x.som, end, eom=x.eom), 
    by=.EACHI, on=.(som>=start, som<=end)][, -(1L:2L)]

#get desired output according to OP's requirement
ans[, .(group, start=max(start, som), end=min(end, eom)), by=seq_len(ans[,.N])][, -1L]

Output:

   group      start        end
1:     a 2018-01-01 2018-01-31
2:     a 2018-02-01 2018-02-15
3:     a 2018-09-01 2018-09-30
4:     a 2018-10-01 2018-10-31
5:     a 2018-11-01 2018-11-30
6:     a 2018-12-01 2018-12-31
7:     b 2018-02-01 2018-02-28
8:     b 2018-03-01 2018-03-30

- chinsoon12

- AntoniosK · Accepted Answer

df <- data.frame(group = c("a", "a", "b"),
                 start = as.Date(c("2018-01-01", "2018-09-01", "2018-02-01")),
                 end = as.Date(c("2018-02-15", "2018-12-31", "2018-03-30")))

library(tidyverse)
library(lubridate)

df %>%
  group_by(id = row_number()) %>%             # for each row
  mutate(seq = list(seq(start, end, "day")),  # create a sequence of dates with 1 day step
         month = map(seq, month)) %>%         # get the month for each one of those dates in sequence
  unnest(c(seq, month)) %>%                   # unnest both list columns (tidyr >= 1.0 expects them to be named)
  group_by(group, id, month) %>%              # for each group, row and month
  summarise(start = min(seq),                 # get minimum date as start
            end = max(seq)) %>%               # get maximum date as end
  ungroup() %>%                               # ungroup
  select(-id, -month)                         # remove unnecessary columns

# # A tibble: 8 x 3
#   group start      end       
#  <fct> <date>     <date>    
# 1 a     2018-01-01 2018-01-31
# 2 a     2018-02-01 2018-02-15
# 3 a     2018-09-01 2018-09-30
# 4 a     2018-10-01 2018-10-31
# 5 a     2018-11-01 2018-11-30
# 6 a     2018-12-01 2018-12-31
# 7 b     2018-02-01 2018-02-28
# 8 b     2018-03-01 2018-03-30
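
One caveat, offered as an observation rather than something stated in the answer: grouping by the month number alone treats the same calendar month in different years as one group, so a single row that spans more than a year would have those months merged. A small base R sketch of the difference, using made-up dates:

```r
# illustrative dates: two Decembers in different years
d <- as.Date(c("2018-12-15", "2019-01-10", "2019-12-20"))

# month number only: both Decembers get the same key
month_only <- as.integer(format(d, "%m"))
month_only
# 12  1 12

# year plus month: each calendar month gets its own key
year_month <- format(d, "%Y-%m")
year_month
# "2018-12" "2019-01" "2019-12"
```

Under that assumption, grouping by a year-month key (or by year and month together) would likely be the safer choice for ranges that cross a year boundary.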