I have a dataset that looks like this:
ID  FromDate    ToDate      SiteID  Cost
1   8/12/2014   8/31/2014   12      245.98
1   9/1/2014    9/7/2014    12      269.35
1   10/10/2014  10/17/2014  12      209.98
1   11/22/2014  11/30/2014  12      309.12
1   12/1/2014   12/11/2014  12      202.14
2   8/16/2014   8/21/2014   12      109.35
2   8/22/2014   8/24/2014   14      44.12
2   9/25/2014   9/29/2014   12      98.75
3   9/15/2014   9/30/2014   23      536.27
3   10/1/2014   10/31/2014  12      529.87
3   11/1/2014   11/30/2014  12      969.55
3   12/1/2014   12/12/2014  12      607.35
What I would like this to look like is:
ID  FromDate    ToDate      SiteID  Cost
1   8/12/2014   9/7/2014    12      515.33
1   10/10/2014  10/17/2014  12      209.98
1   11/22/2014  12/11/2014  12      511.26
2   8/16/2014   8/21/2014   12      109.35
2   8/22/2014   8/24/2014   14      44.12
2   9/25/2014   9/29/2014   12      98.75
3   9/15/2014   9/30/2014   23      536.27
3   10/1/2014   12/12/2014  12      2106.77
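As you can see, where the date ranges are continuous (one row's FromDate is the day after the previous row's ToDate), the ranges are merged and the cost is summed within ID and SiteID. To spell out the complication: if a date range continues from the previous one but the SiteID changes, it stays a separate row. If the date range does not continue from the previous one, it is also a separate row. How can I do this in R? Also, I have more than 100,000 distinct IDs, so which approach or package would be the most efficient?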
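One possible way to express the rule described above, as a minimal base R sketch rather than a definitive or benchmarked answer: sort by ID and FromDate, flag every row that does not continue the previous row, and turn the running count of those flags into a run identifier. It assumes the date columns are parsed as Date, that periods within one ID do not overlap, and that "continuous" means FromDate is exactly one day after the previous ToDate. The names `continues`, `new_run`, `run`, `last_in_run` and `merged` are made up for illustration.

```r
# Sample data copied from the question.
df <- data.frame(
  ID = c(1, 1, 1, 1, 1, 2, 2, 2, 3, 3, 3, 3),
  FromDate = as.Date(c("8/12/2014", "9/1/2014", "10/10/2014", "11/22/2014",
                       "12/1/2014", "8/16/2014", "8/22/2014", "9/25/2014",
                       "9/15/2014", "10/1/2014", "11/1/2014", "12/1/2014"),
                     format = "%m/%d/%Y"),
  ToDate = as.Date(c("8/31/2014", "9/7/2014", "10/17/2014", "11/30/2014",
                     "12/11/2014", "8/21/2014", "8/24/2014", "9/29/2014",
                     "9/30/2014", "10/31/2014", "11/30/2014", "12/12/2014"),
                   format = "%m/%d/%Y"),
  SiteID = c(12, 12, 12, 12, 12, 12, 14, 12, 23, 12, 12, 12),
  Cost = c(245.98, 269.35, 209.98, 309.12, 202.14, 109.35, 44.12, 98.75,
           536.27, 529.87, 969.55, 607.35)
)

# Sort so that each row's predecessor is the previous period of the same ID.
df <- df[order(df$ID, df$FromDate), ]
n <- nrow(df)

# A row continues the previous one only if ID and SiteID match and it starts
# exactly one day after the previous row ends (assumed meaning of "continuous").
continues <- df$ID[-1] == df$ID[-n] &
  df$SiteID[-1] == df$SiteID[-n] &
  df$FromDate[-1] == df$ToDate[-n] + 1

# The first row always opens a run; every non-continuing row opens a new one.
new_run <- c(TRUE, !continues)
run <- cumsum(new_run)

# The last row of a run is the one just before the next run starts.
last_in_run <- c(new_run[-1], TRUE)

# Collapse each run: first FromDate, last ToDate, summed Cost.
merged <- data.frame(
  ID = df$ID[new_run],
  FromDate = df$FromDate[new_run],
  ToDate = df$ToDate[last_in_run],
  SiteID = df$SiteID[new_run],
  Cost = as.numeric(rowsum(df$Cost, run))
)
print(merged)
```

Everything here is vectorised over the whole table, so no per-ID loop is needed. The same flag-then-`cumsum` idea should carry over to grouped `dplyr` or `data.table` code if one of those packages is preferred, though which is fastest on 100,000+ IDs would need to be measured.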
df %>% mutate(crit = FromDate - lag(ToDate, default = 1) == 1, gr = cumsum(crit == FALSE)) %>% group_by(gr, ID, SiteID) %>% summarise(cost = sum(Cost), FromDate = min(FromDate), ToDate = max(ToDate)) - JasonAizkalns
If you group by ID, the ID column will be retained. Please see the updated post. - Khashaa