Aggregating daily data to weekly data in R

4

I have a large dataset similar to the reproducible sample data below.

   Interval    value
1  2012-06-10   552
2  2012-06-11  4850
3  2012-06-12  4642
4  2012-06-13  4132
5  2012-06-14  4190
6  2012-06-15  4186
7  2012-06-16  1139
8  2012-06-17   490
9  2012-06-18  5156
10 2012-06-19  4430
11 2012-06-20  4447
12 2012-06-21  4256
13 2012-06-22  3856
14 2012-06-23  1163
15 2012-06-24   564
16 2012-06-25  4866
17 2012-06-26  4421
18 2012-06-27  4206
19 2012-06-28  4272
20 2012-06-29  3993
21 2012-06-30  1211
22 2012-07-01   698
23 2012-07-02  5770
24 2012-07-03  5103
25 2012-07-04   775
26 2012-07-05  5140
27 2012-07-06  4868
28 2012-07-07  1225
29 2012-07-08   671
30 2012-07-09  5726
31 2012-07-10  5176

I would like to aggregate this data to a weekly level to get output similar to the following:
   Interval           value
1  Week 2, June 2012  *aggregate value for day 10 to day 14 of June 2012*
2  Week 3, June 2012  *aggregate value for day 15 to day 21 of June 2012*
3  Week 4, June 2012  *aggregate value for day 22 to day 28 of June 2012*
4  Week 5, June 2012  *aggregate value for day 29 to day 30 of June 2012*
5  Week 1, July 2012  *aggregate value for day 1 to day 7 of July 2012*
6  Week 2, July 2012  *aggregate value for day 8 to day 10 of July 2012*

How can I do this easily without writing lengthy code?

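You used the [xts] tag, but it does not look like you have an xts object. You are right, though, that xts is probably the easiest way. Have you searched? Take a look at to.weekly, apply.weekly, and period.apply, and search Stack Overflow for the same. - GSee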
1
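It is worth noting that the OP wants to aggregate by week of the month, not by week of the year. Most of the answers below are based on week of the year. - Uwe
6 Answers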

17

If you mean the sum of "value" by week, I think the easiest way is to convert the data into an xts object, as GSee suggested:

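library(xts)
# the date column is named "Interval" (capital I) in the sample data
data <- as.xts(data$value, order.by = as.Date(data$Interval))
# sum the values within each calendar week
weekly <- apply.weekly(data, sum)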

            [,1]
2012-06-10   552
2012-06-17 23629
2012-06-24 23872
2012-07-01 23667
2012-07-08 23552
2012-07-10 10902

I leave the formatting of the output as an exercise for you :-)


How do I now switch to a ts() object so I can use the forecast and decompose functions? - gmeroni
Use the "as" method: as.ts(data) - hvollmeier

6

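I just came across this old question because it was used as a duplicate target.

Unfortunately, all upvoted answers (except the one by konvas and a now deleted one) present solutions for aggregating the data by week of the year, while the OP asked to aggregate by week of the month.

The definitions of week of the year and week of the month are ambiguous, as discussed here, here, and here.

However, the OP has indicated that he wants days 1 to 7 of each month counted as week 1 of that month, days 8 to 14 as week 2, and so on. Note that week 5 is shorter in most months, with only 2 or 3 days (except for February in a non-leap year, which has no week 5).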
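The rule described above boils down to integer division of the day of the month. As a minimal base R sketch (the helper name `week_of_month` is made up for illustration and is not part of any package):

```r
# Hypothetical helper: week of the month under the rule
# "days 1-7 are week 1, days 8-14 are week 2, and so on".
week_of_month <- function(date) {
  day <- as.integer(format(date, "%d"))  # day of the month, 1..31
  (day - 1L) %/% 7L + 1L
}

dates <- as.Date(c("2012-06-01", "2012-06-07", "2012-06-08",
                   "2012-06-14", "2012-06-15", "2012-06-29",
                   "2012-06-30"))
weeks <- week_of_month(dates)
print(data.frame(date = dates, week = weeks))
```

Subtracting 1 before dividing is what keeps day 7 in week 1 and day 14 in week 2.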

With that in mind, here is a data.table solution for this kind of aggregation:

library(data.table)
DT[, .(value = sum(value)), 
       by = .(Interval = sprintf("Week %i, %s", 
                                 (mday(Interval) - 1L) %/% 7L + 1L, 
                                 format(Interval, "%b %Y")))]
           Interval value
1: Week 2, Jun 2012 18366
2: Week 3, Jun 2012 24104
3: Week 4, Jun 2012 23348
4: Week 5, Jun 2012  5204
5: Week 1, Jul 2012 23579
6: Week 2, Jul 2012 11573
To verify that we have picked the correct intervals, we can also print the date range of each group:
DT[, .(value = sum(value),
       date_range = toString(range(Interval))), 
   by = .(Week = sprintf("Week %i, %s", 
                             (mday(Interval) -1L) %/% 7L + 1L, 
                             format(Interval, "%b %Y")))]
               Week value             date_range
1: Week 2, Jun 2012 18366 2012-06-10, 2012-06-14
2: Week 3, Jun 2012 24104 2012-06-15, 2012-06-21
3: Week 4, Jun 2012 23348 2012-06-22, 2012-06-28
4: Week 5, Jun 2012  5204 2012-06-29, 2012-06-30
5: Week 1, Jul 2012 23579 2012-07-01, 2012-07-07
6: Week 2, Jul 2012 11573 2012-07-08, 2012-07-10

which is in line with OP's specification.

Data

library(data.table)
DT <- fread(
  "rn   Interval    value
  1  2012-06-10   552
  2  2012-06-11  4850
  3  2012-06-12  4642
  4  2012-06-13  4132
  5  2012-06-14  4190
  6  2012-06-15  4186
  7  2012-06-16  1139
  8  2012-06-17   490
  9  2012-06-18  5156
  10 2012-06-19  4430
  11 2012-06-20  4447
  12 2012-06-21  4256
  13 2012-06-22  3856
  14 2012-06-23  1163
  15 2012-06-24   564
  16 2012-06-25  4866
  17 2012-06-26  4421
  18 2012-06-27  4206
  19 2012-06-28  4272
  20 2012-06-29  3993
  21 2012-06-30  1211
  22 2012-07-01   698
  23 2012-07-02  5770
  24 2012-07-03  5103
  25 2012-07-04   775
  26 2012-07-05  5140
  27 2012-07-06  4868
  28 2012-07-07  1225
  29 2012-07-08   671
  30 2012-07-09  5726
  31 2012-07-10  5176", drop = 1L)
DT[, Interval := as.Date(Interval)]

5

If you use week from lubridate, you get only five weeks to pass to by. Assume dat is your data.

> library(lubridate)
> do.call(rbind, by(dat$value, week(dat$Interval), summary))
#    Min. 1st Qu. Median Mean 3rd Qu. Max.
# 24  552    4146   4188 3759    4529 4850
# 25  490    2498   4256 3396    4438 5156
# 26  564    2578   4206 3355    4346 4866
# 27  698     993   4868 3366    5122 5770
# 28  671    1086   3200 3200    5314 5726

This shows summaries for weeks 24 through 28 of the year. Similarly, we can use aggregate to get the means:
> aggregate(value~week(Interval), data = dat, mean)
#   week(Interval)    value
# 1             24 3758.667
# 2             25 3396.286
# 3             26 3355.000
# 4             27 3366.429
# 5             28 3199.500
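One caveat when grouping by week number alone: if the data spans more than one year, the same week number from different years ends up in one group. A minimal base R sketch of grouping by year and week together, on made-up toy data, and assuming the week is defined as the number of complete seven-day periods since January 1 plus one (the expression below is a stand-in for that definition, not a call into lubridate):

```r
# Toy data (invented for illustration): two days in each of two years,
# all falling into the same week number.
dat <- data.frame(
  Interval = as.Date(c("2011-06-11", "2011-06-12",
                       "2012-06-10", "2012-06-11")),
  value    = c(1, 2, 10, 20)
)

dat$year <- as.integer(format(dat$Interval, "%Y"))
# POSIXlt$yday is zero-based, so this counts complete 7-day periods
# since January 1 and adds one.
dat$week <- as.POSIXlt(dat$Interval)$yday %/% 7L + 1L

by_week      <- aggregate(value ~ week, data = dat, FUN = sum)
by_year_week <- aggregate(value ~ year + week, data = dat, FUN = sum)
print(by_week)       # one row: both years merged
print(by_year_week)  # two rows: one per year
```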

4
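If you are working with a data frame, this is easy to do with the tidyquant package. Use the tq_transmute function, which applies a mutation and returns a new data frame. Select the "value" column and apply the xts function apply.weekly. The additional argument FUN = sum aggregates by week.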
library(tidyquant)

df
#> # A tibble: 31 x 2
#>      Interval value
#>        <date> <int>
#>  1 2012-06-10   552
#>  2 2012-06-11  4850
#>  3 2012-06-12  4642
#>  4 2012-06-13  4132
#>  5 2012-06-14  4190
#>  6 2012-06-15  4186
#>  7 2012-06-16  1139
#>  8 2012-06-17   490
#>  9 2012-06-18  5156
#> 10 2012-06-19  4430
#> # ... with 21 more rows

df %>%
    tq_transmute(select     = value,
                 mutate_fun = apply.weekly,
                 FUN        = sum)
#> # A tibble: 6 x 2
#>     Interval value
#>       <date> <int>
#> 1 2012-06-10   552
#> 2 2012-06-17 23629
#> 3 2012-06-24 23872
#> 4 2012-07-01 23667
#> 5 2012-07-08 23552
#> 6 2012-07-10 10902

0

When you say "aggregate" the values, do you mean summing them? Assuming your data frame is d and d$Interval is of class Date, you can try the following:

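# if d$Interval is not of class Date, first run: d$Interval <- as.Date(d$Interval)
# label each date with its week of the month (days 1-7 are week 1, 8-14 week 2, ...)
formatdate <- function(date)
    paste0("Week ", (as.numeric(format(date, "%d")) - 1) %/% 7 + 1,
        ", ", format(date, "%b %Y"))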
# change "sum" to your required function
aggregate(d$value, by = list(formatdate(d$Interval)), sum)
#            Group.1     x
# 1 Week 1, Jul 2012 23579
# 2 Week 2, Jul 2012 11573
# 3 Week 2, Jun 2012 18366
# 4 Week 3, Jun 2012 24104
# 5 Week 4, Jun 2012 23348
# 6 Week 5, Jun 2012  5204

0
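A simple, general way to aggregate over an arbitrary time unit is to compute the difference between each time and the minimum time, truncate it, and add it back to the minimum time.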
library(lubridate)
index <- min(data$Interval)

units <- "weeks"
num <- 1

unit_duration <- do.call(duration, list(units))

time_diff <- (interval(index, data$Interval) / unit_duration) / num

index + do.call(duration, setNames(list(trunc(time_diff) * num), units))
 [1] "2012-06-10" "2012-06-10" "2012-06-10" "2012-06-10" "2012-06-10" "2012-06-10" "2012-06-10" "2012-06-17" "2012-06-17"
[10] "2012-06-17" "2012-06-17" "2012-06-17" "2012-06-17" "2012-06-17" "2012-06-24" "2012-06-24" "2012-06-24" "2012-06-24"
[19] "2012-06-24" "2012-06-24" "2012-06-24" "2012-07-01" "2012-07-01" "2012-07-01" "2012-07-01" "2012-07-01" "2012-07-01"
[28] "2012-07-01" "2012-07-08" "2012-07-08" "2012-07-08"

For whole weeks, you can apply lubridate::floor_date() to the index:

index <- lubridate::floor_date(index, unit = "week", week_start = 1)
time_diff <- (interval(index, data$Interval) / unit_duration) / num

time_agg <- index + do.call(duration, setNames(list(trunc(time_diff) * num), units))
time_agg
 [1] "2012-06-04" "2012-06-11" "2012-06-11" "2012-06-11" "2012-06-11" "2012-06-11" "2012-06-11" "2012-06-11" "2012-06-18"
[10] "2012-06-18" "2012-06-18" "2012-06-18" "2012-06-18" "2012-06-18" "2012-06-18" "2012-06-25" "2012-06-25" "2012-06-25"
[19] "2012-06-25" "2012-06-25" "2012-06-25" "2012-06-25" "2012-07-02" "2012-07-02" "2012-07-02" "2012-07-02" "2012-07-02"
[28] "2012-07-02" "2012-07-02" "2012-07-09" "2012-07-09"

You can then use that vector as a grouping variable to compute any summary.

library(dplyr)
data %>%
  mutate(time_agg) %>%
  group_by(time_agg) %>%
  summarise(sum = sum(value))
# A tibble: 6 x 2
  time_agg     sum
  <date>     <int>
1 2012-06-04   552
2 2012-06-11 23629
3 2012-06-18 23872
4 2012-06-25 23667
5 2012-07-02 23552
6 2012-07-09 10902
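For comparison, a similar Monday-based grouping can be sketched in base R alone with cut() on the dates. This is an alternative to the lubridate approach above, not a reproduction of it. The data frame name `d` is an assumption, and only the first nine rows of the sample data are used:

```r
# First nine rows of the sample data from the question.
d <- data.frame(
  Interval = as.Date("2012-06-10") + 0:8,
  value    = c(552, 4850, 4642, 4132, 4190, 4186, 1139, 490, 5156)
)

# cut() on a Date vector with breaks = "week" labels each day with the
# first day of its week; start.on.monday = TRUE makes weeks run Mon-Sun.
d$week_start <- as.Date(cut(d$Interval, breaks = "week",
                            start.on.monday = TRUE))

weekly <- aggregate(value ~ week_start, data = d, FUN = sum)
print(weekly)
```

Setting start.on.monday = FALSE would start the weeks on Sunday instead.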
