How to average over overlapping time periods?

Question

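I recently posted a similar question here, but that one was somewhat oversimplified. So here we go again:

Suppose I have a data frame (dput output below) containing time series data for a number of different variables (5 in this example, many more in the real data):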

                  date  a  b  c  d  e
1  2009-10-01 00:00:00 10 20 30 40 50
2  2009-10-01 01:00:00 11 21 31 41 51
3  2009-10-01 02:00:00 12 22 32 42 52
4  2009-10-01 03:00:00 13 23 33 43 53
5  2009-10-01 04:00:00 14 24 34 44 54
6  2009-10-01 05:00:00 15 25 35 45 55
7  2009-10-01 06:00:00 16 26 36 46 56
8  2009-10-01 07:00:00 17 27 37 47 57
9  2009-10-01 08:00:00 18 28 38 48 58
10 2009-10-01 09:00:00 19 29 39 49 59
11 2009-10-01 10:00:00 20 30 40 50 60
12 2009-10-01 11:00:00 21 31 41 51 61
13 2009-10-01 12:00:00 22 32 42 52 62
14 2009-10-01 13:00:00 23 33 43 53 63
15 2009-10-01 14:00:00 24 34 44 54 64
16 2009-10-01 15:00:00 25 35 45 55 65
17 2009-10-01 16:00:00 26 36 46 56 66
18 2009-10-01 17:00:00 27 37 47 57 67
19 2009-10-01 18:00:00 28 38 48 58 68
20 2009-10-01 19:00:00 29 39 49 59 69
21 2009-10-01 20:00:00 30 40 50 60 70
22 2009-10-01 21:00:00 31 41 51 61 71
23 2009-10-01 22:00:00 32 42 52 62 72
24 2009-10-01 23:00:00 33 43 53 63 73
25 2009-10-02 00:00:00 34 44 54 64 74

There is also a second data frame, "events", which defines different time periods by start and stop dates (3 here, possibly many more in the real data):

   id               start                stop
1 AGH 2009-10-01 02:00:00 2009-10-01 04:00:00
2 TRG 2009-10-01 03:00:00 2009-10-01 10:00:00
3 ZUH 2009-10-01 03:00:00 2009-10-01 20:00:00

I would like to get a table of the variables' means within each event, like this:

   id avg(y.a) avg(y.b) avg(y.c) avg(y.d) avg(y.e)
1 AGH     13.0     23.0     33.0     43.0     53.0
2 TRG     16.5     26.5     36.5     46.5     56.5
3 ZUH     21.5     31.5     41.5     51.5     61.5

I learned from my previous post that I can do this with the sqldf package and a fairly simple SQL statement:

library(sqldf)

means <- sqldf("
    SELECT x.id, avg(y.a), avg(y.b), avg(y.c), avg(y.d), avg(y.e)
    FROM events as x, data as y
    WHERE y.date between x.start and x.stop
    GROUP BY x.id
")

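However, the real data contains many columns to be averaged, and they are named differently in each of the files I work with, so typing all the column names into the SQL statement is rather tedious. I would therefore prefer a solution in R where I can simply refer to the columns by number (data[2:100]). The difficulty is that the time periods are discontinuous and overlapping, and that the ids are character strings. Any ideas on how to achieve this would be greatly appreciated!

dput(data):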

structure(list(date = structure(c(1254348000, 1254351600, 1254355200, 
1254358800, 1254362400, 1254366000, 1254369600, 1254373200, 1254376800, 
1254380400, 1254384000, 1254387600, 1254391200, 1254394800, 1254398400, 
1254402000, 1254405600, 1254409200, 1254412800, 1254416400, 1254420000, 
1254423600, 1254427200, 1254430800, 1254434400), class = c("POSIXct", 
"POSIXt"), tzone = "Europe/Berlin"), a = 10:34, b = 20:44, c = 30:54, 
    d = 40:64, e = 50:74), .Names = c("date", "a", "b", "c", 
"d", "e"), row.names = c(NA, -25L), class = "data.frame")

dput(events):

structure(list(id = structure(1:3, .Label = c("AGH", "TRG", "ZUH"
), class = "factor"), start = structure(c(1254355200, 1254358800, 
1254358800), class = c("POSIXct", "POSIXt"), tzone = "Europe/Berlin"), 
    stop = structure(c(1254362400, 1254384000, 1254420000), class = c("POSIXct", 
    "POSIXt"), tzone = "Europe/Berlin")), .Names = c("id", "start", 
"stop"), row.names = c(NA, -3L), class = "data.frame")

- cin

2 Answers

# For each event id, average every column except the first (date)
# over the rows whose date lies within [start, stop] of that event.
t(sapply(events$id, function(id) lapply(
    data[data[["date"]] >= events[events[["id"]] == id, "start"] &
         data[["date"]] <= events[events[["id"]] == id, "stop"], -1],
    mean)))
#      a    b    c    d    e   
# [1,] 13   23   33   43   53  
# [2,] 16.5 26.5 36.5 46.5 56.5
# [3,] 21.5 31.5 41.5 51.5 61.5

# Or, if you prefer to carry the id along as the first column:
t(sapply(events$id, function(id) data.frame(
    id = as.character(id),
    lapply(data[data[["date"]] >= events[events[["id"]] == id, "start"] &
                data[["date"]] <= events[events[["id"]] == id, "stop"], -1],
           mean),
    stringsAsFactors = FALSE)))
#      id    a    b    c    d    e   
# [1,] "AGH" 13   23   33   43   53  
# [2,] "TRG" 16.5 26.5 36.5 46.5 56.5
# [3,] "ZUH" 21.5 31.5 41.5 51.5 61.5

- IRTFM

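Thanks for the quick reply! It looks promising, but I need more time to try it out and fully understand it… Is it possible to have the id names in the first column, as in my result table? - cin

@cin Convert the output to a data.frame and that's it. - Roman Luštrik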
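
Following up on the comment above, here is a minimal base R sketch that returns a plain data.frame with the id in the first column and selects the value columns by position, as the question asks. The helper name `event_means` and its `cols` argument are made up for illustration and are not part of the answer; the sample data is rebuilt from the question so the block runs on its own.

```r
# Sample data rebuilt from the question (hourly timestamps, 5 value columns).
tz <- "Europe/Berlin"
data <- data.frame(
  date = seq(as.POSIXct("2009-10-01 00:00:00", tz = tz),
             by = "hour", length.out = 25),
  a = 10:34, b = 20:44, c = 30:54, d = 40:64, e = 50:74
)
events <- data.frame(
  id    = c("AGH", "TRG", "ZUH"),
  start = as.POSIXct(c("2009-10-01 02:00:00", "2009-10-01 03:00:00",
                       "2009-10-01 03:00:00"), tz = tz),
  stop  = as.POSIXct(c("2009-10-01 04:00:00", "2009-10-01 10:00:00",
                       "2009-10-01 20:00:00"), tz = tz),
  stringsAsFactors = FALSE
)

# Hypothetical helper: one row of column means per event.
# `cols` picks the value columns by position, e.g. 2:100 in the real data.
event_means <- function(data, events, cols = 2:ncol(data)) {
  rows <- lapply(seq_len(nrow(events)), function(i) {
    # Rows whose timestamp lies inside the i-th event (bounds inclusive).
    in_window <- data$date >= events$start[i] & data$date <= events$stop[i]
    colMeans(data[in_window, cols, drop = FALSE])
  })
  data.frame(id = as.character(events$id), do.call(rbind, rows),
             stringsAsFactors = FALSE)
}

means <- event_means(data, events)
print(means)
```

Because each event is handled independently, overlapping or discontinuous periods need no special treatment. An event that matches no rows yields NaN in every column.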

- G. Grothendieck · Accepted Answer

The basic problem is that the data is not normalized; however, short of putting it into long form, we could generate the SQL statement dynamically:

library(sqldf)
sql <- paste("select id, ", 
    toString(sprintf("avg(y.%s)", names(data)[-1])),
    "from events as x, data as y
    where y.date between x.start and x.stop
    group by x.id")
sqldf(sql)
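
To see what the generated statement looks like before handing it to sqldf, it can be built and printed on its own. This is a small sketch of the same idea with one variation that is not in the answer: each average gets an `as <name>` alias so the result columns keep the original names. The vector `cols` is a stand-in for `names(data)[-1]`.

```r
# Stand-in for names(data)[-1]; in practice take the names from the data frame.
cols <- c("a", "b", "c", "d", "e")

# One "avg(y.<col>) as <col>" term per value column, comma-separated.
select_list <- toString(sprintf("avg(y.%s) as %s", cols, cols))

sql <- paste("select x.id,", select_list,
             "from events as x, data as y",
             "where y.date between x.start and x.stop",
             "group by x.id")

# Inspect the statement; it could then be run with sqldf(sql).
cat(sql, "\n")
```

This assumes the column names are valid unquoted SQL identifiers; names containing spaces or reserved words would need quoting.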

As an alternative, we show the use of melt in the reshape2 package to convert the data to long form, data_long, process it to give means_long, and convert it back to wide form using dcast:

library(reshape2)
data_long <- melt(data, id.vars = "date")
means_long <- sqldf("
     SELECT x.id, y.variable, avg(value)
     FROM events as x, data_long as y 
     WHERE y.date between x.start and x.stop 
     GROUP BY x.id, y.variable
")
means <- dcast(id ~ variable, data = means_long, value.var = "avg(value)")
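
For comparison, the cross-join-and-filter logic that the SQL expresses can also be sketched in base R, without sqldf or reshape2. This is an illustration under the question's sample data, not part of the answer: `merge` with `by = NULL` builds the Cartesian product of events and data rows, which is then filtered to each event's window and averaged per id with `aggregate`.

```r
# Sample data rebuilt from the question.
tz <- "Europe/Berlin"
data <- data.frame(
  date = seq(as.POSIXct("2009-10-01 00:00:00", tz = tz),
             by = "hour", length.out = 25),
  a = 10:34, b = 20:44, c = 30:54, d = 40:64, e = 50:74
)
events <- data.frame(
  id    = c("AGH", "TRG", "ZUH"),
  start = as.POSIXct(c("2009-10-01 02:00:00", "2009-10-01 03:00:00",
                       "2009-10-01 03:00:00"), tz = tz),
  stop  = as.POSIXct(c("2009-10-01 04:00:00", "2009-10-01 10:00:00",
                       "2009-10-01 20:00:00"), tz = tz),
  stringsAsFactors = FALSE
)

# Cartesian product: every event paired with every data row.
joined <- merge(events, data, by = NULL)

# Keep only the pairs where the timestamp falls inside the event (inclusive).
joined <- joined[joined$date >= joined$start & joined$date <= joined$stop, ]

# Average all value columns per event id; columns are picked programmatically.
value_cols <- names(data)[-1]
means <- aggregate(joined[value_cols], by = list(id = joined$id), FUN = mean)
print(means)
```

The intermediate product has nrow(events) * nrow(data) rows, so this is only practical when that product fits in memory. Events that match no rows are dropped from the result rather than reported as NaN.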