How to average over overlapping time periods?

3

I recently posted a similar question here, but that one was a bit oversimplified. So here we go again:

Suppose I have a data frame (dput output below) containing time series data for a number of variables (5 in this example, many more in the real data):

                  date  a  b  c  d  e
1  2009-10-01 00:00:00 10 20 30 40 50
2  2009-10-01 01:00:00 11 21 31 41 51
3  2009-10-01 02:00:00 12 22 32 42 52
4  2009-10-01 03:00:00 13 23 33 43 53
5  2009-10-01 04:00:00 14 24 34 44 54
6  2009-10-01 05:00:00 15 25 35 45 55
7  2009-10-01 06:00:00 16 26 36 46 56
8  2009-10-01 07:00:00 17 27 37 47 57
9  2009-10-01 08:00:00 18 28 38 48 58
10 2009-10-01 09:00:00 19 29 39 49 59
11 2009-10-01 10:00:00 20 30 40 50 60
12 2009-10-01 11:00:00 21 31 41 51 61
13 2009-10-01 12:00:00 22 32 42 52 62
14 2009-10-01 13:00:00 23 33 43 53 63
15 2009-10-01 14:00:00 24 34 44 54 64
16 2009-10-01 15:00:00 25 35 45 55 65
17 2009-10-01 16:00:00 26 36 46 56 66
18 2009-10-01 17:00:00 27 37 47 57 67
19 2009-10-01 18:00:00 28 38 48 58 68
20 2009-10-01 19:00:00 29 39 49 59 69
21 2009-10-01 20:00:00 30 40 50 60 70
22 2009-10-01 21:00:00 31 41 51 61 71
23 2009-10-01 22:00:00 32 42 52 62 72
24 2009-10-01 23:00:00 33 43 53 63 73
25 2009-10-02 00:00:00 34 44 54 64 74

There is also a second data frame, "events", which defines different time periods by start and stop dates (3 here, possibly more in the real data):

   id               start                stop
1 AGH 2009-10-01 02:00:00 2009-10-01 04:00:00
2 TRG 2009-10-01 03:00:00 2009-10-01 10:00:00
3 ZUH 2009-10-01 03:00:00 2009-10-01 20:00:00

I would like to get a table of the variable means for each event, like this:
   id avg(y.a) avg(y.b) avg(y.c) avg(y.d) avg(y.e)
1 AGH     13.0     23.0     33.0     43.0     53.0
2 TRG     16.5     26.5     36.5     46.5     56.5
3 ZUH     21.5     31.5     41.5     51.5     61.5

From my previous post I learned that I can do this with the sqldf package and a fairly simple SQL statement:

library(sqldf)

# Average every value column over the rows whose date falls inside each event
means <- sqldf("
    SELECT x.id, avg(y.a), avg(y.b), avg(y.c), avg(y.d), avg(y.e)
    FROM events AS x, data AS y
    WHERE y.date BETWEEN x.start AND x.stop
    GROUP BY x.id
")

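However, since the real data contain many columns to be averaged, and the columns are named differently in the various files I work with, typing all column names into the SQL statement is rather tedious. I would therefore prefer a solution in R in which I can simply refer to the columns by number (data[2:100]). The difficulty is that the time periods are non-contiguous and overlapping, and that the ids are character strings. Any ideas on how to achieve this would be greatly appreciated!

dput(data)
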
structure(list(date = structure(c(1254348000, 1254351600, 1254355200, 
1254358800, 1254362400, 1254366000, 1254369600, 1254373200, 1254376800, 
1254380400, 1254384000, 1254387600, 1254391200, 1254394800, 1254398400, 
1254402000, 1254405600, 1254409200, 1254412800, 1254416400, 1254420000, 
1254423600, 1254427200, 1254430800, 1254434400), class = c("POSIXct", 
"POSIXt"), tzone = "Europe/Berlin"), a = 10:34, b = 20:44, c = 30:54, 
    d = 40:64, e = 50:74), .Names = c("date", "a", "b", "c", 
"d", "e"), row.names = c(NA, -25L), class = "data.frame")

dput(events)

structure(list(id = structure(1:3, .Label = c("AGH", "TRG", "ZUH"
), class = "factor"), start = structure(c(1254355200, 1254358800, 
1254358800), class = c("POSIXct", "POSIXt"), tzone = "Europe/Berlin"), 
    stop = structure(c(1254362400, 1254384000, 1254420000), class = c("POSIXct", 
    "POSIXt"), tzone = "Europe/Berlin")), .Names = c("id", "start", 
"stop"), row.names = c(NA, -3L), class = "data.frame")
2 Answers

2
  1. The basic problem is due to the fact that the data is not normalized; however, short of putting it into long form we could dynamically generate the sql statement:

    library(sqldf)
    sql <- paste("select id, ", 
        toString(sprintf("avg(y.%s)", names(data)[-1])),
        "from events as x, data as y
        where y.date between x.start and x.stop
        group by x.id")
    sqldf(sql)
    
  2. As an alternative, we show the use of melt in the reshape2 package to convert the data to long form, data_long, process it to give means_long, and convert it back to wide form using dcast:

    library(reshape2)
    data_long <- melt(data, id.vars = "date")
    means_long <- sqldf("
         SELECT x.id, y.variable, avg(value)
         FROM events as x, data_long as y 
         WHERE y.date between x.start and x.stop 
         GROUP BY x.id, y.variable
    ")
    means <- dcast(id ~ variable, data = means_long, value.var = "avg(value)")
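
To make the first approach more concrete, here is a small sketch of the string that the paste/sprintf/toString combination builds. The vector `cols` is only a stand-in for `names(data)[-1]`, chosen so that the snippet runs without sqldf or the example data frames being loaded.

```r
# Stand-in for names(data)[-1]; an assumption so the snippet is self-contained
cols <- c("a", "b", "c", "d", "e")

# sprintf is vectorised: it returns one "avg(y.<col>)" term per column name
avg_terms <- sprintf("avg(y.%s)", cols)

# toString collapses the terms into a single comma-separated string
select_list <- toString(avg_terms)

# paste joins the pieces with single spaces to form the full statement
sql <- paste("select id,", select_list,
             "from events as x, data as y",
             "where y.date between x.start and x.stop",
             "group by x.id")
cat(sql, "\n")
```

Because the column list is generated from the names, the same code should work unchanged for files whose columns are named differently, as long as the date column comes first.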
    

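I don't really understand why the data should be normalized, but sprintf is an interesting function. Thanks! - cin
Have added an example that converts the data to long (normalized) form, processes it in that form, and then converts it back to wide form. - G. Grothendieck
@cin: I think the sqldf function is far more interesting than plain old sprintf. (G. Grothendieck, a much more capable R programmer than I am, wrote that package ... which needs to be loaded.) - IRTFM
@G.Grothendieck: I see now. That is another nice approach. I am just beginning to discover the power of R. Thanks! - cin
@DWin: I already knew about the sqldf package, and I agree that it is very nice. But since I am not a programmer, I guess there are plenty of other interesting "ordinary" functions waiting for me to discover. - cin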

1
t( sapply(events$id , function(id) lapply (
            data[ data[["date"]] >= events[ events[['id']]==id, 'start'] & 
                  data[["date"]] <= events[ events[['id']]==id, 'stop' ] ,  -1 ], 
            mean) ) )
     a    b    c    d    e   
[1,] 13   23   33   43   53  
[2,] 16.5 26.5 36.5 46.5 56.5
[3,] 21.5 31.5 41.5 51.5 61.5
#  Or if you prefer:
t( sapply(events$id , function(id) data.frame( 
                       id=as.character(id), 
                       lapply (data[ data[["date"]] >= events[events[['id']]==id, 'start'] &  
                                     data[["date"]] <= events[ events[['id']]==id, 'stop' ] , -1 ],
                               mean) ,stringsAsFactors=FALSE) ) )
     id    a    b    c    d    e   
[1,] "AGH" 13   23   33   43   53  
[2,] "TRG" 16.5 26.5 36.5 46.5 56.5
[3,] "ZUH" 21.5 31.5 41.5 51.5 61.5

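Thanks for the quick reply! It looks promising, but I need more time to try it out and fully understand it... Would it be possible to have the id names in the first column, as in my result table? - cin
@cin Convert the output to a data.frame, that's it. - Roman Luštrik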
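
Following up on that comment, here is a minimal base R sketch that returns a data frame with the id in the first column and selects the value columns by position. It rebuilds the example data inline, using UTC instead of the original Europe/Berlin time zone (an assumption that does not change the means), and the helper name `event_means` is made up for illustration.

```r
# Example data rebuilt inline; UTC is an assumption (the original uses Europe/Berlin)
t0 <- as.POSIXct("2009-10-01 00:00:00", tz = "UTC")
data <- data.frame(date = t0 + 3600 * (0:24),
                   a = 10:34, b = 20:44, c = 30:54, d = 40:64, e = 50:74)
events <- data.frame(id    = c("AGH", "TRG", "ZUH"),
                     start = t0 + 3600 * c(2, 3, 3),
                     stop  = t0 + 3600 * c(4, 10, 20),
                     stringsAsFactors = FALSE)

# Hypothetical helper: means of the columns `cols` within each event window
event_means <- function(data, events, cols = 2:ncol(data)) {
  rows <- lapply(seq_len(nrow(events)), function(i) {
    # Rows whose date lies in the closed interval [start, stop] of event i
    in_window <- data$date >= events$start[i] & data$date <= events$stop[i]
    colMeans(data[in_window, cols, drop = FALSE])
  })
  # Stack the per-event mean vectors and put the id in front
  data.frame(id = as.character(events$id), do.call(rbind, rows),
             stringsAsFactors = FALSE)
}

means <- event_means(data, events)
print(means)
```

Looping over row indices rather than over the ids keeps overlapping windows independent of each other and also works if two events happen to share an id. For the example data, column a of the result comes out as 13, 16.5 and 21.5, matching the table in the question.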
