Find, by group, the first row at or after each row's time that meets a condition

3

I have a data.frame with 5 variables: day (Date, format "YYYY-MM-DD"), hour (POSIXct, format "YYYY-MM-DD hh:mm:ss"), group (chr), measure_start (numeric) and measure_end (numeric).

df <- structure(list(
  day = structure(c(18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116), class = "Date"), 
  hour = structure(c(1565275500, 1565276400, 1565277300, 1565278200, 1565279100, 1565280000, 1565280900, 1565281800, 1565282700, 1565275500, 1565276400, 1565277300, 1565278200, 1565279100, 1565280000, 1565280900, 1565281800, 1565282700), class = c("POSIXct", "POSIXt"), tzone = ""), 
  group = c("GROUP1", "GROUP1", "GROUP1", "GROUP1", "GROUP1", "GROUP1", "GROUP1", "GROUP1", "GROUP1", "GROUP2", "GROUP2", "GROUP2", "GROUP2", "GROUP2", "GROUP2", "GROUP2", "GROUP2", "GROUP2"), 
  measure_start = c(2, 3, 3, 2, 4, 5, 7, 8, 7, 15, 16, 32, 20, 21, 40, 15, 13, 22), 
  measure_end = c(3, 3, 3, 5, 4, 7, 7, 8, 7, 16, 15, 31, 20, 21, 42, 15, 13, 26)),
  row.names = c(NA, -18L), class = "data.frame")

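For each row i of the data.frame, I want to get the first row that satisfies the condition measure_end >= 2 * measure_start_i, but only among rows whose hour is greater than or equal to the hour of row i, and within the same (day, group).

In other words, for each observation [day_i, hour_i, group_i, measure_start_i, measure_end_i], I want the index of the first row where measure_end >= 2 * measure_start_i, restricted to rows with (day == day_i) & (group == group_i) & (hour >= hour_i). If no such row exists, the result should be NA.
For the example above, the expected output would be: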
          day                hour  group measure_start measure_end      row_with_me_2x_current_ms
1  2019-08-08 2019-08-08 11:45:00 GROUP1             2           3                              4
2  2019-08-08 2019-08-08 12:00:00 GROUP1             3           3                              6
3  2019-08-08 2019-08-08 12:15:00 GROUP1             3           3                              6
4  2019-08-08 2019-08-08 12:30:00 GROUP1             2           5                              4
5  2019-08-08 2019-08-08 12:45:00 GROUP1             4           4                              8
6  2019-08-08 2019-08-08 13:00:00 GROUP1             5           7                             NA
7  2019-08-08 2019-08-08 13:15:00 GROUP1             7           7                             NA
8  2019-08-08 2019-08-08 13:30:00 GROUP1             8           8                             NA
9  2019-08-08 2019-08-08 13:45:00 GROUP1             7           7                             NA
10 2019-08-08 2019-08-08 11:45:00 GROUP2            15          16                             12
11 2019-08-08 2019-08-08 12:00:00 GROUP2            16          15                             15
12 2019-08-08 2019-08-08 12:15:00 GROUP2            32          31                             NA
13 2019-08-08 2019-08-08 12:30:00 GROUP2            20          20                             15
14 2019-08-08 2019-08-08 12:45:00 GROUP2            21          21                             15
15 2019-08-08 2019-08-08 13:00:00 GROUP2            40          42                             NA
16 2019-08-08 2019-08-08 13:15:00 GROUP2            15          15                             NA
17 2019-08-08 2019-08-08 13:30:00 GROUP2            13          13                             18
18 2019-08-08 2019-08-08 13:45:00 GROUP2            22          26                             NA

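My data.frame is quite large, so I suspect a data.table approach would be best. However, I am not yet very familiar with data.table syntax. My attempt below did not give the result I need: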

dt = data.table(df)
dt[,row_with_me_2x_current_ms:= which.min(dt[,measure_end] / measure_start >= 2) ,by=.(day,group)]

Could you show the expected output? - akrun
The expected output is shown above. Thanks! - pabc
2 Answers

3

Here is another option, using a data.table non-equi join:

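library(data.table)

# rn = original row number, twice = threshold that measure_end must reach
setDT(df)[, c("rn", "twice") := .(.I, 2 * measure_start)]

# Self-join: for each row of .SD, take rn of the first row of df in the same
# (group, day) with hour >= that row's hour and measure_end >= its twice
df[, row_with_me_2x_current_ms := 
    df[.SD, on=.(group, day, hour>=hour, measure_end>=twice), mult="first", rn]
]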
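
To sanity-check the join on a small sample, the same rule can be written as a brute-force loop in base R. This is only a sketch: `first_match` is a hypothetical helper name (it is not part of data.table), the `toy` data frame is a copy of the GROUP1 rows from the question with the time zone fixed to UTC for reproducibility, and the loop is quadratic, so it is meant for verification rather than for a large table.

```r
# Hypothetical helper: for each row i, return the index of the first row (in
# row order) in the same (day, group) whose hour >= hour_i and whose
# measure_end >= 2 * measure_start_i; NA when no row qualifies.
first_match <- function(d) {
  vapply(seq_len(nrow(d)), function(i) {
    hit <- which(d$day == d$day[i] &
                 d$group == d$group[i] &
                 d$hour >= d$hour[i] &
                 d$measure_end >= 2 * d$measure_start[i])
    if (length(hit)) hit[1L] else NA_integer_
  }, integer(1))
}

# GROUP1 rows from the question, one observation every 15 minutes
toy <- data.frame(
  day = as.Date("2019-08-08"),
  hour = as.POSIXct("2019-08-08 11:45:00", tz = "UTC") + 900 * (0:8),
  group = "GROUP1",
  measure_start = c(2, 3, 3, 2, 4, 5, 7, 8, 7),
  measure_end   = c(3, 3, 3, 5, 4, 7, 7, 8, 7)
)

res <- first_match(toy)
print(res)
# 4 6 6 4 8 NA NA NA NA, matching rows 1 to 9 of the expected output
```

Like `mult="first"`, this picks the first qualifying row in row order, so it agrees with "earliest hour" only when the rows are already sorted by hour within each group.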

0
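If we want the first index of a logical vector, we can group by 'day' and 'group', loop over the sequence of 'measure_start' with lapply, extract each 'measure_start' value ('mst'), divide 'measure_end' by it, take the index of the first row at or after the current one that satisfies the condition, and assign it.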
library(data.table) 
dt[, row_with_me_2x_current_ms:= 
     unlist(lapply(seq_along(measure_start), function(i) {
        mst <- measure_start[i]
        i2 <- which((measure_end/mst) >=2)
        .I[i2[i2 >= i][1]]})),
   by = .(group, day)]
dt
#           day                hour  group measure_start measure_end row_with_me_2x_current_ms
# 1: 2019-08-08 2019-08-08 09:45:00 GROUP1             2           3                         4
# 2: 2019-08-08 2019-08-08 10:00:00 GROUP1             3           3                         6
# 3: 2019-08-08 2019-08-08 10:15:00 GROUP1             3           3                         6
# 4: 2019-08-08 2019-08-08 10:30:00 GROUP1             2           5                         4
# 5: 2019-08-08 2019-08-08 10:45:00 GROUP1             4           4                         8
# 6: 2019-08-08 2019-08-08 11:00:00 GROUP1             5           7                        NA
# 7: 2019-08-08 2019-08-08 11:15:00 GROUP1             7           7                        NA
# 8: 2019-08-08 2019-08-08 11:30:00 GROUP1             8           8                        NA
# 9: 2019-08-08 2019-08-08 11:45:00 GROUP1             7           7                        NA
#10: 2019-08-08 2019-08-08 09:45:00 GROUP2            15          16                        12
#11: 2019-08-08 2019-08-08 10:00:00 GROUP2            16          15                        15
#12: 2019-08-08 2019-08-08 10:15:00 GROUP2            32          31                        NA
#13: 2019-08-08 2019-08-08 10:30:00 GROUP2            20          20                        15
#14: 2019-08-08 2019-08-08 10:45:00 GROUP2            21          21                        15
#15: 2019-08-08 2019-08-08 11:00:00 GROUP2            40          42                        NA
#16: 2019-08-08 2019-08-08 11:15:00 GROUP2            15          15                        NA
#17: 2019-08-08 2019-08-08 11:30:00 GROUP2            13          13                        18
#18: 2019-08-08 2019-08-08 11:45:00 GROUP2            22          26                        NA
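
The positional filter `i2[i2 >= i][1]` stands in for `hour >= hour_i` only when the rows are already ordered by hour within each group, which the sample data happens to satisfy; that ordering is an assumption, not something the code checks. A minimal illustration of the indexing on the GROUP2 values from the question, in plain base R:

```r
measure_start <- c(15, 16, 32, 20, 21, 40, 15, 13, 22)
measure_end   <- c(16, 15, 31, 20, 21, 42, 15, 13, 26)

i  <- 2                                            # second row of the group
i2 <- which(measure_end / measure_start[i] >= 2)   # positions meeting the ratio
first_pos <- i2[i2 >= i][1]                        # first one at or after row i
print(first_pos)
# 6 (position within the group; .I maps it to row 15 of the full table)

# Indexing an empty selection with [1] yields NA, which is how "no match" arises
i  <- 3
i2 <- which(measure_end / measure_start[i] >= 2)
no_match <- i2[i2 >= i][1]
print(no_match)
# NA
```

If the table might not be sorted, calling `setorder(dt, group, day, hour)` first is one way to make the positional comparison safe.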

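Thanks for the suggestion, but what matters is that we find the first index satisfying the condition for each row (not just the first one overall). Please see the expected output above. - pabc
Hi @akrun. Sorry if my logic was unclear. The output column should return the first index of the rows, grouped by (day, group), whose "measure_end" value is at least twice the "measure_start" value of the current row. This should be computed for every row. For example: for row 1 we have measure_start == 2, so the returned value should be "4", the first row with measure_end >= 2 * 2. For row 2 we have measure_start == 3, so the returned value should be "6", the first row with measure_end >= 2 * 3; and so on... - pabc
@pabc i.e. 3/5 = 0.6, which is not >= 2. - akrun
The logic should be the other way around: which((dt$measure_end/dt$measure_start[1]) >=2) # 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18. So for this row, the code would return the first element: "4". - pabc
@pabc Am I missing some other condition for row 17? - akrun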