I have a data.frame with 5 variables: day (Date, format "YYYY-MM-DD"), hour (POSIXct, format "YYYY-MM-DD hh:mm:ss"), group (chr), measure_start (numeric), and measure_end (numeric).
df <- structure(list(
day = structure(c(18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116, 18116), class = "Date"),
hour = structure(c(1565275500, 1565276400, 1565277300, 1565278200, 1565279100, 1565280000, 1565280900, 1565281800, 1565282700, 1565275500, 1565276400, 1565277300, 1565278200, 1565279100, 1565280000, 1565280900, 1565281800, 1565282700), class = c("POSIXct", "POSIXt"), tzone = ""),
group = c("GROUP1", "GROUP1", "GROUP1", "GROUP1", "GROUP1", "GROUP1", "GROUP1", "GROUP1", "GROUP1", "GROUP2", "GROUP2", "GROUP2", "GROUP2", "GROUP2", "GROUP2", "GROUP2", "GROUP2", "GROUP2"),
measure_start = c(2, 3, 3, 2, 4, 5, 7, 8, 7, 15, 16, 32, 20, 21, 40, 15, 13, 22),
measure_end = c(3, 3, 3, 5, 4, 7, 7, 8, 7, 16, 15, 31, 20, 21, 42, 15, 13, 26)),
row.names = c(NA, -18L), class = "data.frame")
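For each row i of the data.frame, I want to get the first row that satisfies measure_end >= 2 * measure_start_i, looking only at rows whose hour is greater than or equal to the hour of row i, and only within the same day and group.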
In other words, for each observation [day_i, hour_i, group_i, measure_start_i, measure_end_i], I want to obtain: which.min(measure_end >= 2 * measure_start_i | (day == day_i) & (group == group_i) & (hour >= hour_i)), reading | as "given". For the example above, the expected output is:
          day                hour  group measure_start measure_end row_with_me_2x_current_ms
1  2019-08-08 2019-08-08 11:45:00 GROUP1             2           3                         4
2  2019-08-08 2019-08-08 12:00:00 GROUP1             3           3                         6
3  2019-08-08 2019-08-08 12:15:00 GROUP1             3           3                         6
4  2019-08-08 2019-08-08 12:30:00 GROUP1             2           5                         4
5  2019-08-08 2019-08-08 12:45:00 GROUP1             4           4                         8
6  2019-08-08 2019-08-08 13:00:00 GROUP1             5           7                        NA
7  2019-08-08 2019-08-08 13:15:00 GROUP1             7           7                        NA
8  2019-08-08 2019-08-08 13:30:00 GROUP1             8           8                        NA
9  2019-08-08 2019-08-08 13:45:00 GROUP1             7           7                        NA
10 2019-08-08 2019-08-08 11:45:00 GROUP2            15          16                        12
11 2019-08-08 2019-08-08 12:00:00 GROUP2            16          15                        15
12 2019-08-08 2019-08-08 12:15:00 GROUP2            32          31                        NA
13 2019-08-08 2019-08-08 12:30:00 GROUP2            20          20                        15
14 2019-08-08 2019-08-08 12:45:00 GROUP2            21          21                        15
15 2019-08-08 2019-08-08 13:00:00 GROUP2            40          42                        NA
16 2019-08-08 2019-08-08 13:15:00 GROUP2            15          15                        NA
17 2019-08-08 2019-08-08 13:30:00 GROUP2            13          13                        18
18 2019-08-08 2019-08-08 13:45:00 GROUP2            22          26                        NA
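To make the rule unambiguous, here is a brute-force base R sketch of what I mean. It is only a reference for checking results on small data (it scans the whole data.frame once per row), not the fast solution I am looking for. The helper name first_row_2x and the toy data.frame are made up for illustration; the toy values copy the first four GROUP1 rows, with the time zone fixed to UTC.

```r
# Hypothetical helper: brute-force reference for the rule described above.
# Assumes columns day, hour, group, measure_start, measure_end.
first_row_2x <- function(df) {
  vapply(seq_len(nrow(df)), function(i) {
    # candidate rows: same day and group, hour not earlier than row i,
    # and measure_end at least twice the measure_start of row i
    hit <- which(
      df$day == df$day[i] &
        df$group == df$group[i] &
        df$hour >= df$hour[i] &
        df$measure_end >= 2 * df$measure_start[i]
    )
    if (length(hit) == 0L) return(NA_integer_)
    # keep the candidate with the earliest hour (ties: lowest row number)
    hit[which.min(as.numeric(df$hour[hit]))]
  }, integer(1))
}

# Toy data (illustrative): four 15-minute observations in one group
toy <- data.frame(
  day = as.Date("2019-08-08"),
  hour = as.POSIXct("2019-08-08 11:45:00", tz = "UTC") + 900 * 0:3,
  group = "GROUP1",
  measure_start = c(2, 3, 3, 2),
  measure_end = c(3, 3, 3, 5)
)

first_row_2x(toy)
```

Calling first_row_2x(df) on the full example should give the row_with_me_2x_current_ms column shown above, with row numbers counted over the whole data.frame rather than within each group.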
My data.frame is quite large, so I suspect a data.table approach would be best. However, I am not yet very familiar with data.table syntax. My attempt below did not help:
library(data.table)
dt = data.table(df)
dt[,row_with_me_2x_current_ms:= which.min(dt[,measure_end] / measure_start >= 2) ,by=.(day,group)]