R: Calculate the number of days since the last event, per ID


I want to know the number of days since the last event for each ID. The data looks like this:

df <- data.frame(date=as.Date(
c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001",
"06/08/2010","15/09/2010","15/10/2010","03/01/2011","17/03/2011"), "%d/%m/%Y"), 
event=c(0,0,1,0,1, 1,0,0,0,1),id = c(rep(1,5),rep(2,5)))

         date event id
1  2000-07-06     0  1
2  2000-09-15     0  1
3  2000-10-15     1  1
4  2001-01-03     0  1
5  2001-03-17     1  1
6  2010-08-06     1  2
7  2010-09-15     0  2
8  2010-10-15     0  2
9  2011-01-03     0  2
10 2011-03-17     1  2

I borrowed heavily from a data.table solution (here), but it does not take the ID into account.

library(data.table)
setDT(df)
setkey(df, date,id)

df = df[event == 1, .(lastevent = date), key = date][df, roll = TRUE]
df[, tae := difftime(lastevent, shift(lastevent, 1L, "lag"), unit = "days")]
df[event == 0, tae:= difftime(date, lastevent, unit = "days")]

It produces the following output:
          date  lastevent event id       tae
 1: 2000-07-06       <NA>     0  1   NA days
 2: 2000-09-15       <NA>     0  1   NA days
 3: 2000-10-15 2000-10-15     1  1   NA days
 4: 2001-01-03 2000-10-15     0  1   80 days
 5: 2001-03-17 2001-03-17     1  1  153 days
 6: 2010-08-06 2010-08-06     1  2 3429 days
 7: 2010-09-15 2010-08-06     0  2   40 days
 8: 2010-10-15 2010-08-06     0  2   70 days
 9: 2011-01-03 2010-08-06     0  2  150 days
10: 2011-03-17 2011-03-17     1  2  223 days

However, I would like the output to look like this:

          date  lastevent event id       tae
 1: 2000-07-06       <NA>     0  1   NA days
 2: 2000-09-15       <NA>     0  1   NA days
 3: 2000-10-15 2000-10-15     1  1   NA days
 4: 2001-01-03 2000-10-15     0  1   80 days
 5: 2001-03-17 2001-03-17     1  1  153 days
 6: 2010-08-06 2010-08-06     1  2   NA days
 7: 2010-09-15 2010-08-06     0  2   40 days
 8: 2010-10-15 2010-08-06     0  2   70 days
 9: 2011-01-03 2010-08-06     0  2  150 days
10: 2011-03-17 2011-03-17     1  2  223 days    

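The only difference is the NA in row 6 of the tae column. This post is a related but unanswered one. I have also looked here, but that solution does not work in my case. There are many similar questions, but none of them does the calculation per ID. Thanks!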

1 Answer

df <- data.table(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001","06/08/2010","15/09/2010","15/10/2010","03/01/2011","17/03/2011"), 
"%d/%m/%Y"), event=c(0,0,1,0,1, 1,0,1,0,1),id = c(rep(1,5),rep(2,5)))

tempdt <- df[event==1,]

tempdt[,tae := date - shift(date), by = id]

df <- merge(df, tempdt, by = c("date", "event", "id"), all.x = TRUE)

df[, tae := ifelse(shift(event)==1, date - shift(date), tae), by = id]
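
The step that matters in the code above is `by = id` inside the `shift()` call. As a minimal sketch (the table `events` and the columns `gap_all` and `gap_by_id` are made up for illustration and are not part of the original post), the following compares an ungrouped and a grouped `shift()` on the four event dates from the question. Without grouping, the first event of id 2 is compared against the last event of id 1, which is where the 3429 days in the question's output come from.

```r
library(data.table)

# Hypothetical helper table: only the event rows from the question's data
events <- data.table(
  id        = c(1, 1, 2, 2),
  lastevent = as.Date(c("2000-10-15", "2001-03-17", "2010-08-06", "2011-03-17"))
)

# Ungrouped: the previous row may belong to another id
events[, gap_all := as.numeric(lastevent - shift(lastevent))]

# Grouped: shift() restarts within each id, so the first event per id gets NA
events[, gap_by_id := as.numeric(lastevent - shift(lastevent)), by = id]

print(events)
# gap_all:   NA 153 3429 223
# gap_by_id: NA 153   NA 223
```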

Edit

A more general solution

df <- data.table(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001", "18/03/2001",
                            "06/08/2010","15/09/2010","15/10/2010","03/01/2011","17/03/2011","19/03/2011"), 
                          "%d/%m/%Y"), 
             event=c(1,0,0,0,0,0,1,1,1,0,1,0),id = c(rep(1,6),rep(5,6)))

##for event = 1 observations
tempdt <- df[event==1,]

tempdt[,tae := date - shift(date), by = id]

df <- merge(df, tempdt, by = c("date", "event", "id"), all.x = TRUE)

##for event = 0 observations
for(d in df[event==0, date]){
  # print(as.Date(d, origin = "1970-01-01"))
  df[date == d & event == 0, tae := as.Date(d, origin = "1970-01-01") - 
   max(df[date<d & event==1,date]), by = id]  
}

Edit 2: There may well be faster ways to do this, but this version no longer produces any warnings when the first observation has event = 0.

df <- data.table(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001","06/08/2010","15/09/2010","15/10/2010","03/01/2011","17/03/2011"),
                           "%d/%m/%Y"), event=c(0,0,1,0,1, 1,0,0,0,1),id = c(rep(1,5),rep(2,5))) 

tempdt <- df[event==1,] 

tempdt[,tae := date - shift(date), by = id] 

df <- merge(df, tempdt, by = c("date", "event", "id"), all.x = TRUE) 

for(i in unique(df[,id])){
  # print(i)
  for(d in df[date>df[id == i & event==1,min(date)] & event==0, date]){
  # print(as.Date(d, origin = "1970-01-01"))
    df[id == i & date == d & event == 0,
     tae := as.Date(d, origin = "1970-01-01") - max(df[date<d & 
     event==1,date])]
  }  
}
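
As one possible loop-free variant of the idea above, here is a sketch rather than a tested replacement. It assumes data.table 1.12.4 or later, because it relies on `nafill()`, and the helper `prev_event_date()` is a hypothetical name introduced only for this example. Within each id it takes the date of the most recent event strictly before the current row and subtracts it from the row's date; `tae` comes out as a plain number of days instead of a difftime.

```r
library(data.table)

df <- data.table(
  date  = as.Date(c("06/07/2000", "15/09/2000", "15/10/2000", "03/01/2001", "17/03/2001",
                    "06/08/2010", "15/09/2010", "15/10/2010", "03/01/2011", "17/03/2011"),
                  "%d/%m/%Y"),
  event = c(0, 0, 1, 0, 1, 1, 0, 0, 0, 1),
  id    = c(rep(1, 5), rep(2, 5))
)
setorder(df, id, date)  # rows must be in date order within each id

# Hypothetical helper: numeric date of the latest event strictly before each row
prev_event_date <- function(date, event) {
  ev   <- ifelse(event == 1, as.numeric(date), NA_real_)  # keep event dates only
  prev <- shift(ev)                                       # exclude the current row
  nafill(prev, type = "locf")                             # carry the last event date forward
}

df[, tae := as.numeric(date) - prev_event_date(date, event), by = id]

print(df$tae)
# NA NA NA 80 153 NA 40 70 150 223
```

Rows with no earlier event in the same id stay NA, so no `max()` over an empty set is ever evaluated.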

So simple it hurts. Thank you very much! - HOSS_JFL
Just wanted to mention that your code does not work for this data: df <- data.frame(date=as.Date(c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001", "18/03/2001","06/08/2010","15/09/2010","15/10/2010","03/01/2011","17/03/2011","19/03/2011"), "%d/%m/%Y"), event=c(1,0,0,0,0,0, 1,1,1,0,1,0),id = c(rep(1,6),rep(5,6))). - HOSS_JFL
Thank you very much. However, with the dataset from my edited post I get some warnings: "In max.default(numeric(0), na.rm = FALSE) : no non-missing arguments to max; returning -Inf". It does not work at the start... Is this caused by data.table? - HOSS_JFL
I don't understand what those warnings refer to. Are you using data.table? - simone
Hello! Yes, I am using data.table 1.10-4. rm(list = ls()) df <- data.table(date=as.Date( c("06/07/2000","15/09/2000","15/10/2000","03/01/2001","17/03/2001", "06/08/2010","15/09/2010","15/10/2010","03/01/2011","17/03/2011"), "%d/%m/%Y"), event=c(0,0,1,0,1, 1,0,0,0,1),id = c(rep(1,5),rep(2,5))) tempdt <- df[event==1,] tempdt[,tae := date - shift(date), by = id] df <- merge(df, tempdt, by = c("date", "event", "id"), all.x = TRUE) for(d in df[event==0, date]){ df[date == d & event == 0, tae := as.Date(d, origin = "1970-01-01") - max(df[date<d & event==1,date]), by = id] } - HOSS_JFL
That was just because you are using a different df than I am. Thanks! - HOSS_JFL
