数据表进行滚动连接,根据日期计算平均值的条件。

4
我可以通过多次循环数据集来完成此操作,但认为应该有一种更有效的方法来通过data.table完成此操作。 数据集如下所示:
CaseID         Won     OwnerID      Time_period    Finished
  1            yes        A              1              no
  1            yes        A              3              no
  1            yes        A              5              yes
  2            no         A              4              no
  2            no         A              6              yes
  3            yes        A              2              yes
  4            no         A              3              yes
  5            15         B              2              no

针对每个所有者的行,我想生成一个在该时间段之前完成并赢得的案件数量的平均数。

CaseID         Won     OwnerID      Time_period     Finished     AvgWonByOwner  
  1            yes        A              1              no            NA
  1            yes        A              3              no             1
  1            yes        A              5              yes           .5
  2            no         A              4              no            .5
  2            no         A              6              yes           2/3
  3            yes        A              2              yes           NA
  4            no         A              3              yes           1
  5            15         B              2              no            NA

仔细看来,这似乎非常复杂。我以为可以通过某种滚动合并来完成,但我不知道如何设置条件,使得只有在日期之前赢得的平均值才会被计算,并且必须具有相同的所有者ID。

编辑1:最后一列数字的解释

AvgWonByOwner          Explanation
   NA                  t = 1, No cases finished yet, this could be 0 too
   1                   t = 3, case 3 finished and is won, so average wins is 1
  .5                   t = 5, case 3 finished, won; case 4 finished lost; average = .5
  .5                   t = 4, case 3 finished, won; case 4 finished lost; average = .5
  2/3                  t = 6, case 3 finished, won, case 4 finished lost, case 1 finished won, average: 2/3
   NA                  t = 1, No cases finished yet, this could be 0 too
   1                   t = 3, case 3 finished and is won, so average wins is 1
   NA                  t = 1, No cases finished yet, this could be 0 too

1
你需要解释一下如何得出最后一列中的数字。 - eddi
案例1在t=5时结束,直到t=6才计入平均值。基本上这是一个严格的不等式。 - Luke
@Luke,这真是一个让人费解的问题!+1 - Simon O'Hanlon
2个回答

4
dt = data.table(structure(list(CaseID = c(1, 1, 1, 2, 2, 3, 4, 5), Won = structure(c(3L, 
3L, 3L, 2L, 2L, 3L, 2L, 1L), .Label = c("15", "no", "yes"), class = "factor"), 
    OwnerID = structure(c(1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L), .Label = c("A", 
    "B"), class = "factor"), Time_period = c(1L, 3L, 5L, 4L, 
    6L, 2L, 3L, 2L), Finished = structure(c(1L, 1L, 2L, 1L, 2L, 
    2L, 2L, 1L), .Label = c("no", "yes"), class = "factor")), .Names = c("CaseID", 
"Won", "OwnerID", "Time_period", "Finished"), row.names = c(NA, 
-8L), class = c("data.table", "data.frame")))

# order
setkey(dt, OwnerID, Time_period)

# calculate the required ratio but including current time
dt[, ratio := cumsum(Finished == "yes" & Won == "yes") /
              cumsum(Finished == "yes"),
     by = list(OwnerID)]

# shift to satisfy the strict inequality as per OP
dt[, avgWon := c(NaN, ratio[-.N]), by = OwnerID]

# take the first one for each time (that is last one from previous time)
# so that all of the outcomes happening at same time are accounted for
dt[, avgWon := avgWon[1], by = key(dt)]

dt[order(OwnerID, CaseID)]
#   CaseID Won OwnerID Time_period Finished     ratio    avgWon
#1:      1 yes       A           1       no       NaN       NaN
#2:      1 yes       A           3       no 1.0000000 1.0000000
#3:      1 yes       A           5      yes 0.6666667 0.5000000
#4:      2  no       A           4       no 0.5000000 0.5000000
#5:      2  no       A           6      yes 0.5000000 0.6666667
#6:      3 yes       A           2      yes 1.0000000       NaN
#7:      4  no       A           3      yes 0.5000000 1.0000000
#8:      5  15       B           2       no       NaN       NaN

+1 我真的不知道你是如何破解的。非常聪明。 - Simon O'Hanlon
我不确定如果单个OwnerID有多条相同的Time_period记录,这段代码是否有效。(问题出在“shift”步骤上,它有点取巧...) - Josh O'Brien
@JoshO'Brien 抱歉我有点困惑 - 它们是相同的吗...?(除了 Time_period == 3 只有两行,只有一个 rbind?我感觉我们在做不同的事情 :)) - eddi
@eddi -- 当我尝试使用 dt <- rbind(dt, dt[3,]) 时,它们并不相同,但是如果你说只有在重复的情况下,即相同的案例、时间和所有者才会出现错误,那我就相信你。 - Josh O'Brien
1
@eddi -- 我的代码出错了,因为在执行dt <- rbind(dt, dt[3,])时,dt的键被丢弃了。重置键后,它就能在这个(虽然不太重要的)情况下运行了。我现在明白了dt[, avgWon := avgWon[1], by = key(dt)]的作用,以及你的代码为什么可以正常工作。非常聪明(虽然有点晦涩难懂)! - Josh O'Brien
显示剩余5条评论

2
## Compute a data.table recording the win percentage at end of each time period
B <- dt[Finished=="yes",]
B[,winpct := (cumsum(Won=="yes")/seq_along(Won)),by=OwnerID]

## Shift forward by one time step, as per OP's description of problem
B[,Time_period := Time_period + 1]
setkeyv(B, key(dt))

## Append win percentage column back to original data.table
cbind(dt, AvgWonByOwner=B[dt, winpct, roll=TRUE][["winpct"]])
#    CaseID Won OwnerID Time_period Finished AvgWonByOwner
# 1:      1 yes       A           1       no            NA
# 2:      3 yes       A           2      yes            NA
# 3:      1 yes       A           3       no     1.0000000
# 4:      4  no       A           3      yes     1.0000000
# 5:      2  no       A           4       no     0.5000000
# 6:      1 yes       A           5      yes     0.5000000
# 7:      2  no       A           6      yes     0.6666667
# 8:      5  15       B           2       no            NA

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接