按条件或行索引列表对 R data.table 进行筛选

4

我有一个包含距离的数据表。我想通过我的"id"变量和包括的距离阈值(例如,Dist<1,Dist<2等)在数据表中运行各种操作。 我知道如何按id和距离运行操作"by=list(id,Dist)",但我真的希望有一个更像的by变量"by=list(id,c(Dist<=1,Dist<=2,Dist<=3,Dist<=4,Dist<=5))。下面是我数据结构和目标的示例。

#load library
    library(data.table)
#create data  
    set.seed(123L)
    dt<-data.table(id=factor(rep(1:10,5)),V1=rnorm(50,5,5),Dist=sample(1:5,50,replace=T))
#calculate mean of V1 by id and distance (wrong results)
    dt2<-dt[,.(MeanV1=mean(V1)),by=list(id,Dist)]
#calculate mean of V1 by id and conditional distance (right results, wrong method)
    dt2.1<-dt[Dist<=1,.(MeanV1=mean(V1)),by=id]
    dt2.2<-dt[Dist<=2,.(MeanV1=mean(V1)),by=id]
    dt2.3<-dt[Dist<=3,.(MeanV1=mean(V1)),by=id]
    dt2.4<-dt[Dist<=4,.(MeanV1=mean(V1)),by=id]
    dt2.5<-dt[Dist<=5,.(MeanV1=mean(V1)),by=id]
    dt2<-rbind(dt2.1,dt2.2,dt2.3,dt2.4,dt2.5)
#ideal methods if either were valid
    #syntax 1
        dt2<-dt[,.(MeanV1=mean(V1)),by=list(id,c(Dist<=1,Dist<=2,Dist<=3,Dist<=4,Dist<=5))]
    #syntax 2
        rowindices<-list(dt$Dist<=1,dt$Dist<=2,dt$Dist<=3,dt$Dist<=4,dt$Dist<=5)
        dt2<-dt[,.(MeanV1=mean(V1)),by=list(id,rowindices)]

感谢您的提前预祝。

1
dt[.(cutid = 1:5, dcut = 1:5), on=.(Dist <= dcut), allow.cartesian=TRUE][, mean(V1), keyby=.(cutid, id)] 这个例子可以工作吗?不过,如果你正在计算平均值,有更有效的方法。 - Frank
1
太棒了,非常感谢!是的,我只是举了均值作为例子。我正在做比那更复杂的事情。再次感谢! - Brian A
2个回答

4

弗兰克在评论中的回答可以实现您想要的功能。以下是解释:

首先,您可以使用data.table进行“非等连接”,这就是第一个data.table调用正在执行的操作。

首先,我们创建一个阈值表,以便对其进行操作:

> thresholds <- data.table(dist_threshold=1:5)
> thresholds
   dist_threshold
1:              1
2:              2
3:              3
4:              4
5:              5

接下来,我们对原始表格和阈值表格执行非等连接:这将创建一个新表格,其中dist列包含每个ID低于该阈值的所有条目:

> passes_threshold <- dt[thresholds, on=.(Dist < dist_threshold), # non-equi join
+    allow.cartesian=TRUE, # Fixes error, see details in ?data.table
+    nomatch=0 # Do not include thresholds which no row satisfies (i.e. Dist < 1)
+   ]
> passes_threshold
# Here the Dist column now means "Dist < dist_threshold".
# There will be 5 rows where Dist < 2, 19 where Dist < 3, 
# 30 where Dist < 4, and 40 Where Dist < 5 
    id        V1 Dist
 1:  8  8.521825    2
 2:  5  2.002523    2
 3:  6  8.698732    2
 4:  9 -1.701028    2
 5:  2  6.114119    2
---                  
90:  6 -1.392776    5
91: 10  9.033493    5
92:  1  9.565713    5
93:  5  4.579124    5
94:  7  1.498690    5

我们现在可以将连接操作与jby参数中的汇总操作相结合,计算每个阈值的平均距离:
> passes_threshold[,.(mean_Dist_by_threshold=mean(V1)), by=.(threshold=Dist)]
       threshold mean_Dist_per_threshold
1:         2                4.727234
2:         3                4.615258
3:         4                4.202856
4:         5                4.559240

谢谢!这是一个很好的非等值连接解释,并且说明了data.table语法在做什么。 - Brian A

3
作为Scott's answer的补充,他的解决方案可以更简洁地写成
dt[.(1:5), on = .(Dist < V1), allow = TRUE, nomatch = 0][
  , .(mean_Dist_by_threshold = mean(V1)), by = .(threshold = Dist)]

在这里,.(1:5)会即时创建thresholds,然后使用data.table表达式进行链接。 或者,聚合可以在加入时使用by = .EACHI完成:
dt[.(1:5), on = .(Dist < V1), nomatch = 0, 
   .(mean_Dist_by_threshold = mean(V1)), by = .EACHI][
     , setnames(.SD, "Dist", "threshold")]

调用 setnames() 只是为了方便返回与 Scott 的答案相同的结果。

基准测试代码

library(data.table)
# create data
nr <- 5e2L
set.seed(123L) # to make the data reproducible
dt <-
  data.table(
    id = factor(rep(1:10, nr / 10)),
    V1 = rnorm(nr, 5, 5),
    Dist = sample(1:5, nr, replace = T)
  )
str(dt)

microbenchmark::microbenchmark(
  scott = {
    thresholds <- data.table(dist_threshold=1:5)
    passes_threshold <-
      dt[thresholds, on = .(Dist < dist_threshold), # non-equi join
         allow.cartesian = TRUE, # Fixes error, see details in ?data.table
         nomatch = 0 # Do not include thresholds which no row satisfies (i.e. Dist < 1)
         ]
    passes_threshold[, .(mean_Dist_by_threshold = mean(V1)), by = .(threshold = Dist)]
  },
  uwe1 = {
    dt[.(1:5), on = .(Dist < V1), allow = TRUE, nomatch = 0][
      , .(mean_Dist_by_threshold = mean(V1)), by = .(threshold = Dist)]
  },
  uwe2 = {
    dt[.(1:5), on = .(Dist < V1), nomatch = 0, 
       .(mean_Dist_by_threshold = mean(V1)), by = .EACHI][
         , setnames(.SD, "Dist", "threshold")]
  },
  times = 100L
)

基准测试结果

在500行的情况下,三种变体之间只有轻微差异,其中链式略优于Scott's,by = .EACHI略逊。

Unit: milliseconds
  expr      min       lq     mean   median       uq       max neval cld
 scott 1.460058 1.506854 1.618048 1.526019 1.726257  4.768493   100  a 
  uwe1 1.302760 1.327686 1.487237 1.338926 1.372498 12.733933   100  a 
  uwe2 1.827756 1.864777 1.944920 1.888349 2.020097  2.233269   100   b

有50000行数据时,chaining仍然略胜于Scott's,但by = .EACHI已经超越了其他方法。

Unit: milliseconds
  expr      min       lq     mean   median       uq       max neval cld
 scott 3.692545 3.811466 4.016152 3.826423 3.853489 10.336598   100   b
  uwe1 3.560786 3.632999 3.936583 3.642526 3.657992 13.579008   100   b
  uwe2 2.503508 2.545722 2.577735 2.566869 2.602586  2.798692   100  a

有500万行,这变得更加明显:

Unit: milliseconds
  expr      min       lq     mean   median       uq      max neval cld
 scott 641.9945 675.3749 743.0761 708.7552 793.6170 878.4787     3   b
  uwe1 587.1724 587.5557 589.1360 587.9391 590.1178 592.2965     3   b
  uwe2 130.9358 134.6688 157.1860 138.4019 170.3110 202.2202     3  a 

速度差异的一个解释可能是中间结果 passes_threshold 的大小超过了10M行(这就是为什么需要 allow.cartesian = TRUE)。

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接