如何将此操作向量化

4

在R中,我该如何向量化这个操作?

我有一张参考值表,其中包含一个下限(A)和上限(B)。

我还有一个值(X)的表格要在上述表格中进行查找。

对于每个X的值,我需要确定它是否在参考表格中的任何A和B的值之间。

为了演示上面的内容,这里使用循环来解决:

#For Reproduceability,
set.seed(1); 

#Set up the Reference and Lookup Tables
nref = 5; nlook = 10
referenceTable <- data.frame(A=runif(nref,min=0.25,max=0.5),
                             B=runif(nref,min=0.50,max=0.75));
lookupTable    <- data.frame(X=runif(nlook),IsIn=0)

#Process for each row in the lookup table
#search for at least one match in the reference table where A <= X < B
for(x in 1:nrow(lookupTable)){
  v   <- lookupTable$X[x]
  tmp <- subset(referenceTable,v >= A & v < B)
  lookupTable[x,'IsIn'] = as.integer(nrow(tmp) > 0)
}

我试图移除for(x in .... )组件,因为我的现实问题中的表格有很多很多条记录。


1
这种类型的问题在SO上已经被问了很多次。请搜索data.table::foverlaps或Bioconductor IRanges包。 - David Arenburg
如果“apply()”函数在这里不是一个好选择(并不比原始的“for”循环更好),那么什么是一个好选择? - Tim Biegeleisen
我建议在这里使用findInterval可能会有益,但是明天之前我没有时间发布解决方案。例如?findInterval或http://stackoverflow.com/questions/31478022/find-most-recent-observation-r和http://stackoverflow.com/questions/34047920/extracting-names-of-vector-by-time-bin/34048151#34048151。 - mts
在相关的问答中,有一个很好的回答:链接。注意 pos2 := pos 步骤是创建一个单值“范围”的关键。 - Henrik
2个回答

6

我找不到完全匹配的内容,因此这里提供一个可能的解决方案,使用 data.table::foverlaps。首先,我们需要在lookupTable中添加一列以创建两侧的边界。然后将referenceTable设置为键(必须这样做才能使用foverlaps),然后运行简单的重叠连接,同时仅选择第一个连接,因为您想要任何连接(我使用了0^将其转换为二进制,因为您不想要实际位置)。

library(data.table)
setDT(lookupTable)[, Y := X] # Add an additional boundary column
setkey(setDT(referenceTable)) # Key the referenceTable data set
lookupTable[, IsIn := 0 ^ !foverlaps(lookupTable, 
                                     referenceTable, 
                                     by.x = c("X", "Y"),
                                     mult = "first", 
                                     nomatch = 0L, 
                                     which = TRUE)]
#             X IsIn         Y
#  1: 0.2059746    0 0.2059746
#  2: 0.1765568    0 0.1765568
#  3: 0.6870228    1 0.6870228
#  4: 0.3841037    1 0.3841037
#  5: 0.7698414    0 0.7698414
#  6: 0.4976992    1 0.4976992
#  7: 0.7176185    1 0.7176185
#  8: 0.9919061    0 0.9919061
#  9: 0.3800352    1 0.3800352
# 10: 0.7774452    0 0.7774452

0
使用 findInterval
referenceTable2 = cbind(-Inf, referenceTable)

for(x in 1:nrow(referenceTable2)){
  tmp <- findInterval(lookupTable$X, referenceTable2[x,])
  lookupTable[,'IsIn'] = lookupTable[,'IsIn'] + (tmp == 2)
}
lookupTable[,'IsIn'] = sign(lookupTable[,'IsIn'])

正如您所看到的,仍然存在对参考表的循环,因此如果您的参考表保持较小,则此解决方案特别有效。一些基准测试:

1)样本集:

> microbenchmark(nicholas = {set.seed(1); nref = 5; nlook = 10; referenceTable <- data.frame(A=runif(nref,min=0.25,max=0.5), B=runif(nref,min=0.50,max=0.75)); lookupTable <- data.frame(X=runif(nlook),IsIn=0); for(x in 1:nrow(lookupTable)){v   <- lookupTable$X[x]; tmp <- subset(referenceTable,v >= A & v < B); lookupTable[x,'IsIn'] = as.integer(nrow(tmp) > 0)}}, 
+                mts = {set.seed(1); nref = 5; nlook = 10; referenceTable <- data.frame(A=runif(nref,min=0.25,max=0.5), B=runif(nref,min=0.50,max=0.75)); lookupTable <- data.frame(X=runif(nlook),IsIn=0); referenceTable2 = cbind(-Inf, referenceTable); for(x in 1:nrow(referenceTable2)){tmp <- findInterval(lookupTable$X, referenceTable2[x,]); lookupTable[,'IsIn'] = lookupTable[,'IsIn'] + (tmp == 2);}; lookupTable[,'IsIn'] = sign(lookupTable[,'IsIn'])},
+                david = {set.seed(1); nref = 5; nlook = 10; referenceTable <- data.frame(A=runif(nref,min=0.25,max=0.5), B=runif(nref,min=0.50,max=0.75)); lookupTable <- data.frame(X=runif(nlook),IsIn=0); setDT(lookupTable)[, Y := X]; setkey(setDT(referenceTable)); lookupTable[, IsIn := 0 ^ !foverlaps(lookupTable, referenceTable, by.x = c("X", "Y"), mult = "first", nomatch = 0L, which = TRUE)]},
+                times = 100)
Unit: milliseconds
     expr      min       lq     mean   median       uq       max neval
 nicholas 1.948096 2.091311 2.190386 2.150790 2.254352  4.092121   100
      mts 2.520489 2.711986 2.883299 2.803421 2.885990  5.165999   100
    david 6.466129 7.013798 7.344596 7.197132 7.422916 12.274029   100

2) nref = 10; nlook = 1000

Unit: milliseconds
     expr        min         lq       mean     median         uq        max neval
 nicholas 152.804680 160.681265 164.601443 163.304849 165.387296 243.250708   100
      mts   4.434997   4.720027   5.025555   4.819624   4.991995  11.325172   100
    david   6.505314   6.920032   7.181116   7.111529   7.331950   9.318765   100

3) nref = 200; nlook = 1000

Unit: milliseconds
     expr        min         lq       mean     median         uq       max neval
 nicholas 172.939666 179.672397 183.337253 181.191081 183.694077 264.59672   100
      mts  77.836588  81.071752  83.860281  81.991919  83.484246 168.22290   100
    david   6.870116   7.404256   7.736445   7.587591   7.836234  11.54349   100

我认为 David 的解决方案是最佳的。当参考时间间隔很少时,该解决方案具有优势。请注意,在您的示例中,许多这些时间间隔是重叠的,预先将它们组合起来可能会进一步改善结果。

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接