使用
findInterval
referenceTable2 = cbind(-Inf, referenceTable)
for(x in 1:nrow(referenceTable2)){
tmp <- findInterval(lookupTable$X, referenceTable2[x,])
lookupTable[,'IsIn'] = lookupTable[,'IsIn'] + (tmp == 2)
}
lookupTable[,'IsIn'] = sign(lookupTable[,'IsIn'])
正如您所看到的,仍然存在对参考表的循环,因此如果您的参考表保持较小,则此解决方案特别有效。一些基准测试:
1)样本集:
> microbenchmark(nicholas = {set.seed(1); nref = 5; nlook = 10; referenceTable <- data.frame(A=runif(nref,min=0.25,max=0.5), B=runif(nref,min=0.50,max=0.75)); lookupTable <- data.frame(X=runif(nlook),IsIn=0); for(x in 1:nrow(lookupTable)){v <- lookupTable$X[x]; tmp <- subset(referenceTable,v >= A & v < B); lookupTable[x,'IsIn'] = as.integer(nrow(tmp) > 0)}},
+ mts = {set.seed(1); nref = 5; nlook = 10; referenceTable <- data.frame(A=runif(nref,min=0.25,max=0.5), B=runif(nref,min=0.50,max=0.75)); lookupTable <- data.frame(X=runif(nlook),IsIn=0); referenceTable2 = cbind(-Inf, referenceTable); for(x in 1:nrow(referenceTable2)){tmp <- findInterval(lookupTable$X, referenceTable2[x,]); lookupTable[,'IsIn'] = lookupTable[,'IsIn'] + (tmp == 2);}; lookupTable[,'IsIn'] = sign(lookupTable[,'IsIn'])},
+ david = {set.seed(1); nref = 5; nlook = 10; referenceTable <- data.frame(A=runif(nref,min=0.25,max=0.5), B=runif(nref,min=0.50,max=0.75)); lookupTable <- data.frame(X=runif(nlook),IsIn=0); setDT(lookupTable)[, Y := X]; setkey(setDT(referenceTable)); lookupTable[, IsIn := 0 ^ !foverlaps(lookupTable, referenceTable, by.x = c("X", "Y"), mult = "first", nomatch = 0L, which = TRUE)]},
+ times = 100)
Unit: milliseconds
expr min lq mean median uq max neval
nicholas 1.948096 2.091311 2.190386 2.150790 2.254352 4.092121 100
mts 2.520489 2.711986 2.883299 2.803421 2.885990 5.165999 100
david 6.466129 7.013798 7.344596 7.197132 7.422916 12.274029 100
2) nref = 10; nlook = 1000
Unit: milliseconds
expr min lq mean median uq max neval
nicholas 152.804680 160.681265 164.601443 163.304849 165.387296 243.250708 100
mts 4.434997 4.720027 5.025555 4.819624 4.991995 11.325172 100
david 6.505314 6.920032 7.181116 7.111529 7.331950 9.318765 100
3) nref = 200; nlook = 1000
Unit: milliseconds
expr min lq mean median uq max neval
nicholas 172.939666 179.672397 183.337253 181.191081 183.694077 264.59672 100
mts 77.836588 81.071752 83.860281 81.991919 83.484246 168.22290 100
david 6.870116 7.404256 7.736445 7.587591 7.836234 11.54349 100
我认为 David 的解决方案是最佳的。当参考时间间隔很少时,该解决方案具有优势。请注意,在您的示例中,许多这些时间间隔是重叠的,预先将它们组合起来可能会进一步改善结果。
data.table::foverlaps
或BioconductorIRanges
包。 - David ArenburgfindInterval
可能会有益,但是明天之前我没有时间发布解决方案。例如?findInterval
或http://stackoverflow.com/questions/31478022/find-most-recent-observation-r和http://stackoverflow.com/questions/34047920/extracting-names-of-vector-by-time-bin/34048151#34048151。 - mtspos2 := pos
步骤是创建一个单值“范围”的关键。 - Henrik