Comparing the values of one dataset against multiple datasets in R

Question

Tags: r, set, intersect

3

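I have a vector of values (x).

I want to find the length of its overlap with each set in a list (y), but without running a loop or lapply. Is that possible?

I would really like to speed up execution.

Many thanks! Here is an example implemented with a loop: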

x <- c(1:5)
y <- list(1:5, 2:6, 3:7, 4:8, 5:9, 6:10)
overlaps <- rep(0, length(y))
for (i in seq(length(y))) { #i=1
  # overlaps[i] <- length(intersect(x, y[[i]]))  # it is slower than %in% 
  overlaps[i] <- sum(x %in% y[[i]])
}
overlaps

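Here is a comparison of some of the approaches proposed in the answers below. As you can see, the loop is still the fastest, but I am hoping to find something faster: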

# Function with the loop:
myloop <- function(x, y) {
  overlaps <- rep(0, length(y))
  for (i in seq(length(y))) overlaps[i] <- sum(x %in% y[[i]])
  overlaps
}

# Function with sapply:
mysapply <- function(x, y) sapply(y, function(e) sum(e %in% x))

# Function with map_dbl:
library(purrr)
mymap <- function(x, y) {
  map_dbl(y, ~sum(. %in% x))
}

library(microbenchmark)
microbenchmark(myloop(x, y), mysapply(x, y), mymap(x, y), times = 30000)

# Unit: microseconds
#           expr  min   lq     mean median   uq      max neval
#   myloop(x, y) 17.2 19.4 26.64801   21.2 22.6   9348.6 30000
# mysapply(x, y) 27.1 29.5 39.19692   31.0 32.9  20176.2 30000
#    mymap(x, y) 59.8 64.1 88.40618   66.0 70.5 114776.7 30000

- user3245256

Why don't you want to use the *apply functions? - iod

Why should it be faster than a loop? - user3245256

3 Answers

2

You can use the map functions from purrr, which iterate over each element of the list y and apply a function to it. Below I use map_dbl to return a vector.

library(purrr)
map_dbl(y,~sum(. %in% x))
[1] 5 4 3 2 1 0

Checking the timings:

f1 = function(){
  x <- c(1:5)
  y <- lapply(1:5,function(i)sample(1:10,5,replace=TRUE))
  map_dbl(y,~sum(. %in% x))
}

f2 = function(){
  x <- c(1:5)
  y <- lapply(1:5,function(i)sample(1:10,5,replace=TRUE))
  overlaps <- rep(0, length(y))
  for (i in seq(length(y))) { #i=1
    overlaps[i] <- length(intersect(x, y[[i]]))
  }
  overlaps
}

f3 = function(){
  x <- c(1:5)
  y <- lapply(1:5,function(i)sample(1:10,5,replace=TRUE))
  sapply(y,function(i)sum(i%in%x))
}

Let's test them:

system.time(replicate(10000,f1()))
   user  system elapsed 
   1.27    0.02    1.35 

system.time(replicate(10000,f2()))
   user  system elapsed 
   1.72    0.00    1.72 

 system.time(replicate(10000,f3()))
   user  system elapsed 
   0.97    0.00    0.97

So if you want speed, use sapply + %in%; if you want more readable code, use purrr.
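One caveat worth checking before comparing these variants: f1 to f3 build y with sample(..., replace = TRUE), so an element of y can contain repeated values, and in that case the counting expressions used in this thread are not interchangeable. The sketch below illustrates the difference; the vector e is a hypothetical example, not data from the original post.

```r
x <- 1:5
e <- c(1, 1, 3, 9, 9)  # hypothetical list element with repeated values

n_intersect <- length(intersect(x, e))  # distinct values shared by x and e
n_x_in_e    <- sum(x %in% e)            # elements of x that occur in e
n_e_in_x    <- sum(e %in% x)            # elements of e that occur in x (repeats counted)

c(n_intersect, n_x_in_e, n_e_in_x)
# [1] 2 2 3
```

Because x has no repeats here, sum(x %in% e) agrees with length(intersect(x, e)), while sum(e %in% x) counts each repeated value in e separately.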

- StupidWolf

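I upvoted because it is very elegant, but I can't accept it as the answer: I microbenchmarked it and found it slightly slower than the for loop (mean execution time 11% slower than the for loop, median 3.9% slower). - user3245256

That's interesting; you never explicitly mentioned run time in the question. The slowness in your code comes from intersect and length. Using sum(%in%) fixes it. - StupidWolf

You need to count the time spent creating the empty overlaps vector :) - StupidWolf

You can't use system.time for speed tests; it is too imprecise. I tested with microbenchmark instead of system.time. I built 3 functions, the same as yours, except that each takes x and y as arguments so that the function itself contains only the computation. I ran them with times set to 20000. The results were the same as before: for loop (fastest): mean 91.8 microseconds, median 67.6 microseconds; map_dbl: mean 103.6, median 70.2; lapply: mean 109.3, median 78.8. - user3245256

I did mention that I don't want to use a loop or apply (a hidden loop), but my reason was, of course, speed. map_dbl looks like a cool solution, but it is not faster than the loop. - user3245256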

1

Here is an option using data.table, which should be fast if the list y contains long vectors.

library(data.table)
DT <- data.table(ID=rep(seq_along(y), lengths(y)), Y=unlist(y))
DT[.(Y=x), on=.(Y)][, .N, ID]

Also, if you need to run this for multiple x, I would suggest creating a data.table that combines all of the x values before running the code.

Output:

- chinsoon12
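The flatten-and-count idea in the data.table answer above can also be sketched in base R. This is a possible variant rather than the answerer's code, and it has not been benchmarked here: the list is unlisted once, each value is labelled with the index of the element it came from, and tabulate counts the matches per index. Unlike the grouped .N result, it also reports a 0 for elements with no overlap.

```r
x <- 1:5
y <- list(1:5, 2:6, 3:7, 4:8, 5:9, 6:10)

ids  <- rep(seq_along(y), lengths(y))  # which list element each value came from
vals <- unlist(y)                      # all values of y in one vector

# One vectorised %in% over all values, then count the hits per list element
overlaps <- tabulate(ids[vals %in% x], nbins = length(y))
overlaps
# [1] 5 4 3 2 1 0
```

Note that this counts the elements of y found in x, so repeated values inside an element of y are counted once per occurrence.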

Content provided by Stack Overflow.

- Louis · Accepted Answer

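Use sapply for more concise code.

Even though sapply does not offer much of a performance advantage over the for loop, the code is at least more concise. Here is the sapply equivalent of your code: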

x <- c(1:5)
y <- list(1:5, 2:6, 3:7, 4:8, 5:9, 6:10)    
res <- sapply(y, function(e) length(intersect(e, x)))

> res
[1] 5 4 3 2 1 0
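A possible variant, not part of the original answer: vapply takes the same arguments as sapply plus a template for each result (here integer(1)), so an unexpected return type raises an error instead of silently changing the shape of the result. A minimal sketch with the same x and y:

```r
x <- 1:5
y <- list(1:5, 2:6, 3:7, 4:8, 5:9, 6:10)

# sum() of a logical vector returns an integer, so integer(1) is the template
res <- vapply(y, function(e) sum(e %in% x), integer(1))
res
# [1] 5 4 3 2 1 0
```

No speed claim is made for this variant; it is only a stricter way to write the same computation.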

Performance improvement

As @StupidWolf noted, it is not sapply that slows execution down but length and intersect. Here are my test results over 100000 runs:

B <- 100000
system.time(replicate(B, sapply(y, function(e) length(intersect(e, x)))))
user  system elapsed 
9.79    0.01    9.79

system.time(replicate(B, sapply(y, function(e) sum(e %in% x))))
user  system elapsed 
2       0       2

# Using microbenchmark for more precise results:
library(microbenchmark)
microbenchmark(expr1 = sapply(y, function(e) length(intersect(e, x))), times = B)
expr  min   lq     mean median   uq    max neval
expr1 81.4 84.9 91.87689   86.5 88.2 7368.7 1e+05

microbenchmark(expr2 = sapply(y, function(e) sum(e %in% x)), times = B)
expr  min   lq     mean median uq    max neval
expr2 15.4 16.1 17.68144   16.4 17 7567.9 1e+05

As we can see, the second approach is the clear winner on performance.

Hope this helps.