我有一个数字向量,比如1到8000。我想生成恰好n
个不同的向量排列。有没有一种方法可以在不计算所有排列(因为8000!非常大而无法放入内存)的情况下完成这个任务。
function distinct_permutations(L, N){
# Return N distinct permutations of L as a list of lists
return(x)
}
x <- seq(1:8000)
我有一个数字向量,比如1到8000。我想生成恰好n
个不同的向量排列。有没有一种方法可以在不计算所有排列(因为8000!非常大而无法放入内存)的情况下完成这个任务。
function distinct_permutations(L, N){
# Return N distinct permutations of L as a list of lists
return(x)
}
x <- seq(1:8000)
set.seed(42)
RcppAlgos::permuteSample(v=10, m=10, n=3)
# [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
# [1,] 9 8 5 3 4 1 10 6 2 7
# [2,] 1 3 5 9 7 6 8 10 2 4
# [3,] 2 8 5 9 4 7 3 10 6 1
我将实现在评论中提出的3个想法中的第一个,并与其他用户提供的解决方案进行比较。我的解决方案更加通用,因为可以提供任何一组值(其他解决方案必须被视为索引),我提供了一个向量列表,而且对于许多场景,我的解决方案速度更快。
当前的解决方案:
perm_n
我自己的解决方案使用sample
和 unique
RcppAlgos::permuteSample
arrangements::permutations
编辑:添加额外的方法
perm_n <- function(vec, n) {
# sample 2 * N samples as a risk of duplicates for bigger samples
unique(
lapply(seq_len(n * 2), function(x) sample(vec, length(vec)))
)[seq_len(n)]
}
microbenchmark::microbenchmark(
RcppAlgos::permuteSample(v=10, m=10, n=10),
perm_n(1:10, 10),
arrangements::permutations(10, 10, nsample = 10)
)
#> Warning in microbenchmark::microbenchmark(RcppAlgos::permuteSample(v = 10, :
#> less accurate nanosecond times to avoid potential integer overflows
#> Unit: microseconds
#> expr min lq mean
#> RcppAlgos::permuteSample(v = 10, m = 10, n = 10) 230.174 249.7105 1605.4817
#> perm_n(1:10, 10) 65.559 68.0190 182.8387
#> arrangements::permutations(10, 10, nsample = 10) 3.649 4.1205 238.7746
#> median uq max neval
#> 1038.961 2292.556 28600.70 100
#> 71.217 82.861 10617.61 100
#> 4.633 7.298 23295.99 100
microbenchmark::microbenchmark(
RcppAlgos::permuteSample(v=8000, m=8000, n=10),
perm_n(1:8000, 10),
arrangements::permutations(8000, 8000, nsample = 10),
times = 10
)
#> Unit: milliseconds
#> expr min lq
#> RcppAlgos::permuteSample(v = 8000, m = 8000, n = 10) 199.5939 199.858108
#> perm_n(1:8000, 10) 4.3911 4.428697
#> arrangements::permutations(8000, 8000, nsample = 10) 593.0577 594.338993
#> mean median uq max neval
#> 200.44507 200.282540 200.944936 201.960178 10
#> 4.49205 4.457807 4.555182 4.663914 10
#> 600.67245 598.277678 604.146808 619.717132 10
microbenchmark::microbenchmark(
RcppAlgos::permuteSample(v=8000, m=8000, n=300),
perm_n(1:8000, 300),
arrangements::permutations(8000, 8000, nsample = 300),
times = 3
)
#> Unit: milliseconds
#> expr min lq
#> RcppAlgos::permuteSample(v = 8000, m = 8000, n = 300) 5939.2831 5965.0268
#> perm_n(1:8000, 300) 136.7174 137.3693
#> arrangements::permutations(8000, 8000, nsample = 300) 17875.7788 17877.2049
#> mean median uq max neval
#> 5984.365 5990.7704 6006.9057 6023.0410 3
#> 138.048 138.0212 138.7133 139.4053 3
#> 17891.089 17878.6310 17898.7438 17918.8566 3
此内容由 reprex v2.0.2 创建于2022年11月23日。
arrangements
包来实现这个。例如:arrangements::permutations(10, 10, nsample = 10)
。 - Joseph Woodnsample = 10
而不是 nsample = 3
。除非这是有意为之? - Joseph Woodsample
的lapply
将更高效。library(RcppAlgos)
perm_n.safe <- function(v, n) {
if (n/permuteCount(v) > 0.01) return(as.list(data.frame(t(permuteSample(v, n = n)))))
k <- 0L
out <- vector("list", n)
m <- n
while (k < m) {
s <- unique(lapply(1:n, function(i) sample(v)))
out[(k + 1L):(k + length(s))] <- s
k <- k + length(s)
n <- ceiling(1.1*n/k)
}
out
}
n
倍增,且保证返回n
个样本。set.seed(148461194)
microbenchmark::microbenchmark(
perm_n = perm_n(1:8000, 300),
perm_n.safe = perm_n.safe(1:8000, 300),
times = 10
)
#> Unit: milliseconds
#> expr min lq mean median uq max neval
#> perm_n 229.9223 231.2907 233.5303 233.0133 235.8125 237.5975 10
#> perm_n.safe 117.1336 118.6889 123.5767 119.3938 122.4304 147.8789 10
microbenchmark::microbenchmark(
perm_n = perm_n(1:5, 100),
perm_n.safe = perm_n.safe(1:5, 100)
)
#> Unit: microseconds
#> expr min lq mean median uq max neval
#> perm_n 728.8 766.75 908.462 789.85 993.4 3327.8 100
#> perm_n.safe 168.9 186.90 211.721 200.80 228.4 414.1 100
sum(lengths(perm_n(1:5, 100)) == 5)
#> [1] 92
sum(lengths(perm_n.safe(1:5, 100)) == 5)
#> [1] 100