使用概率分布指定一定数量的值（在R中）

Question

使用概率分布指定一定数量的值（在R中）

7

您好，非常感谢您的帮助！

我正在尝试生成一个向量，其中包含一定数量的值，并根据概率分布进行分配。例如，我想要一个长度为31的向量，其中包含26个零和5个1。（向量的总和应始终为 5）但是，1所在的位置很重要。为了确定哪些值应该为1，哪些应该为0，我有一个概率向量（长度为31），如下所示：

probs<-c(0.01,0.02,0.01,0.02,0.01,0.01,0.01,0.04,0.01,0.01,0.12,0.01,0.02,0.01,
0.14,0.06,0.01,0.01,0.01,0.01,0.01,0.14,0.01,0.07,0.01,0.01,0.04,0.08,0.01,0.02,0.01)

我可以使用 rbinom 根据这个分布选择值并获得长度为 31 的向量，但是我无法精确地选择五个值。

Inv=rbinom(length(probs),1,probs)
Inv
[1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 0

有什么想法吗？

再次感谢！

- Laura

向量的总和应始终为一。你是指“...应始终为五”吗？ - Chase

3个回答

7

Chase提供了一个很好的答案，并提到了while()迭代的问题。run-away while()的一个问题是，如果您一次只做一个试验，并且需要很多次，比如t次才能找到符合目标数字1的数量，则会产生t次对主函数rbinom()的调用开销。

然而，有一种方法可以解决这个问题，因为像R中所有这些（伪）随机数生成器一样，rbinom()也是向量化的，我们可以一次生成m个试验，并检查这些m个试验是否符合5个1的要求。如果没有找到，则反复绘制m个试验，直到找到符合要求的试验。这个想法在下面的函数foo()中实现。chunkSize参数是m，表示一次绘制的试验数量。我还利用这个机会允许函数查找不止一个符合条件的试验；n参数控制返回多少个符合条件的试验。

foo <- function(probs, target, n = 1, chunkSize = 100) {
    len <- length(probs)
    out <- matrix(ncol = len, nrow = 0) ## return object
    ## draw chunkSize trials
    trial <- matrix(rbinom(len * chunkSize, 1, probs),
                    ncol = len, byrow = TRUE)
    rs <- rowSums(trial)  ## How manys `1`s
    ok <- which(rs == 5L) ## which meet the `target`
    found <- length(ok)   ## how many meet the target
    if(found > 0)         ## if we found some, add them to out
        out <- rbind(out,
                     trial[ok, , drop = FALSE][seq_len(min(n,found)), , 
                                               drop = FALSE])
    ## if we haven't found enough, repeat the whole thing until we do
    while(found < n) {
        trial <- matrix(rbinom(len * chunkSize, 1, probs),
                            ncol = len, byrow = TRUE)
        rs <- rowSums(trial)
        ok <- which(rs == 5L)
        New <- length(ok)
        if(New > 0) {
            found <- found + New
            out <- rbind(out, trial[ok, , drop = FALSE][seq_len(min(n, New)), , 
                                                        drop = FALSE])
        }
    }
    if(n == 1L)           ## comment this, and
        out <- drop(out)  ## this if you don't want dimension dropping
    out
}

它的工作原理如下：

> set.seed(1)
> foo(probs, target = 5)
 [1] 1 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 0 0 0 1 0 0 0 1 0 0 0 0
[31] 0
> foo(probs, target = 5, n = 2)
     [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11]
[1,]    0    0    0    0    0    0    0    0    0     0     0
[2,]    0    0    0    0    0    0    0    0    0     0     1
     [,12] [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21]
[1,]     0     0     0     1     1     0     0     0     0     0
[2,]     0     1     0     0     1     0     0     0     0     0
     [,22] [,23] [,24] [,25] [,26] [,27] [,28] [,29] [,30] [,31]
[1,]     1     0     1     0     0     0     1     0     0     0
[2,]     1     0     1     0     0     0     0     0     0     0

请注意，当 n == 1 时，我会删除空维度。如果您不需要此功能，请将最后一个 if 代码块注释掉。

您需要平衡 chunkSize 的大小和同时检查那么多次数的计算负担。如果要求（这里是 5 个 1）非常不可能发生，则增加 chunkSize，以便减少对 rbinom() 的调用次数。如果该要求很可能被满足，则每次绘制试验和大量的 chunkSize 是没有意义的，因为您必须评估每次试验的绘制。

- Gavin Simpson

+1 虽然这样的努力应该得到更好的回报。非常棒的答案，谢谢你。 - Andrie

我会重复Andrie的评论。这是一个更可伸缩解决方案。我一直在考虑向量化，但无法想出如何在这里利用它，干得好 +1。 - Chase

这很棒，但我想我需要一点时间来处理它。 :) - Laura

5

我想您希望从二项分布中重新采样，以给定的概率集合直到达到目标值为5，是这样吗？如果是这样，那么我认为这就是您想要的。一个while循环可以用来迭代，直到满足条件。如果您提供非常不切实际的概率和目标值，我猜它可能会变成一个失控的函数，所以请注意 :)

FOO <- function(probs, target) {
  out <- rbinom(length(probs), 1, probs)

  while (sum(out) != target) {

    out <- rbinom(length(probs), 1, probs)
  }
  return(out)
}

FOO(probs, target = 5)

> FOO(probs, target = 5)  
 [1] 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 1 0 0 0 1 0 0 1 0 0 0 0 1 0

- Chase

+1 @Chase，很好的回答和关于while()循环的良好解释。修复该问题是可以实现的，但会使函数更加复杂... - Gavin Simpson

谢谢！这个方法可以运行，但是需要很长时间。我正在运行1000个模拟，每个模拟都有5、10、15等目标，每个周期大约需要4个小时。让我尝试另一种方法，然后再回来告诉你。 - Laura

@Laura - Gavin和James的答案都比我的聪明一些，但也许这个简单的实现说明了如何使用“while”循环概念。 - Chase

确实！这非常有用。 :) - Laura

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- James · Accepted Answer

你可以考虑使用加权的 sample.int 来选择位置。

Inv<-integer(31)
Inv[sample.int(31,5,prob=probs)]<-1
Inv
[1] 0 0 0 1 0 1 0 0 0 0 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0