How do I incrementally sample without replacement?

Question

python algorithm statistics probability sample

32

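Python has my_sample = random.sample(range(100), 10) for sampling at random, without replacement, from the range [0, 100).

Suppose I have already sampled n such numbers and now want to sample one more without replacement (that is, excluding all n previously sampled numbers). How can I do that super efficiently?

Update: changed from "reasonably efficiently" to "super efficiently" (but ignoring constant factors).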

- necromancer

1

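Do you only want to sample integers in the range [0, x)? What x do you expect? - Chronial

[0, n) works fine for me. I can adapt any problem to it. - necromancer

Is that really what you need? Adapting another problem to it will cost a lot of time, and that matters a great deal given the tight bounds you are asking for. - Chronial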

1

You might want to look at the source code of random.sample. - Eric

2

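This thread is awesome! A simple question ended up needing a 300-point bounty to express gratitude for the amazing answers. One person provided 4 answers, another provided 3, and the original poster provided one that served as the basis for the accepted answer. There is also an amazing paper-like answer that actually contains several sub-answers. I hope everyone reached a satisfying conclusion. Thanks, everyone. :-) - necromancer

13 Answers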

- Chronial · Answer 1

A short one-liner (O(n + m), where n is the size of the range and m is the size of the old sample):

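# random.sample() requires a sequence (passing a set raises TypeError since Python 3.11), so convert to a list first.
next_sample = random.sample(list(set(range(100)).difference(my_sample)), 10)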
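
The question asks for a single additional draw rather than ten. A minimal sketch of the same set-difference idea for that case, assuming the population is range(population_size). The helper name sample_one_more is hypothetical and is not part of the random module:

```python
import random

def sample_one_more(population_size, previous_sample):
    """Return one value from range(population_size) not in previous_sample.

    Hypothetical helper for illustration; costs O(n + m) per call.
    """
    # Values that have not been drawn yet.
    remaining = list(set(range(population_size)).difference(previous_sample))
    if not remaining:
        raise ValueError("every value has already been sampled")
    return random.choice(remaining)

my_sample = random.sample(range(100), 10)
extra = sample_one_more(100, my_sample)
my_sample.append(extra)
```

Because the remaining set is rebuilt on every call, this is simple but not the cheapest option when many single draws follow one another.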

- legel · Answer 2

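It is surprising that this is not already implemented in one of the core functions. Here is a clean version that returns the sampled values together with the list of points that remain after those values have been removed: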

import random

def sample_n_points_without_replacement(n, set_of_points):
    # Note: set_of_points must be a list; it is modified in place.
    sampled_point_indices = random.sample(range(len(set_of_points)), n)
    # Delete from the highest index down so earlier deletions do not shift later ones.
    sampled_point_indices.sort(reverse=True)
    sampled_points = [set_of_points[sampled_point_index] for sampled_point_index in sampled_point_indices]
    for sampled_point_index in sampled_point_indices:
        del set_of_points[sampled_point_index]
    return sampled_points, set_of_points
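
A short usage sketch for incremental draws, repeating the function so that the snippet runs on its own. The function deletes from the list it is given, so successive calls never return the same element twice. The variable names pool, first_batch and one_more are illustrative only:

```python
import random

def sample_n_points_without_replacement(n, set_of_points):
    # Same logic as the answer above: pick n indices, then remove them.
    sampled_point_indices = random.sample(range(len(set_of_points)), n)
    sampled_point_indices.sort(reverse=True)
    sampled_points = [set_of_points[i] for i in sampled_point_indices]
    for i in sampled_point_indices:
        del set_of_points[i]
    return sampled_points, set_of_points

pool = list(range(100))
first_batch, pool = sample_n_points_without_replacement(10, pool)
# A later call draws only from what is left in the pool.
one_more, pool = sample_n_points_without_replacement(1, pool)
```

Each deletion from the middle of a list is linear in the list length, so this favors clarity over raw speed.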

- Patrick Da Silva · Answer 3

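This is a side note: suppose you want to solve exactly the same problem of sampling without replacement from a list (which I'll call sample_space), but instead of sampling uniformly over the elements not yet sampled, you are given an initial probability distribution p that tells you the probability of drawing the i-th element when sampling over the whole space.

Then the following implementation using numpy is numerically stable: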

import numpy as np

def iterative_sampler(sample_space, p=None):
    """
        Samples elements from a sample space (a list) 
        with a given probability distribution p (numPy array) 
        without replacement. If called until StopIteration is raised,
        effectively produces a permutation of the sample space.
    """
    if p is None:
        # Default to the uniform distribution over the sample space.
        p = np.array([1/len(sample_space) for _ in sample_space])

    # Explicit checks instead of assert, which is skipped under `python -O`.
    if not isinstance(sample_space, list) or not isinstance(p, np.ndarray):
        raise TypeError("Required types: \nsample_space: list \np type: np.ndarray")

    # Main loop
    n = len(sample_space)   
    idxs_left = list(range(n))
    for i in range(n):
        idx = np.random.choice(
            range(n-i), 
            p= p[idxs_left] / p[idxs_left].sum()
        )
        yield sample_space[idxs_left[idx]]
        del idxs_left[idx]

It's short and concise, which I like. Let me know what you think!
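
For readers without NumPy, the same idea can be sketched with the standard library. This is an illustrative variant, not the answer's implementation: it assumes p is a plain list of positive weights and relies on random.choices, which normalizes the weights of the remaining indices on each draw. The name iterative_sampler_stdlib is hypothetical:

```python
import random

def iterative_sampler_stdlib(sample_space, p=None):
    """Yield the elements of sample_space without replacement, weighted by p."""
    if p is None:
        p = [1.0] * len(sample_space)  # uniform weights
    idxs_left = list(range(len(sample_space)))
    while idxs_left:
        # random.choices normalizes the weights, so no explicit division is needed.
        weights = [p[i] for i in idxs_left]
        pos = random.choices(range(len(idxs_left)), weights=weights)[0]
        yield sample_space[idxs_left[pos]]
        del idxs_left[pos]

sampler = iterative_sampler_stdlib(["a", "b", "c", "d"], p=[0.4, 0.3, 0.2, 0.1])
first = next(sampler)   # one draw now
rest = list(sampler)    # further draws later, never repeating `first`
```

Since it is a generator, the caller can pull one element at a time, which matches the incremental use case in the question.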