What is the difference between random.sample and random.shuffle in Python?

7
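I have a list a_tot with 1500 elements, and I would like to divide it into two lists at random. List a_1 would have 1300 elements and list a_2 would have 200. My question is about the best way to randomize the original list of 1500 elements. Once the list is randomized, I can take one slice with 1300 elements and another slice with 200.

One way is to use random.shuffle, another is to use random.sample. Is there any difference in the quality of the randomization between the two methods? The data in list 1 should be a random sample, and so should the data in list 2.

Any recommendations?

Using shuffle:
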
random.shuffle(a_tot)    #get a randomized list
a_1 = a_tot[0:1300]     #pick the first 1300
a_2 = a_tot[1300:]      #pick the last 200

Using sample:

new_t = random.sample(a_tot,len(a_tot))    #get a randomized list
a_1 = new_t[0:1300]     #pick the first 1300
a_2 = new_t[1300:]      #pick the last 200

6 Answers

5

The source code for shuffle:

def shuffle(self, x, random=None, int=int):
    """x, random=random.random -> shuffle list x in place; return None.

    Optional arg random is a 0-argument function returning a random
    float in [0.0, 1.0); by default, the standard random.random.
    """

    if random is None:
        random = self.random
    for i in reversed(xrange(1, len(x))):
        # pick an element in x[:i+1] with which to exchange x[i]
        j = int(random() * (i+1))
        x[i], x[j] = x[j], x[i]

The source code for sample:

def sample(self, population, k):
    """Chooses k unique random elements from a population sequence.

    Returns a new list containing elements from the population while
    leaving the original population unchanged.  The resulting list is
    in selection order so that all sub-slices will also be valid random
    samples.  This allows raffle winners (the sample) to be partitioned
    into grand prize and second place winners (the subslices).

    Members of the population need not be hashable or unique.  If the
    population contains repeats, then each occurrence is a possible
    selection in the sample.

    To choose a sample in a range of integers, use xrange as an argument.
    This is especially fast and space efficient for sampling from a
    large population:   sample(xrange(10000000), 60)
    """

    # XXX Although the documentation says `population` is "a sequence",
    # XXX attempts are made to cater to any iterable with a __len__
    # XXX method.  This has had mixed success.  Examples from both
    # XXX sides:  sets work fine, and should become officially supported;
    # XXX dicts are much harder, and have failed in various subtle
    # XXX ways across attempts.  Support for mapping types should probably
    # XXX be dropped (and users should pass mapping.keys() or .values()
    # XXX explicitly).

    # Sampling without replacement entails tracking either potential
    # selections (the pool) in a list or previous selections in a set.

    # When the number of selections is small compared to the
    # population, then tracking selections is efficient, requiring
    # only a small set and an occasional reselection.  For
    # a larger number of selections, the pool tracking method is
    # preferred since the list takes less space than the
    # set and it doesn't suffer from frequent reselections.

    n = len(population)
    if not 0 <= k <= n:
        raise ValueError, "sample larger than population"
    random = self.random
    _int = int
    result = [None] * k
    setsize = 21        # size of a small set minus size of an empty list
    if k > 5:
        setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
    if n <= setsize or hasattr(population, "keys"):
        # An n-length list is smaller than a k-length set, or this is a
        # mapping type so the other algorithm wouldn't work.
        pool = list(population)
        for i in xrange(k):         # invariant:  non-selected at [0,n-i)
            j = _int(random() * (n-i))
            result[i] = pool[j]
            pool[j] = pool[n-i-1]   # move non-selected item into vacancy
    else:
        try:
            selected = set()
            selected_add = selected.add
            for i in xrange(k):
                j = _int(random() * n)
                while j in selected:
                    j = _int(random() * n)
                selected_add(j)
                result[i] = population[j]
        except (TypeError, KeyError):   # handle (at least) sets
            if isinstance(population, list):
                raise
            return self.sample(tuple(population), k)
    return result

As you can see, in both cases the randomization is essentially done by the line int(random() * n). So the underlying algorithm is essentially the same.
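The shared core of the two functions can be sketched in Python 3 as follows. This is a simplified illustration, not the standard library source: the names `fisher_yates_shuffle` and `pool_sample` are made up for this example, and `randrange` stands in for the `int(random() * n)` expression in the older source quoted above.

```python
import random

def fisher_yates_shuffle(x, rng=random):
    """Shuffle list x in place (hypothetical helper mirroring shuffle's loop)."""
    for i in reversed(range(1, len(x))):
        j = rng.randrange(i + 1)      # pick an index in [0, i]
        x[i], x[j] = x[j], x[i]       # swap it into position i

def pool_sample(population, k, rng=random):
    """Return k distinct picks (hypothetical helper mirroring sample's pool branch)."""
    pool = list(population)           # work on a copy; the input is untouched
    n = len(pool)
    if not 0 <= k <= n:
        raise ValueError("k must be between 0 and len(population)")
    result = [None] * k
    for i in range(k):
        j = rng.randrange(n - i)      # pick among the not-yet-selected items
        result[i] = pool[j]
        pool[j] = pool[n - i - 1]     # move a non-selected item into the vacancy
    return result

data = list(range(10))
fisher_yates_shuffle(data)            # data is reordered in place
picked = pool_sample(range(10), 4)    # four distinct values from 0..9
```

Both loops draw one random index per step from a range that shrinks by one each time; `pool_sample` simply stops after `k` steps.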


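Note the comments: if you have a list that you are free to shuffle, then (depending on the size) shuffling might be more efficient, since you don't have to check whether a particular element has already been selected. - mgilson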

4

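There are two major differences between shuffle() and sample():

1) shuffle() modifies the data in place, so its input must be a mutable sequence. In contrast, sample() produces a new list, and its input can be much more varied (tuple, string, xrange, bytearray, set, etc.).

2) sample() lets you potentially do less work (i.e. a partial shuffle).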
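Applied to the 1500-element split from the question, point 2 could look like the sketch below: only the 200 positions for a_2 are drawn, and everything else goes to a_1. The contents of `a_tot` are placeholder data, and `chosen` is a name introduced here for illustration.

```python
import random

a_tot = list(range(1500))  # placeholder data standing in for the real list

# Draw 200 distinct positions instead of shuffling all 1500 elements.
chosen = set(random.sample(range(len(a_tot)), 200))

# Elements at the chosen positions go to a_2, the rest to a_1.
a_2 = [item for idx, item in enumerate(a_tot) if idx in chosen]
a_1 = [item for idx, item in enumerate(a_tot) if idx not in chosen]
```

Note that both lists keep the original relative order of their elements here. If the order inside each list should also be random, a full shuffle followed by slicing is the simpler route.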

Interestingly, the conceptual relationship between the two can be shown by implementing shuffle() in terms of sample():

def shuffle(p):
   p[:] = sample(p, len(p))

Or, vice versa, by implementing sample() in terms of shuffle():

def sample(p, k):
   p = list(p)
   shuffle(p)
   return p[:k]

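Neither of these is as efficient as the real implementations of shuffle() and sample(), but they do show the conceptual relationship between the two.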


1

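random.shuffle() shuffles the given list in place. Its length stays the same.

random.sample() picks n items from the given sequence without replacement (the sequence may also be a tuple or any other object that has a __len__()) and returns them in random order.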
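A small sketch of that difference in behavior; the variable names and values are illustrative only.

```python
import random

items = [1, 2, 3, 4, 5]
returned = random.shuffle(items)     # reorders items in place
# returned is None; items still holds the same five values, in a new order

letters = ("a", "b", "c", "d")       # a tuple cannot be shuffled in place
picked = random.sample(letters, 2)   # new list with 2 distinct elements
# letters itself is unchanged
```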


1
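The randomization should be equally good with both options. I'd suggest going with shuffle, because it makes it more immediately clear to the reader what the code does.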
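One way to write the shuffle-based split is sketched below. The helper `random_split` is hypothetical (it is not part of the random module), and it shuffles a copy on the assumption that the caller's list should keep its original order.

```python
import random

def random_split(items, first_size, rng=random):
    """Return two lists: a random first_size-element part and the remainder."""
    shuffled = list(items)      # copy, so the caller's list is not reordered
    rng.shuffle(shuffled)       # in-place shuffle of the copy
    return shuffled[:first_size], shuffled[first_size:]

a_tot = list(range(1500))       # placeholder data standing in for the real list
a_1, a_2 = random_split(a_tot, 1300)
```

If mutating a_tot is acceptable, the copy can be dropped, which gives exactly the three-line version from the question.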

0
from random import shuffle
from random import sample 
x = [[i] for i in range(10)]
shuffle(x)
sample(x,10)

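shuffle updates the list in place, whereas sample returns a new list. sample lets you choose how many items to pick through its second argument, whereas shuffle always gives you a list of the same length as the input.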

0

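I think they are pretty much the same, except that one updates the original list and the other only reads it. There is no difference in quality.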
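The "no difference in quality" claim can be checked empirically on a small case. The sketch below counts how often each element of a 10-element list lands in the last 3 positions under each method; the function name `tail_frequencies`, the sizes, and the seed are arbitrary choices made for this illustration.

```python
import random
from collections import Counter

def tail_frequencies(method, n=10, tail=3, trials=20000, seed=0):
    """Estimate how often each value ends up in the last `tail` positions."""
    rng = random.Random(seed)            # seeded for repeatable runs
    counts = Counter()
    for _ in range(trials):
        if method == "shuffle":
            order = list(range(n))
            rng.shuffle(order)           # randomize in place
        else:
            order = rng.sample(range(n), n)  # randomized copy
        counts.update(order[-tail:])     # record which values fell in the tail
    return {value: counts[value] / trials for value in range(n)}

shuffle_freq = tail_frequencies("shuffle")
sample_freq = tail_frequencies("sample")
```

With either method, every value should land in the tail with a frequency close to 3/10, which is what a uniformly random split would give.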

