Python 3.1 - Memory Error during sampling of a large list

Question

3

The input list can contain more than 1 million numbers. When I run the following code with a smaller value of 'repeats', it works fine:

import random

def sample(x):
    length = 1000000
    new_array = random.sample((list(x)),length)
    return new_array

def repeat_sample(x):
    repeats = 100
    list_of_samples = []
    for i in range(repeats):
        # every sample stays alive in this list until the function returns
        list_of_samples.append(sample(x))
    return list_of_samples

repeat_sample(large_array)

However, a high number of repeats (100 or more) raises a MemoryError. The traceback is as follows:

Traceback (most recent call last):
  File "C:\Python31\rnd.py", line 221, in <module>
    STORED_REPEAT_SAMPLE = repeat_sample(STORED_ARRAY)
  File "C:\Python31\rnd.py", line 129, in repeat_sample
    list_of_samples.append(sample(x))
  File "C:\Python31\rnd.py", line 121, in sample
    new_array = random.sample((list(x)),length)
  File "C:\Python31\lib\random.py", line 309, in sample
    result = [None] * k
MemoryError

I am guessing that I am running out of memory. I do not know how to get around this problem.

Thanks for your time!

- jimy

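Change your algorithm? What are the samples for? Can't you process them one sample at a time? - TryPyPy

You could reconfigure your system to have more virtual memory, which usually means freeing up more hard disk space. - martineau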

5 Answers

4

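There are two answers:

Unless you are using an old computer, it is unlikely that you are actually running out of memory. You get a MemoryError because you are probably using a 32-bit build of Python, which cannot allocate more than 2 GB of memory.
Your approach is wrong. You should use a random sample generator instead of building a list of samples.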
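
To check the first point, here is a minimal sketch that reports whether the running interpreter is a 32-bit or 64-bit build. The helper name `interpreter_bits` is made up for illustration; `struct.calcsize` and `sys.maxsize` are standard library.

```python
import struct
import sys

def interpreter_bits():
    """Return the pointer width of the running interpreter in bits."""
    # "P" is the struct format code for a C pointer (void *).
    return struct.calcsize("P") * 8

bits = interpreter_bits()
print("This Python is a %d-bit build" % bits)
# sys.maxsize is the largest container index: 2**31 - 1 on 32-bit builds.
print("sys.maxsize =", sys.maxsize)
```

If this reports 32 bits, installing a 64-bit build raises the address-space ceiling, although it does not make the algorithm itself any lighter.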

- Virgil Dupras

Ah, I see about the 32-bit thing. Thanks. - jimy

1

A generator version of random.sample() would also help:

from random import random
from math import ceil as _ceil, log as _log

def xsample(population, k):
    """A generator version of random.sample"""
    n = len(population)
    if not 0 <= k <= n:
        raise ValueError("sample larger than population")
    _int = int
    setsize = 21        # size of a small set minus size of an empty list
    if k > 5:
        setsize += 4 ** _ceil(_log(k * 3, 4)) # table size for big sets
    if n <= setsize or hasattr(population, "keys"):
        # An n-length list is smaller than a k-length set, or this is a
        # mapping type so the other algorithm wouldn't work.
        pool = list(population)
        for i in range(k):         # invariant:  non-selected at [0,n-i)
            j = _int(random() * (n-i))
            yield pool[j]
            pool[j] = pool[n-i-1]   # move non-selected item into vacancy
    else:
        try:
            selected = set()
            selected_add = selected.add
            for i in range(k):
                j = _int(random() * n)
                while j in selected:
                    j = _int(random() * n)
                selected_add(j)
                yield population[j]
        except (TypeError, KeyError):   # handle (at least) sets
            if isinstance(population, list):
                raise
            # fall back to this generator itself; a bare sample() is not defined here
            for x in xsample(tuple(population), k):
                yield x
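
A generator such as xsample only saves memory if the consumer also avoids collecting the items into a list. As a rough sketch of such a consumer, the hypothetical helper `running_mean` below takes any iterable and keeps only a running total and a count. The generator expression is a stand-in source so that the snippet runs on its own.

```python
import random

def running_mean(values):
    """Return the mean of an iterable, holding only one item at a time."""
    total = 0.0
    count = 0
    for value in values:
        total += value
        count += 1
    if count == 0:
        raise ValueError("mean of an empty iterable")
    return total / count

# Stand-in lazy source; any generator works here, including xsample(...).
lazy_values = (random.random() for _ in range(1000))
print(running_mean(lazy_values))
```

With the generator above, a call would look like running_mean(xsample(population, k)).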

- Lennart Regebro

0

The only improvement you can make is to change your code to:

list_of_samples = [random.sample(x, length) for _ in range(repeats)]

However, that doesn't change the fact that in the real world you cannot create a list of arbitrary length.
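
To put a number on that limit, the following sketch estimates how much memory the list objects alone need when every sample is kept. The values of `k` and `repeats` mirror the question. The estimate counts only the list structures (a header plus one pointer per item), not the population or any temporary copies, so treat it as a lower bound.

```python
import struct
import sys

k = 1000000      # items per sample, as in the question
repeats = 100    # number of samples kept alive at once

pointer_bytes = struct.calcsize("P")        # 4 on 32-bit, 8 on 64-bit builds
one_list_bytes = sys.getsizeof([None] * k)  # list header + k pointers
total_mib = one_list_bytes * repeats / (1024.0 * 1024.0)

print("one sample list: %d bytes" % one_list_bytes)
print("%d lists kept at once: about %.0f MiB" % (repeats, total_mib))
```

On a 32-bit build that is roughly 4 MB per list and about 400 MB for 100 of them, before counting the population and the list(x) copy made on every call.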

- SilentGhost

0

You could try using the array object, http://docs.python.org/py3k/library/array.html. It should be more memory-efficient than a list, but it may be slightly harder to use.
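
A minimal sketch of that idea, assuming the values are floats: the sample is copied into an array with typecode 'd', which stores raw C doubles instead of references to Python objects. The names `population` and `sample_as_array` and the sizes are illustrative only. How much memory this saves depends on the data, since a list of references to objects that already exist is itself fairly compact.

```python
import random
from array import array

population = [random.random() for _ in range(10000)]  # stand-in data

# Typecode 'd' stores each value as a raw C double, not a Python object.
sample_as_array = array('d', random.sample(population, 1000))

print(len(sample_as_array), sample_as_array.itemsize)
```

An array supports indexing, slicing, and sum() like a list, but it only accepts values that match its typecode.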

- Zuljin

- TryPyPy · Accepted Answer

Expanding on my comment:

Let's say the processing you do on each sample is calculating its mean.

def calc_means(samplelists):
    means = []
    n = float(len(samplelists[0]))
    for sample in samplelists:
        mean = sum(sample)/n
        means.append(mean)
    return means

calc_means(repeat_sample(large_array))

Keeping that many lists in memory at once is what strains your system. You can make it lighter like this:

def mean(sample, n):
    n = float(n)
    mean = sum(sample)/n
    return mean

def sample(x):
    length = 1000000 
    new_array = random.sample(x, length)
    return new_array

def repeat_means(x):    
    repeats = 100
    list_of_means = []
    for i in range(repeats):
        s = sample(x)  # only one sample is alive at a time
        list_of_means.append(mean(s, len(s)))
    return list_of_means    

repeat_means(large_array)

But that's still not good enough... You could do it all while constructing only the list of results:

import random

def sampling_mean(population, k, times):
    # Part of this is lifted straight from random.py
    _int = int
    _random = random.random

    n = len(population)
    kf = float(k)
    result = []

    if not 0 <= k <= n:
        raise ValueError("sample larger than population")

    for t in range(times):
        selected = set()
        sum_ = 0
        selected_add = selected.add

        for i in range(k):
            j = _int(_random() * n)
            while j in selected:
                j = _int(_random() * n)
            selected_add(j)
            sum_ += population[j]

        mean = sum_/kf
        result.append(mean)
    return result

sampling_mean(large_array, 1000000, 100)

Now, can your algorithm be streamlined like this?