How to fill two or more numpy arrays from a single iterable of tuples?

9
The actual problem I have is that I want to keep a long sorted list of (float, str) tuples in RAM. A plain list doesn't fit into my 4Gb of RAM, so I thought I could use two numpy.ndarrays instead.
The source of the data is an iterable of 2-tuples. numpy has a fromiter function, but how do I use it? The number of items in the iterable is unknown, and due to the memory constraints I cannot convert it to a list first. I thought of itertools.tee, but it seems to add a lot of memory overhead.
I guess I could consume the iterator in chunks and add those to the arrays. My question, then, is how to do that efficiently. Should I perhaps create two 2D arrays and append rows to them? (Later they would need to be converted to 1D.)
Or maybe there is a better approach altogether? What I really need is to search an array of strings by the value of the corresponding float in logarithmic time (that is why I want the data sorted by the float value), while keeping it all as compact as possible.
P.S. The iterable is not sorted.

Would using np.fromiter to build a single array with two columns suffice? - unutbu
@unutbu ... not sure why that didn't occur to me :) Sounds like a good idea. Then I just have to sort along the longer axis and keep it that way, right? You could post that as an answer, I think. - Lev Levitsky
2 Answers

8
Perhaps you could build a single, structured array using np.fromiter:
import numpy as np


def gendata():
    # You, of course, have a different gendata...
    for i in range(N):
        yield (np.random.random(), str(i))

N = 100

arr = np.fromiter(gendata(), dtype='<f8,|S20')

Sorting it by the first column, using the second for tie-breaking, will take O(N log N) time:

arr.sort(order=['f0','f1'])

Finding the row corresponding to a value in the first column can be done with searchsorted in O(log N) time:

# Some pseudo-random value in arr['f0']
val = arr['f0'][10]
print(arr[10])
# (0.049875262239617246, '46')

idx = arr['f0'].searchsorted(val)
print(arr[idx])
# (0.049875262239617246, '46')

You've raised a number of important issues in the comments; let me try to answer them here:

  • The basic dtypes are explained in the numpybook. There may be one or two extra dtypes (like float16) which have been added since that book was written, but the basics are all explained there.

    Perhaps a more thorough discussion can be found in the online documentation, which is a good supplement to the examples you mentioned here.

  • Dtypes can be used to define structured arrays with column names, or with default column names. 'f0', 'f1', etc. are default column names. Since I defined the dtype as '<f8,|S20' without providing column names, NumPy named the first column 'f0' and the second 'f1'. If we had used

    dtype=[('fval','<f8'), ('text','|S20')]


    then the structured array arr would have column names 'fval' and 'text'; the first sketch after this list shows this form.

  • Unfortunately, the dtype has to be fixed at the time np.fromiter is called. You could conceivably iterate through gendata once to discover the maximum length of the strings, build your dtype, and then call np.fromiter (iterating through gendata a second time), but that's rather burdensome; the first sketch after this list outlines the idea. It is of course better if you know the maximum size of the strings in advance. (|S20 defines the string field as having a fixed length of 20 bytes.)
  • NumPy arrays place data of a pre-defined size in arrays of a fixed size. Think of an array (even a multidimensional one) as a contiguous block of one-dimensional memory. (That's an oversimplification -- there are non-contiguous arrays -- but it will help your imagination for the following.) NumPy derives much of its speed from taking advantage of the fixed sizes (set by the dtype) to quickly compute the offsets needed to access elements in the array. If the strings had variable sizes, it would be hard for NumPy to find the right offsets. By hard, I mean NumPy would need an index or would somehow have to be redesigned. NumPy is simply not built this way.
  • NumPy does have an object dtype which allows you to place a pointer (4 bytes on 32-bit builds, 8 bytes on 64-bit) to any Python object you desire. This way, you can have NumPy arrays with arbitrary Python data. Unfortunately, the np.fromiter function does not allow you to create arrays of dtype object. I'm not sure why there is this restriction...
  • Note that np.fromiter has better performance when the count is specified. By knowing the count (the number of rows) and the dtype (and thus the size of each row), NumPy can pre-allocate exactly enough memory for the resulting array. If you do not specify the count, NumPy will make a guess at the initial size of the array and, if that is too small, try to resize it. If the original block of memory can be extended, you are in luck; but if NumPy has to allocate an entirely new block, all the old data will have to be copied to the new location, which slows performance significantly. The second sketch after this list demonstrates passing count.
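
To make the named-column and two-pass ideas above concrete, here is a minimal sketch. The field names fval and text are the arbitrary examples from this answer, and gendata is just a toy stand-in for the real data source:

import numpy as np

def gendata():
    # Toy stand-in for the real data source.
    for i in range(100):
        yield (np.random.random(), str(i))

# Pass 1 -- only feasible if the generator can be re-created cheaply:
# find the longest string so the dtype wastes no space.
maxlen = max(len(s) for _, s in gendata())

# A structured dtype with explicit column names.
dt = np.dtype([('fval', '<f8'), ('text', '|S%d' % maxlen)])

# Pass 2: fill the array and sort it.
arr = np.fromiter(gendata(), dtype=dt)
arr.sort(order=['fval', 'text'])

print(arr['fval'][:3])     # columns are accessed by name instead of 'f0'/'f1'
print(arr.dtype.itemsize)  # fixed number of bytes per row: 8 + maxlen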

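And if the number of rows happens to be known in advance (not the case in the question, but worth illustrating), passing it as count lets np.fromiter pre-allocate exactly the right amount of memory. A minimal sketch:

import numpy as np

def gendata(n):
    for i in range(n):
        yield (np.random.random(), str(i))

N = 100
# With count given, NumPy neither guesses the size nor resizes later.
arr = np.fromiter(gendata(N), dtype='<f8,|S20', count=N)
print(arr.shape)  # (100,)
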
Wow, lots of new stuff here, e.g. the fX indexing syntax, but mostly the dtype you used. First of all, are the possible dtypes documented anywhere? I found this, but I'd like some explanation rather than just examples. Does the size have to be fixed (as I suppose it does for regular arrays)? Because in an ideal world I'd want it to have no upper limit, and short strings to take up no extra space. Can I get something like that? - Lev Levitsky
@Jaime: If you don't specify count, np.fromiter will have to resize the numpy array when the data outgrows the pre-allocated output array. If you have enough contiguous memory, it won't have to copy the data when resizing, and at no point is a Python list used. - unutbu
This gets into a region of NumPy I'm not familiar with. As far as I understand the C source, the call to PyDataType_REFCHK(dtype) fails when the dtype (or part of it) is of object type. I don't understand the C well enough, so all I can do is refer you to the source code - unutbu
Thanks again for your help; my problem now is that arr['f0'].searchsorted(val) seems to be very slow compared to searchsorted on a separate array of floats. Is there an obvious reason and solution, or should I ask a separate question? - Lev Levitsky
I think the searchsorted performance issue is definitely worth a question on Stack Overflow. If you don't get a satisfactory answer there, you could also try posting to the NumPy-discussion mailing list. You'll get a quicker initial response there than by filing a ticket, and the mailing list will also bring the issue to the developers' attention. - unutbu

1
Here is a way to build N separate arrays from a generator whose items are N-tuples:
import numpy as np
import itertools as IT


def gendata():
    # You, of course, have a different gendata...
    N = 100
    for i in range(N):
        yield (np.random.random(), str(i))


def fromiter(iterable, dtype, chunksize=7):
    # Read the first chunk to set up one result array per column.
    chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
    result = [chunk[name].copy() for name in chunk.dtype.names]
    size = len(chunk)
    while True:
        chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
        N = len(chunk)
        if N == 0:
            # The iterable is exhausted.
            break
        newsize = size + N
        for arr, name in zip(result, chunk.dtype.names):
            col = chunk[name]
            # Grow each column array in place and append the new chunk.
            arr.resize(newsize, refcheck=False)
            arr[size:] = col
        size = newsize
    return result

x, y = fromiter(gendata(), '<f8,|S20')

order = np.argsort(x)
x = x[order]
y = y[order]

# Some pseudo-random value in x
N = 10
val = x[N]
print(x[N], y[N])
# (0.049875262239617246, '46')

idx = x.searchsorted(val)
print(x[idx], y[idx])
# (0.049875262239617246, '46')

The fromiter function above reads the iterable in chunks (of size chunksize). It calls the NumPy array method resize to extend the resulting arrays as needed.
The default chunksize is small because I tested this code on small data. You can of course change the default, or pass a chunksize argument with a larger value.
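
For example, reusing the fromiter and gendata defined above (65536 is an arbitrary value, just to illustrate):

x, y = fromiter(gendata(), '<f8,|S20', chunksize=65536)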

Yes, reading in chunks was on my mind too, thanks for the nice example. Could we pass chunksize as count to np.fromiter here to speed things up? - Lev Levitsky
Unfortunately, I see no way around it. If we used count=chunksize, the call to np.fromiter can fail when the iterable holds fewer than chunksize items. And if we try to catch that in a try..except block, we lose data, since the iterable is good for only one pass. - unutbu
