How to fill two or more numpy arrays from a single iterable of tuples?

9
The actual problem I have is that I want to keep a long sorted list of (float, str) tuples in RAM. A plain list doesn't fit into my 4Gb of RAM, so I thought I could use two numpy.ndarrays instead.
The source of the data is an iterable of 2-tuples. numpy has a fromiter function, but how do I use it? The number of items in the iterable is unknown, and due to the memory constraints I cannot convert it to a list first. I thought of itertools.tee, but it seems to add a lot of memory overhead.
I guess I could consume the iterator in chunks and add those to the arrays. My question, then, is how to do that efficiently. Should I perhaps create two 2D arrays and append rows to them? (Later they would need to be converted to 1D.)
Or maybe there is a better approach altogether? What I really need is to search an array of strings by the value of the corresponding float in logarithmic time (that is why I want the data sorted by the float value), while keeping it all as compact as possible.
P.S. The iterable is not sorted.

Would using np.fromiter to build a single array with two columns suffice? - unutbu
@unutbu ... not sure why that didn't occur to me :) Sounds like a good idea. Then I just have to sort along the longer axis and keep it that way, right? You could post that as an answer, I think. - Lev Levitsky
2 Answers

8
Perhaps you could build a single, structured array using np.fromiter:
import numpy as np


def gendata():
    # You, of course, have a different gendata...
    for i in range(N):
        yield (np.random.random(), str(i))

N = 100

arr = np.fromiter(gendata(), dtype='<f8,|S20')

Sorting it by the first column, using the second for tie-breaking, will take O(N log N) time:

arr.sort(order=['f0','f1'])

Finding the row corresponding to a value in the first column can be done with searchsorted in O(log N) time:

# Some pseudo-random value in arr['f0']
val = arr['f0'][10]
print(arr[10])
# (0.049875262239617246, '46')

idx = arr['f0'].searchsorted(val)
print(arr[idx])
# (0.049875262239617246, '46')

You've raised a number of important issues in the comments; let me try to answer them here:

  • The basic dtypes are explained in the numpybook. There may be one or two extra dtypes (like float16) which have been added since that book was written, but the basics are all explained there.

    Perhaps a more thorough discussion can be found in the online documentation, which is a good supplement to the examples you mentioned here.

  • Dtypes can be used to define structured arrays with column names, or with default column names. 'f0', 'f1', etc. are default column names. Since I defined the dtype as '<f8,|S20' without providing column names, NumPy named the first column 'f0' and the second 'f1'. If we had used

    dtype=[('fval','<f8'), ('text','|S20')]


    then the structured array arr would have column names 'fval' and 'text'; the first sketch after this list shows this form.

  • Unfortunately, the dtype has to be fixed at the time np.fromiter is called. You could conceivably iterate through gendata once to discover the maximum length of the strings, build your dtype, and then call np.fromiter (iterating through gendata a second time), but that's rather burdensome; the first sketch after this list outlines the idea. It is of course better if you know the maximum size of the strings in advance. (|S20 defines the string field as having a fixed length of 20 bytes.)
  • NumPy arrays place data of a pre-defined size in arrays of a fixed size. Think of an array (even a multidimensional one) as a contiguous block of one-dimensional memory. (That's an oversimplification -- there are non-contiguous arrays -- but it will help your imagination for the following.) NumPy derives much of its speed from taking advantage of the fixed sizes (set by the dtype) to quickly compute the offsets needed to access elements in the array. If the strings had variable sizes, it would be hard for NumPy to find the right offsets. By hard, I mean NumPy would need an index or would somehow have to be redesigned. NumPy is simply not built this way.
  • NumPy does have an object dtype which allows you to place a pointer (4 bytes on 32-bit builds, 8 bytes on 64-bit) to any Python object you desire. This way, you can have NumPy arrays with arbitrary Python data. Unfortunately, the np.fromiter function does not allow you to create arrays of dtype object. I'm not sure why there is this restriction...
  • Note that np.fromiter has better performance when the count is specified. By knowing the count (the number of rows) and the dtype (and thus the size of each row), NumPy can pre-allocate exactly enough memory for the resulting array. If you do not specify the count, NumPy will make a guess at the initial size of the array and, if that is too small, try to resize it. If the original block of memory can be extended, you are in luck; but if NumPy has to allocate an entirely new block, all the old data will have to be copied to the new location, which slows performance significantly. The second sketch after this list demonstrates passing count.
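
To make the named-column and two-pass ideas above concrete, here is a minimal sketch. The field names fval and text are the arbitrary examples from this answer, and gendata is just a toy stand-in for the real data source:

import numpy as np

def gendata():
    # Toy stand-in for the real data source.
    for i in range(100):
        yield (np.random.random(), str(i))

# Pass 1 -- only feasible if the generator can be re-created cheaply:
# find the longest string so the dtype wastes no space.
maxlen = max(len(s) for _, s in gendata())

# A structured dtype with explicit column names.
dt = np.dtype([('fval', '<f8'), ('text', '|S%d' % maxlen)])

# Pass 2: fill the array and sort it.
arr = np.fromiter(gendata(), dtype=dt)
arr.sort(order=['fval', 'text'])

print(arr['fval'][:3])     # columns are accessed by name instead of 'f0'/'f1'
print(arr.dtype.itemsize)  # fixed number of bytes per row: 8 + maxlen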

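And if the number of rows happens to be known in advance (not the case in the question, but worth illustrating), passing it as count lets np.fromiter pre-allocate exactly the right amount of memory. A minimal sketch:

import numpy as np

def gendata(n):
    for i in range(n):
        yield (np.random.random(), str(i))

N = 100
# With count given, NumPy neither guesses the size nor resizes later.
arr = np.fromiter(gendata(N), dtype='<f8,|S20', count=N)
print(arr.shape)  # (100,)
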
Wow, lots of new stuff here, e.g. the fX indexing syntax, but mostly the dtype you used. First of all, are the possible dtypes documented anywhere? I found this, but I'd like some explanation rather than just examples. Does the size have to be fixed (as I suppose it does for regular arrays)? Because in an ideal world I'd want it to have no upper limit, and short strings to take up no extra space. Can I get something like that? - Lev Levitsky
@Jaime: If you don't specify count, np.fromiter will have to resize the numpy array when the data outgrows the pre-allocated output array. If you have enough contiguous memory, it won't have to copy the data when resizing, and at no point is a Python list used. - unutbu
This gets into a region of NumPy I'm not familiar with. As far as I understand the C source, the call to PyDataType_REFCHK(dtype) fails when the dtype (or part of it) is of object type. I don't understand the C well enough, so all I can do is refer you to the source code - unutbu
Thanks again for your help; my problem now is that arr['f0'].searchsorted(val) seems to be very slow compared to searchsorted on a separate array of floats. Is there an obvious reason and solution, or should I ask a separate question? - Lev Levitsky
I think the searchsorted performance issue is definitely worth a question on Stack Overflow. If you don't get a satisfactory answer there, you could also try posting to the NumPy-discussion mailing list. You'll get a quicker initial response there than by filing a ticket, and the mailing list will also bring the issue to the developers' attention. - unutbu

1
Here is a way to build N separate arrays from a generator whose items are N-tuples:
import numpy as np
import itertools as IT


def gendata():
    # You, of course, have a different gendata...
    N = 100
    for i in range(N):
        yield (np.random.random(), str(i))


def fromiter(iterable, dtype, chunksize=7):
    # Read the first chunk to set up one result array per column.
    chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
    result = [chunk[name].copy() for name in chunk.dtype.names]
    size = len(chunk)
    while True:
        chunk = np.fromiter(IT.islice(iterable, chunksize), dtype=dtype)
        N = len(chunk)
        if N == 0:
            # The iterable is exhausted.
            break
        newsize = size + N
        for arr, name in zip(result, chunk.dtype.names):
            col = chunk[name]
            # Grow each column array in place and append the new chunk.
            arr.resize(newsize, refcheck=False)
            arr[size:] = col
        size = newsize
    return result

x, y = fromiter(gendata(), '<f8,|S20')

order = np.argsort(x)
x = x[order]
y = y[order]

# Some pseudo-random value in x
N = 10
val = x[N]
print(x[N], y[N])
# (0.049875262239617246, '46')

idx = x.searchsorted(val)
print(x[idx], y[idx])
# (0.049875262239617246, '46')

The fromiter function above reads the iterable in chunks (of size chunksize). It calls the NumPy array method resize to extend the resulting arrays as needed.
The default chunksize is small because I tested this code on small data. You can of course change the default, or pass a chunksize argument with a larger value.
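
For example, reusing the fromiter and gendata defined above (65536 is an arbitrary value, just to illustrate):

x, y = fromiter(gendata(), '<f8,|S20', chunksize=65536)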

Yes, reading in chunks was on my mind too, thanks for the nice example. Could we pass chunksize as count to np.fromiter here to speed things up? - Lev Levitsky
Unfortunately, I see no way around it. If we used count=chunksize, the call to np.fromiter can fail when the iterable holds fewer than chunksize items. And if we try to catch that in a try..except block, we lose data, since the iterable is good for only one pass. - unutbu
