numpy array size vs. speed of concatenation

Question

5

I am concatenating data to a numpy array like this:

xdata_test = np.concatenate((xdata_test,additional_X))

This is done a thousand times. The arrays have dtype float32, and their sizes are shown below:

xdata_test.shape   :  (x1,40,24,24)        (x1 : [500~10500])   
additional_X.shape :  (x2,40,24,24)        (x2 : [0 ~ 500])

The problem is that once x1 grows beyond roughly 2000-3000, each concatenation takes noticeably longer.

The graph below shows the relation between concatenation time and the size of the x2 dimension:

Is this a memory problem or a basic characteristic of numpy?

- MJ.Shin

2 Answers

5

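If you have the arrays you want to concatenate ahead of time, I would suggest creating a new array with the total shape and filling it with the small arrays instead of concatenating, because every concatenation has to copy the whole data into a new contiguous block of memory.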
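To make the copying concrete, here is a minimal sketch. The small array sizes are stand-ins I picked so it runs quickly; only the trailing shape (40, 24, 24) and the float32 dtype come from the question. It checks that the result of `np.concatenate` shares no memory with its input, and estimates how many bytes a single call has to write at the upper end of the sizes mentioned in the question.

```python
import numpy as np

# Stand-in arrays: small first axis, same trailing shape and dtype as in the question.
xdata_test = np.zeros((8, 40, 24, 24), dtype=np.float32)
additional_X = np.ones((2, 40, 24, 24), dtype=np.float32)

result = np.concatenate((xdata_test, additional_X))

# The result is a freshly allocated array, so the old data had to be copied into it.
shares_with_input = np.shares_memory(result, xdata_test)

# Bytes per entry along the first axis: 40 * 24 * 24 float32 values.
bytes_per_row = 40 * 24 * 24 * np.dtype(np.float32).itemsize

# Bytes written by one concatenate call when x1 = 10500 and x2 = 500.
bytes_copied_at_max = (10500 + 500) * bytes_per_row

print(shares_with_input)
print(bytes_per_row, bytes_copied_at_max)
```

So near the top of the range each call writes about 1 GB, and that cost is paid again on every one of the repeated concatenations.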

First, calculate the total size of the first axis:

max_x = 0
for arr in list_of_arrays:
    max_x += arr.shape[0]

Second, create the end container:
```
final_data = np.empty((max_x,) + xdata_test.shape[1:], dtype=xdata_test.dtype)
```
which is equivalent to a shape of (max_x, 40, 24, 24), except that the trailing dimensions and the dtype are taken from xdata_test instead of being hard-coded.

Last, fill the numpy array:

curr_x = 0
for arr in list_of_arrays:
    final_data[curr_x:curr_x+arr.shape[0]] = arr
    curr_x += arr.shape[0]

The loop above copies each array into its own slice (along the first axis) of the large array defined earlier.

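By doing this, each of the N arrays is copied exactly once, straight to its final destination, instead of a temporary array being created for every concatenation.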
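Put together, the three steps could look roughly like the sketch below. The function name `stack_preallocated` and the small random test chunks are my own placeholders, not something from the question, and it assumes the list is non-empty and that every array has the same trailing shape and dtype.

```python
import numpy as np

def stack_preallocated(list_of_arrays):
    """Copy every chunk into one preallocated array (hypothetical helper)."""
    first = list_of_arrays[0]
    # Step 1: total size of the first axis.
    total = sum(arr.shape[0] for arr in list_of_arrays)
    # Step 2: allocate the final container once.
    out = np.empty((total,) + first.shape[1:], dtype=first.dtype)
    # Step 3: copy each chunk into its slice.
    pos = 0
    for arr in list_of_arrays:
        n = arr.shape[0]
        out[pos:pos + n] = arr
        pos += n
    return out

# Small made-up chunks, including an empty one (x2 can be 0 in the question).
rng = np.random.default_rng(0)
chunks = [rng.random((n, 4, 3, 3), dtype=np.float32) for n in (5, 0, 2, 7)]
final_data = stack_preallocated(chunks)

# Same contents as a single np.concatenate over all chunks.
print(final_data.shape, np.array_equal(final_data, np.concatenate(chunks)))
```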

- Imanol Luengo


- Benoit Seguin · Accepted Answer

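As far as I know, none of the stack and concatenate functions in numpy are very efficient. There is a good reason for that: numpy tries to keep array memory contiguous for efficiency (see the linked discussion of contiguous arrays in numpy).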

That means every concatenation has to copy the whole data. When I need to concatenate a bunch of elements together, I tend to do this:

l = []
for additional_X in ...:
    l.append(additional_X)  # appending to a list does not copy the array data
xdata_test = np.concatenate(l)  # everything is copied once, at the end

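That way, the expensive operation of moving the whole data happens only once. Note: I would be interested to hear how much of a speed improvement this gives you.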
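One rough way to measure it is sketched below. The chunk count and sizes are arbitrary values I chose so it runs in a moment, and the two function names are just labels for the two approaches. The timings will vary by machine, so treat them as an indication only.

```python
import time
import numpy as np

def grow_by_concatenate(chunks):
    # Pattern from the question: the growing array is re-copied on every step.
    data = chunks[0]
    for chunk in chunks[1:]:
        data = np.concatenate((data, chunk))
    return data

def collect_then_concatenate(chunks):
    # Pattern from this answer: collect references, copy once at the end.
    parts = []
    for chunk in chunks:
        parts.append(chunk)
    return np.concatenate(parts)

# 40 made-up chunks of 5 entries each, with the trailing shape from the question.
chunks = [np.full((5, 40, 24, 24), i, dtype=np.float32) for i in range(40)]

t0 = time.perf_counter()
slow = grow_by_concatenate(chunks)
t1 = time.perf_counter()
fast = collect_then_concatenate(chunks)
t2 = time.perf_counter()

print(f"repeated concatenate: {t1 - t0:.4f} s")
print(f"single concatenate:   {t2 - t1:.4f} s")
print("same result:", np.array_equal(slow, fast))
```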