Dot product of huge arrays in NumPy

Question

3

I have a huge array and want to compute its dot product with a small array, but I get an "array is too big" error. Is there any workaround?

import numpy as np

eMatrix = np.random.random_integers(low=0,high=100,size=(20000000,50))
pMatrix = np.random.random_integers(low=0,high=10,size=(50,50))

a = np.dot(eMatrix,pMatrix)

Error:
/Library/Python/2.7/site-packages/numpy/random/mtrand.so in mtrand.RandomState.random_integers (numpy/random/mtrand/mtrand.c:9385)()

/Library/Python/2.7/site-packages/numpy/random/mtrand.so in mtrand.RandomState.randint (numpy/random/mtrand/mtrand.c:7051)()

ValueError: array is too big.

- Lanc

2

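This already happens at eMatrix =, doesn't it? You are requesting 10^9 integers, i.e. one GB times the number of bytes per integer. So at the very least you should store them in an int8 array rather than the default int64. - mdurant

But I have a 64-bit machine with 16GB of RAM. - Lanc

So 8GB for the first, ePrime, at least as much again for the second, and possibly some unknown intermediate memory requirement on top. - mdurant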
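The arithmetic behind these comments can be sketched as follows. The helper name `array_nbytes` is made up for illustration; it only multiplies the element count by the dtype's item size and does not allocate anything.

```python
import numpy as np

def array_nbytes(shape, dtype):
    """Bytes needed to hold a dense array of the given shape and dtype."""
    n_elements = 1
    for dim in shape:
        n_elements *= dim  # plain Python ints, so this cannot overflow
    return n_elements * np.dtype(dtype).itemsize

shape = (20000000, 50)
for dtype in (np.int8, np.int32, np.int64):
    gb = array_nbytes(shape, dtype) / 1e9
    print("%s: %.1f GB" % (np.dtype(dtype).name, gb))
# int8: 1.0 GB
# int32: 4.0 GB
# int64: 8.0 GB
```

Values in the range 0 to 100 fit in int8, but the dot product does not: each output entry can be as large as 50 * 100 * 10 = 50000, so the result needs a wider type such as int32.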

3 Answers

0

I think the only "easy" answer is to add more memory.

It took 15GB, but I was able to do this on my MacBook.

In [1]: import numpy
In [2]: e = numpy.random.random_integers(low=0, high=100, size=(20000000, 50))
In [3]: p = numpy.random.random_integers(low=0, high=10, size=(50, 50))
In [4]: a = numpy.dot(e, p)
In [5]: a[0]
Out[5]:
array([14753, 12720, 15324, 13588, 16667, 16055, 14144, 15239, 15166,
       14293, 16786, 12358, 14880, 13846, 11950, 13836, 13393, 14679,
       15292, 15472, 15734, 12095, 14264, 12242, 12684, 11596, 15987,
       15275, 13572, 14534, 16472, 14818, 13374, 14115, 13171, 11927,
       14226, 13312, 16070, 13524, 16591, 16533, 15466, 15440, 15595,
       13164, 14278, 13692, 12415, 13314])
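If adding memory is not an option, one possible way to lower the peak is to combine the int8 suggestion from the comments with a block-wise product. This is a sketch of a different technique, not what the answer above did: `chunked_dot`, `chunk_rows` and `out_dtype` are hypothetical names, and the demo uses a deliberately small array.

```python
import numpy as np

def chunked_dot(a, b, chunk_rows=1000000, out_dtype=np.int32):
    """Compute a.dot(b) block by block into a preallocated dense result."""
    out = np.empty((a.shape[0], b.shape[1]), dtype=out_dtype)
    b_wide = b.astype(out_dtype)  # small matrix, upcast once
    for start in range(0, a.shape[0], chunk_rows):
        stop = min(start + chunk_rows, a.shape[0])
        # Upcast only the current block, so the temporary copy stays small.
        block = a[start:stop].astype(out_dtype)
        np.dot(block, b_wide, out=out[start:stop])
    return out

# Small demo with the same value ranges as the question.
rng = np.random.RandomState(0)
e = rng.randint(0, 101, size=(2000, 50)).astype(np.int8)
p = rng.randint(0, 11, size=(50, 50)).astype(np.int8)
a = chunked_dot(e, p, chunk_rows=256)
print(a.shape, a.dtype)
```

At the question's full size, an int8 input is about 1 GB and an int32 result about 4 GB, plus one block-sized temporary. The result is still dense, so this only helps on a 64-bit build with enough RAM for those arrays.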

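One possible solution is to use sparse matrices and the sparse matrix dot operator.

For example, on my machine just constructing the dense matrix e uses 8GB of memory. Constructing a similar sparse matrix eprime: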

In [1]: from scipy.sparse import rand
In [2]: eprime = rand(20000000, 50)

costs next to nothing in terms of memory.

- stderr

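I believe that as soon as you do a computation like a dot product, you will end up with a dense matrix again. - mdurant

Hey @stderr, as I mentioned above, I also tried it on a Mac with 16GB of RAM, but it failed. - Lanc

Also, I don't want a sparse matrix; my matrix needs to be dense. - Lanc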

0

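I think the answer is that you don't have enough RAM, and possibly that you are running a 32-bit version of Python. It might help to clarify which operating system you are running. Many operating systems can run both 32-bit and 64-bit programs.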

- beiller

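How can I check whether I am running a 32-bit version of Python? - Lanc

As mentioned above, see the following link to determine whether you are running a 64-bit or 32-bit Python executable: https://dev59.com/VnM_5IYBdhLWcg3wXx_Z - beiller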

- Jaime · Accepted Answer

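That error is raised when the total size of the array, as it is being computed, overflows the native int type. See here for the exact source code line.

For that to happen, regardless of your machine being 64-bit, you are almost certainly running a 32-bit version of Python (and NumPy). You can check whether that is the case by doing: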

>>> import sys
>>> sys.maxsize
2147483647 # <--- 2**31 - 1, on a 64 bit version you would get 2**63 - 1
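As a complementary check, the pointer width of the running interpreter and the size of NumPy's index type can be printed directly. This is a small sketch using only standard calls; `interpreter_bits` is a name chosen here for illustration.

```python
import struct
import sys

import numpy as np

def interpreter_bits():
    """Pointer width, in bits, of the running Python executable."""
    return struct.calcsize("P") * 8

print("Python build: %d-bit" % interpreter_bits())
print("sys.maxsize is 2**63 - 1:", sys.maxsize == 2**63 - 1)
# np.intp is the integer type NumPy uses for indexing.
print("NumPy index type: %d-bit" % (np.dtype(np.intp).itemsize * 8))
```

On a 32-bit build all three lines report 32-bit values, even when the operating system and hardware are 64-bit.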

Then again, your array is "only" 20000000 * 50 = 1000000000 elements, which is just under 2**30. If I try to reproduce your result on a 32-bit NumPy, I get a MemoryError:

>>> np.random.random_integers(low=0,high=100,size=(20000000,50))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "mtrand.pyx", line 1420, in mtrand.RandomState.random_integers (numpy\random\mtrand\mtrand.c:12943)
  File "mtrand.pyx", line 938, in mtrand.RandomState.randint (numpy\random\mtrand\mtrand.c:10338)
MemoryError

Unless I increase the size beyond the magical 2**31 - 1 threshold:

>>> np.random.random_integers(low=0,high=100,size=(2**30, 2))
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "mtrand.pyx", line 1420, in mtrand.RandomState.random_integers (numpy\random\mtrand\mtrand.c:12943)
  File "mtrand.pyx", line 938, in mtrand.RandomState.randint (numpy\random\mtrand\mtrand.c:10338)
ValueError: array is too big.
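The two shapes tried above can be compared against the 2**31 - 1 limit with plain Python integers. This sketch only reproduces the element-count arithmetic used in this answer; it makes no claim about whether the internal NumPy check counts elements or bytes. `element_count` and `INT32_MAX` are names introduced here.

```python
INT32_MAX = 2**31 - 1  # largest value of a signed 32-bit integer

def element_count(shape):
    """Total number of elements for a shape, using unbounded Python ints."""
    n = 1
    for dim in shape:
        n *= dim
    return n

for shape in [(20000000, 50), (2**30, 2)]:
    n = element_count(shape)
    verdict = "exceeds 2**31 - 1" if n > INT32_MAX else "fits below 2**31 - 1"
    print(shape, n, verdict)
# (20000000, 50) 1000000000 fits below 2**31 - 1
# (1073741824, 2) 2147483648 exceeds 2**31 - 1
```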

Given that your traceback differs from mine, I suspect you are using an older version. What does this output on your system:

>>> np.__version__
'1.10.0.dev-9c50f98'