Initializing a high-dimensional sparse matrix

Question

3

I want to initialize a 300,000 x 300,000 sparse matrix using sklearn, but it seems to require as much memory as a dense matrix would.

>>> from scipy import sparse
>>> sparse.rand(300000,300000,.1)

This raises the error:

MemoryError: Unable to allocate 671. GiB for an array with shape (300000, 300000) and data type float64

which is the same error I get when initializing with numpy:

np.random.normal(size=[300000, 300000])

Even when I go to a very low density, it still raises the error:

>>> from scipy import sparse
>>> sparse.rand(300000,300000,.000000000001)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../python3.8/site-packages/scipy/sparse/construct.py", line 842, in rand
    return random(m, n, density, format, dtype, random_state)
  File ".../lib/python3.8/site-packages/scipy/sparse/construct.py", line 788, in random
    ind = random_state.choice(mn, size=k, replace=False)
  File "mtrand.pyx", line 980, in numpy.random.mtrand.RandomState.choice
  File "mtrand.pyx", line 4528, in numpy.random.mtrand.RandomState.permutation
MemoryError: Unable to allocate 671. GiB for an array with shape (90000000000,) and data type int64

Is there a more memory-efficient way to create such a sparse matrix?

- rando

1

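Where are you specifying the density of the matrix population? As far as I can tell, you are using a sparse data structure on a non-sparse matrix. - kpie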

@kpie density=0.1 is the third argument to sparse.rand. Even if I choose a much smaller value (e.g. density=0), it still gives the same error. - rando

3

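sparse.rand uses the choice method to draw k random indices from an integer space of size 300000*300000. I use this function a lot to generate sample sparse matrices, but usually only for reasonable test cases such as 10x10. Evidently it is not a way to generate very large matrices, no matter how sparse you make them. The final matrix won't take up that much space, but the method used to generate the indices temporarily does. - hpaulj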
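
The size quoted in the traceback can be checked with a little arithmetic. This is a minimal sketch, assuming one int64 (8 bytes) per candidate index as the error message reports; the variable names are illustrative and not taken from scipy.

```python
import numpy as np

n_rows, n_cols = 300000, 300000

# Number of cells in the flattened index space that the permutation spans.
n_cells = n_rows * n_cols

# Each candidate index is stored as an int64, i.e. 8 bytes.
bytes_needed = n_cells * np.dtype(np.int64).itemsize

# Convert to GiB (2**30 bytes) to compare with the MemoryError message.
gib_needed = bytes_needed / 2**30
print(f"{n_cells} indices -> {gib_needed:.1f} GiB")
```

The result lands within rounding of the 671 GiB in the error, and it does not depend on the density at all, which is consistent with the error appearing even for tiny densities.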

3

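scipy.sparse provides several ways to create a sparse matrix. A common one is the coo-style input of 3 arrays, where you choose the indices and the data values yourself. Another, slower, way is to start with a lil matrix of the right shape and assign elements "at random". sparse.random is just a convenience for creating test matrices and is rarely used for real production purposes. - hpaulj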
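
The two construction routes mentioned in this comment might look roughly like the following. This is only a sketch: the three positions and values are made up for illustration, and the shape is borrowed from the question.

```python
import numpy as np
from scipy import sparse

shape = (300000, 300000)

# COO-style input: three parallel arrays of equal length
# (row indices, column indices, values). These values are arbitrary.
rows = np.array([0, 5, 299999])
cols = np.array([1, 5, 0])
vals = np.array([1.5, 2.5, 3.5])
coo = sparse.coo_matrix((vals, (rows, cols)), shape=shape)

# LIL route: start from an empty matrix of the right shape
# and assign elements one at a time (slower, but convenient).
lil = sparse.lil_matrix(shape)
for r, c, v in zip(rows, cols, vals):
    lil[r, c] = v

# Both routes should describe the same matrix.
difference = coo.tocsr() - lil.tocsr()
print(coo.nnz, lil.nnz, abs(difference).sum())
```

Neither route ever touches an array with one entry per cell, so memory use is driven by the number of stored elements (plus per-row overhead for lil and csr).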

3 Answers

1

Just generate only what you need.

from scipy import sparse
import numpy as np

n, m = 300000, 300000
density = 0.00000001
size = int(n * m * density)

rows = np.random.randint(0, n, size=size)
cols = np.random.randint(0, m, size=size)
data = np.random.rand(size)

arr = sparse.csr_matrix((data, (rows, cols)), shape=(n, m))

This lets you build very large sparse arrays, as long as they are sparse enough to fit in memory.

>>> arr
<300000x300000 sparse matrix of type '<class 'numpy.float64'>'
    with 900 stored elements in Compressed Sparse Row format>

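This is probably how the sparse.rand constructor should work. If any row/column pairs collide, their data values are added together, which is probably fine for every application I can think of.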
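
The summing of colliding entries can be seen on a tiny example. A minimal sketch with hand-picked indices (chosen here for illustration, not part of the answer's code), where position (0, 1) is deliberately listed twice:

```python
import numpy as np
from scipy import sparse

# The position (0, 1) appears twice on purpose.
rows = np.array([0, 0, 2])
cols = np.array([1, 1, 3])
data = np.array([1.0, 2.0, 5.0])

small = sparse.csr_matrix((data, (rows, cols)), shape=(3, 4))

# The duplicate entries collapse into one stored element holding their sum.
print(small.nnz)
print(small[0, 1])
```

So a collision leaves the matrix with slightly fewer stored elements than requested, and the surviving element holds the sum of the colliding values. At the densities discussed here collisions should be rare.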

- CJR

0

@hpaulj's comment is spot on. There is also a hint in the error message.

MemoryError: Unable to allocate 671. GiB for an array with shape (90000000000,) and data type int64

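Note that it mentions int64 rather than float64, and a one-dimensional array of length 300,000 x 300,000. This points to an intermediate random-sampling step in creating the sparse matrix, and that step on its own takes up a lot of memory.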

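Note that when creating any sparse matrix (regardless of format), you have to account for both the memory for the nonzero values and the memory for representing the positions of those values in the matrix.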
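
To make the two memory components concrete, here is a rough sketch that adds up the underlying arrays of a CSR and a COO matrix. It assumes the 900-element example from the other answer; the seed and the helper variable names are arbitrary choices for illustration.

```python
import numpy as np
from scipy import sparse

n, m = 300000, 300000
nnz_target = 900

# Random positions and values; the seed is arbitrary.
rng = np.random.default_rng(0)
rows = rng.integers(0, n, size=nnz_target)
cols = rng.integers(0, m, size=nnz_target)
data = rng.random(nnz_target)

csr = sparse.csr_matrix((data, (rows, cols)), shape=(n, m))
coo = csr.tocoo()

# Values (data) plus positions (indices/indptr for CSR, row/col for COO).
csr_bytes = csr.data.nbytes + csr.indices.nbytes + csr.indptr.nbytes
coo_bytes = coo.data.nbytes + coo.row.nbytes + coo.col.nbytes
print(csr_bytes, coo_bytes)
```

For CSR the indptr array has one entry per row plus one, regardless of how many values are stored, so with this few nonzeros the positional bookkeeping outweighs the values themselves.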

- Hari

- kpie · Accepted Answer

Try passing a reasonable "density" argument as shown in the docs... if you have something like a trillion cells, you probably want 0.00000001 or thereabouts...

https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.rand.html#scipy.sparse.rand