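The following may work as a general concept, but you will have to work out a lot of the details yourself... You should first get familiar with the CSR format, in which all the information for an array is stored in 3 arrays: two whose length is the number of non-zero entries, and one whose length is the number of rows plus one: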
>>> import scipy.sparse as sps
>>> a = sps.rand(5, 5, density=0.2, format='csr')
>>> a.toarray()
array([[ 0. , 0.46531486, 0.03849468, 0.51743202, 0. ],
[ 0. , 0.67028033, 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0. , 0. , 0.9967058 ],
[ 0. , 0. , 0. , 0. , 0. ]])
>>> a.data
array([ 0.46531486, 0.03849468, 0.51743202, 0.67028033, 0.9967058 ])
>>> a.indices
array([1, 2, 3, 1, 4])
>>> a.indptr
array([0, 3, 4, 4, 5, 5])
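a.data stores the non-zero entries in row-major order; a.indices stores the column index of each corresponding non-zero entry; a.indptr stores, for each row, the index in the other two arrays where that row's data starts. For example, a.indptr[3] = 4 and a.indptr[3+1] = 5 mean that the non-zero entries of the fourth row are in a.data[4:5], and their column indices are in a.indices[4:5].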
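To make the indexing concrete, here is a minimal sketch that looks up one row by hand, using the three arrays from the 5x5 example above typed in as literals. The helper name row_entries is made up for illustration and is not part of scipy.

```python
import numpy as np
import scipy.sparse as sps

# The three CSR arrays from the 5x5 example, written out by hand.
data = np.array([0.46531486, 0.03849468, 0.51743202, 0.67028033, 0.9967058])
indices = np.array([1, 2, 3, 1, 4])
indptr = np.array([0, 3, 4, 4, 5, 5])

def row_entries(r):
    """Hypothetical helper: (column indices, values) of the non-zeros in row r."""
    start, stop = indptr[r], indptr[r + 1]
    return indices[start:stop], data[start:stop]

cols, vals = row_entries(3)       # fourth row: one entry, in column 4
empty_cols, _ = row_entries(2)    # third row: indptr[2] == indptr[3], so nothing

# The same three arrays are all scipy needs to rebuild the matrix.
a = sps.csr_matrix((data, indices, indptr), shape=(5, 5))
```

Note that an all-zero row costs nothing in data or indices; it only shows up as two equal consecutive values in indptr.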
You can store these three arrays on disk and access them as memmaps, then retrieve rows m through n-1 (that is, the slice m:n) as follows:
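ip = indptr[m:n+1].copy()   # row pointers for the slice; copy so the original is not modified
d = data[ip[0]:ip[-1]]      # non-zero values of rows m..n-1
i = indices[ip[0]:ip[-1]]   # their column indices
ip -= ip[0]                 # rebase the pointers so they start at 0
# Without shape=, the column count is inferred from the largest index in i.
rows = sps.csr_matrix((d, i, ip))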
As a general proof of concept:
>>> c = sps.rand(1000, 10, density=0.5, format='csr')
>>> ip = c.indptr[20:25+1].copy()
>>> d = c.data[ip[0]:ip[-1]]
>>> i = c.indices[ip[0]:ip[-1]]
>>> ip -= ip[0]
>>> rows = sps.csr_matrix((d, i, ip))
>>> rows.toarray()
array([[ 0. , 0. , 0. , 0. , 0.55683501,
0.61426248, 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0.67789204, 0. , 0.71821363,
0.01409666, 0. , 0. , 0.58965142, 0. ],
[ 0. , 0. , 0. , 0.1575835 , 0.08172986,
0.41741147, 0.72044269, 0. , 0.72148343, 0. ],
[ 0. , 0.73040998, 0.81507086, 0.13405909, 0. ,
0. , 0.82930945, 0.71799358, 0.8813616 , 0.51874795],
[ 0.43353831, 0.00658204, 0. , 0. , 0. ,
0.10863725, 0. , 0. , 0. , 0.57231074]])
>>> c[20:25].toarray()
array([[ 0. , 0. , 0. , 0. , 0.55683501,
0.61426248, 0. , 0. , 0. , 0. ],
[ 0. , 0. , 0.67789204, 0. , 0.71821363,
0.01409666, 0. , 0. , 0.58965142, 0. ],
[ 0. , 0. , 0. , 0.1575835 , 0.08172986,
0.41741147, 0.72044269, 0. , 0.72148343, 0. ],
[ 0. , 0.73040998, 0.81507086, 0.13405909, 0. ,
0. , 0.82930945, 0.71799358, 0.8813616 , 0.51874795],
[ 0.43353831, 0.00658204, 0. , 0. , 0. ,
0.10863725, 0. , 0. , 0. , 0.57231074]])
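The session above slices in-memory arrays. A rough sketch of the on-disk variant could look like the following, assuming the three arrays are saved as separate .npy files and reopened with np.load(..., mmap_mode='r'). The helper names save_csr and load_rows and the file naming scheme are invented for illustration; the real details (file layout, dtypes, error handling) are the part you would have to work out yourself.

```python
import os
import tempfile

import numpy as np
import scipy.sparse as sps

def save_csr(mat, prefix):
    """Hypothetical helper: write the three CSR arrays to separate .npy files."""
    np.save(prefix + "_data.npy", mat.data)
    np.save(prefix + "_indices.npy", mat.indices)
    np.save(prefix + "_indptr.npy", mat.indptr)

def load_rows(prefix, m, n, n_cols):
    """Hypothetical helper: return rows m..n-1 as a CSR matrix.

    The files are opened as read-only memmaps, so only the slices
    that are actually needed get read from disk.
    """
    data = np.load(prefix + "_data.npy", mmap_mode="r")
    indices = np.load(prefix + "_indices.npy", mmap_mode="r")
    indptr = np.load(prefix + "_indptr.npy", mmap_mode="r")

    ip = np.array(indptr[m:n + 1])        # in-memory copy, safe to modify
    d = np.array(data[ip[0]:ip[-1]])      # values of the requested rows
    i = np.array(indices[ip[0]:ip[-1]])   # their column indices
    ip -= ip[0]                           # rebase pointers to start at 0
    # Pass the shape explicitly so trailing all-zero columns are kept.
    return sps.csr_matrix((d, i, ip), shape=(n - m, n_cols))

c = sps.random(1000, 10, density=0.5, format="csr")
prefix = os.path.join(tempfile.mkdtemp(), "c")
save_csr(c, prefix)

rows = load_rows(prefix, 20, 25, c.shape[1])
same = np.array_equal(rows.toarray(), c[20:25].toarray())
```

The number of columns is not recoverable from a slice of the three arrays alone, so this sketch passes it in; you would need to store it somewhere alongside the files.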