I have a big csr_matrix(1M*1K) and I want to add over rows and obtain a new csr_matrix with the same number of columns but reduced number of rows. Actually my problem is exactly same as this Sum over rows in scipy.sparse.csr_matrix. The only thing is I find the accepted solution to be slow for my purpose. Let me state what I have
map_fn = np.random.randint(0, 10000, 1000000)
map_fn
here tells me how my input rows(1M) are mapped into my output rows(10K). For example ith input row gets added up into map_fn[i]
output row. I tried the two approaches mentioned in the above question,
namely forming a sparse matrix and using sparse sum. Although the sparse matrix approach looks way better than sparse sum approach but I find it slow for my purpose. Here is the code comparing two approaches:
import scipy.sparse
import numpy as np
import time
print "Setting up input"
s=10000
n=1000000
d=1000
density=1.0/500
X=scipy.sparse.rand(n,d,density=density,format="csr")
map_fn=np.random.randint(0, s, n)
# Approach 1
start_time=time.time()
col = scipy.arange(n)
val = np.ones(n)
S = scipy.sparse.csr_matrix( (val, (map_fn, col)), shape = (s,n))
print "Approach 1 Creation time : ",time.time()-start_time
SX = S.dot(X)
print "Approach 1 Total time : ",time.time()-start_time
#Approach 2
start_time=time.time()
SX = np.zeros((s,X.shape[1]))
for i in range(SX.shape[0]):
SX[i,:] = X[np.where(map_fn==i)[0],:].sum(axis=0)
print "Approach 2 Total time : ",time.time()-start_time
which gives following numbers:
Approach 1 Creation time : 0.187678098679
Approach 1 Total time : 0.286989927292
Approach 2 Total time : 10.208632946
So my question is this is there a better way of doing this? I find forming sparse matrix to be an overkill as it takes more than half of the time. Are there any better alternatives? Any suggestions are greatly appreciated. Thanks