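I have a sparse matrix (22000x97482) stored in csr format, and I want to delete some of its columns (the indices of the columns to delete are stored in a list).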
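If you have a very large number of columns, generating the full set of column indices can become rather expensive. A slightly faster alternative is to convert temporarily to COO format: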
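import numpy as np
from scipy import sparse

def dropcols_fancy(M, idx_to_drop):
    # Build a boolean mask over all columns, then fancy-index the columns to keep.
    idx_to_drop = np.unique(idx_to_drop)
    keep = ~np.isin(np.arange(M.shape[1]), idx_to_drop, assume_unique=True)
    return M[:, np.where(keep)[0]]

def dropcols_coo(M, idx_to_drop):
    idx_to_drop = np.unique(idx_to_drop)
    C = M.tocoo()
    # Keep only the nonzeros whose column is not being dropped.
    keep = ~np.isin(C.col, idx_to_drop)
    data, row, col = C.data[keep], C.row[keep], C.col[keep]
    # Decrement each column index by the number of dropped columns to its left.
    col = col - idx_to_drop.searchsorted(col)
    shape = (C.shape[0], C.shape[1] - len(idx_to_drop))
    # Build a new matrix instead of mutating C, which may share data with M.
    return sparse.coo_matrix((data, (row, col)), shape=shape).tocsr()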
Checking equivalence:
m, n, d = 1000, 2000, 20
M = sparse.rand(m, n, format='csr')
idx_to_drop = np.random.randint(0, n, d)
M_drop1 = dropcols_fancy(M, idx_to_drop)
M_drop2 = dropcols_coo(M, idx_to_drop)
print(np.all(M_drop1.toarray() == M_drop2.toarray()))
# True
Benchmark:
In [1]: m, n = 1000, 1000000
In [2]: %%timeit M = sparse.rand(m, n, format='csr')
...: dropcols_fancy(M, idx_to_drop)
...:
1 loops, best of 3: 1.11 s per loop
In [3]: %%timeit M = sparse.rand(m, n, format='csr')
...: dropcols_coo(M, idx_to_drop)
...:
1 loops, best of 3: 365 ms per loop
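To see why the searchsorted step in dropcols_coo yields the right new column indices, here is a tiny worked example. The arrays are made up for illustration: for each surviving column index, searchsorted on the sorted, unique idx_to_drop returns how many dropped columns lie to its left, and that count is exactly how far the index has to shift.

```python
import numpy as np

# Illustrative values only: drop columns 1 and 4 from a 7-column matrix.
idx_to_drop = np.array([1, 4])           # must be sorted and unique
col = np.array([0, 2, 3, 5, 6])          # column indices of the surviving nonzeros

# Number of dropped columns to the left of each surviving column.
shift = idx_to_drop.searchsorted(col)
new_col = col - shift

print(shift)    # [0 1 1 2 2]
print(new_col)  # [0 1 2 3 4]
```

This only works if idx_to_drop is sorted and free of duplicates, which is why both functions call np.unique first.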
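Use searchsorted instead of greater_outer for more speed. It is also a good idea to call unique(idx_to_drop) :-) - user2379410
Using searchsorted to decrement the column indices is a major improvement over broadcasting, especially when idx_to_drop is large. I wish I had thought of it myself! - ali_m
Build a new csr_matrix that contains only the columns you want to keep:
all_cols = np.arange(old_m.shape[1])
cols_to_keep = np.where(np.logical_not(np.isin(all_cols, cols_to_delete)))[0]
m = old_m[:, cols_to_keep]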
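In sparse, it is common practice to convert a matrix to the best-suited format before operating on it. - hpaulj
For a csr matrix, X[I,:] is about 10 times faster than X[:,I]. X.tocsc()[:,I] is slightly faster than X[:,I]. So if you need to slice columns frequently, converting the matrix to csc format is worth the extra step. - hpaulj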
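A minimal sketch of the CSC route suggested in the last comment. The helper name dropcols_csc and the test matrix are assumptions made for illustration and do not come from the thread. It converts to csc, selects the columns to keep, and converts back to csr.

```python
import numpy as np
from scipy import sparse

def dropcols_csc(M, idx_to_drop):
    # Hypothetical helper (not from the thread): slice columns in CSC format.
    # setdiff1d returns the sorted column indices that are NOT in idx_to_drop.
    keep = np.setdiff1d(np.arange(M.shape[1]), idx_to_drop)
    return M.tocsc()[:, keep].tocsr()

# Small made-up example; duplicates in idx_to_drop are harmless here.
M = sparse.random(50, 200, density=0.05, format='csr', random_state=0)
idx_to_drop = [3, 17, 17, 42]
M_small = dropcols_csc(M, idx_to_drop)
print(M_small.shape)  # (50, 197)
```

Whether this beats the COO approach depends on the matrix and on how many columns are dropped, so it is worth timing both on your own data.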