高效规范化Scipy稀疏矩阵的方法

Question

高效规范化Scipy稀疏矩阵的方法

32

我想编写一个函数，将大型稀疏矩阵的各行标准化（使它们加起来等于一）。

from pylab import *
import scipy.sparse as sp

def normalize(W):
    z = W.sum(0)
    z[z < 1e-6] = 1e-6
    return W / z[None,:]

w = (rand(10,10)<0.1)*rand(10,10)
w = sp.csr_matrix(w)
w = normalize(w)

然而，这会导致以下异常：

File "/usr/lib/python2.6/dist-packages/scipy/sparse/base.py", line 325, in __div__
     return self.__truediv__(other)
File "/usr/lib/python2.6/dist-packages/scipy/sparse/compressed.py", line 230, in  __truediv__
   raise NotImplementedError

有没有比较简单的解决方案？我看过这个，但我仍然不清楚如何实际执行除法。

- sterne

3

这基本上是一个重复的问题：https://dev59.com/RGfWa4cB1Zd3GeqPmPu0，因为它不关心是按行进行逐元素相乘还是相除。当然，如果有更好的答案，那就太好了 :) - seberg

我不同意，这是一个不同的问题。你指出的重复问题进行了逐元素乘法，而这个问题似乎想要将每一行除以不同的值（而不是将所有非零元素除以相同的值）。Aaron McDaid在下面提供的解决方案应该能够高效地工作（并且不需要复制任何数据）。 - conradlee

据我所知，这是 https://dev59.com/rF3Va4cB1Zd3GeqPEuFj 的一个副本。 - Emmet

6个回答

3

虽然Aaron的答案是正确的，但当我想要将数据归一化到最大的绝对值时，sklearn没有提供相应的解决方案。我的方法使用非零元素，并在csr_matrix.data数组中查找它们以快速替换其中的值。

def normalize_sparse(csr_matrix):
    nonzero_rows = csr_matrix.nonzero()[0]
    for idx in np.unique(nonzero_rows):
        data_idx = np.where(nonzero_rows==idx)[0]
        abs_max = np.max(np.abs(csr_matrix.data[data_idx]))
        if abs_max != 0:
            csr_matrix.data[data_idx] = 1./abs_max * csr_matrix.data[data_idx]

与 sunan 的方案相比，这种方法不需要将矩阵转换为密集格式（可能会引起内存问题），也不需要进行矩阵乘法。我在一个形状为（35'000，486'000）的稀疏矩阵上测试了这种方法，只用了约18秒。

- AlexConfused

你测试的 (35'000, 486'000) 矩阵的填充比是多少？ - Anton Antonov

1

1,657,114个条目是非零的，这导致在我的情况下比率约为0.0096%。 - AlexConfused

2

这是我的解决方案。

transpose A
calculate sum of each col
format diagonal matrix B with reciprocal of sum
A*B equals normalization

transpose C

import scipy.sparse as sp
import numpy as np
import math

minf = 0.0001

A = sp.lil_matrix((5,5))
b = np.arange(0,5)
A.setdiag(b[:-1], k=1)
A.setdiag(b)
print A.todense()
A = A.T
print A.todense()

sum_of_col = A.sum(0).tolist()
print sum_of_col
c = []
for i in sum_of_col:
    for j in i:
        if math.fabs(j)<minf:
            c.append(0)
        else:
            c.append(1/j)

print c

B = sp.lil_matrix((5,5))
B.setdiag(c)
print B.todense()

C = A*B
print C.todense()
C = C.T
print C.todense()

- sunan

我认为可以避免进行这么多的转置操作。我们只需要缩放行而不是列。https://dev59.com/iGct5IYBdhLWcg3wKqQd#59365444 - Abdus Khazi

1

我发现这是一种优雅的方法，可以在不使用内置函数的情况下完成。

import scipy.sparse as sp

def normalize(W):
    #Find the row scalars as a Matrix_(n,1)
    rowSumW = sp.csr_matrix(W.sum(axis=1))
    rowSumW.data = 1/rowSumW.data

    #Find the diagonal matrix to scale the rows
    rowSumW = rowSumW.transpose()
    scaling_matrix = sp.diags(rowSumW.toarray()[0])

    return scaling_matrix.dot(W)

- Abdus Khazi

0

不导入sklearn，不转换为密集矩阵，也不乘以矩阵，而是利用csr矩阵的数据表示：

from scipy.sparse import isspmatrix_csr

def normalize(W):
    """ row normalize scipy sparse csr matrices inplace.
    """
    if not isspmatrix_csr(W):
        raise ValueError('W must be in CSR format.')
    else:
        for i in range(W.shape[0]):
            row_sum = W.data[W.indptr[i]:W.indptr[i+1]].sum()
            if row_sum != 0:
                W.data[W.indptr[i]:W.indptr[i+1]] /= row_sum

记住W.indices是列索引数组， W.data是相应非零值的数组，而W.indptr指向索引和数据中的行起始位置。

如果需要L1范数，可以在求和时添加numpy.abs()，或者使用numpy.max()按每行最大值进行归一化。

- Alex B

0

我的设置：Python 3.8.10，SciPy 1.5.4

仅使用csr_array，您现在可以使用其multiply和sum方法并执行以下操作：

# w : scipy.sparse.csr_array
row_sum = w.sum(axis=1)
row_sum[row_sum == 0] = 1  # to avoid divide by zero
w = w.multiply(1. / row_sum)

- cvollmann

由于这是一个相当古老的问题，因此最好提及您的Python版本以及任何关键软件包（及其版本）。 - Ftagliacarne

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Aaron McDaid · Accepted Answer

这已经在scikit-learn的sklearn.preprocessing.normalize中实现。

from sklearn.preprocessing import normalize
w_normalized = normalize(w, norm='l1', axis=1)

axis=1 表示按行归一化，axis=0 表示按列归一化。使用可选参数 copy=False 可以就地修改矩阵。