Scipy function that can do np.diff() on a compressed sparse column matrix

Question

I want to compute the discrete difference of an identity matrix. The code below uses numpy and scipy.

import numpy as np
from scipy.sparse import identity
from scipy.sparse import csc_matrix

x = identity(4).toarray()
y = csc_matrix(np.diff(x, n=2))
print(y)

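I would like to improve performance or memory usage. Since an identity matrix contains mostly zeros, doing the calculation in compressed sparse column (csc) format would reduce memory usage. However, np.diff() does not accept the csc format, so converting between csc and the normal format with csc_matrix slows things down a bit.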

Normal format

x = identity(4).toarray()
print(x)
[[1. 0. 0. 0.]
 [0. 1. 0. 0.]
 [0. 0. 1. 0.]
 [0. 0. 0. 1.]]

CSC format

x = identity(4)
print(x)
  (0, 0)    1.0
  (1, 1)    1.0
  (2, 2)    1.0
  (3, 3)    1.0
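
A side note on the snippet above, as a minimal sketch: in the SciPy versions I am aware of, identity() returns a DIA-format matrix when no format is given, so CSC has to be requested explicitly with format="csc". The size n = 4000 is taken from the comments below; the byte counts are only a rough comparison of the three CSC arrays against a dense float64 array of the same shape.

```python
import numpy as np
from scipy.sparse import identity

n = 4000
x_default = identity(n)            # no format requested
x_csc = identity(n, format="csc")  # explicitly request CSC

# storage formats actually used by the two objects
print(x_default.format)
print(x_csc.format)

# bytes held by the three CSC arrays versus a dense float64 array of the same shape
sparse_bytes = x_csc.data.nbytes + x_csc.indices.nbytes + x_csc.indptr.nbytes
dense_bytes = n * n * np.dtype(np.float64).itemsize
print(sparse_bytes, dense_bytes)
```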

Thanks

- Andythe_great

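What is the real size of your array (assuming this (4,4) is just for demonstration)? Do you really need to use np.diff? Can't you just create the matrix from a few diagonals? - hpaulj

@hpaulj It is just for demonstration; it can go up to (4000,4000) or higher. I use np.diff() because it lets me change the value of n easily; here n=2. - Andythe_great

I think you need to examine the diff code and do your own calculation. From a sparse point of view that is not very large. Sparse matrix multiplication and ufuncs that operate on the nonzero entries are the good options. But for this kind of calculation there is not much time saving, if any. - hpaulj

In general, only the functions (and methods) of the sparse module operate correctly on sparse matrices. - hpaulj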
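
One way to read the suggestion about diagonals and sparse matrix multiplication, as a rough sketch rather than anything the commenter posted: an order-1 difference along the last axis equals a product with a rectangular operator D that has -1 on the main diagonal and +1 on the first sub-diagonal. The helper name sparse_diff is hypothetical, and the sketch assumes n is smaller than the number of columns.

```python
import numpy as np
from scipy.sparse import diags, identity

def sparse_diff(x, n=1):
    """Hypothetical helper: n-th order difference along the last axis via sparse products."""
    if n >= x.shape[1]:
        # assumption of this sketch: the empty-result case is not handled
        raise ValueError("this sketch requires n < number of columns")
    x = x.tocsc()
    for _ in range(n):
        cols = x.shape[1]
        # D[j, j] = -1 and D[j + 1, j] = +1, so (x @ D)[:, j] = x[:, j + 1] - x[:, j]
        D = diags([-1.0, 1.0], [0, -1], shape=(cols, cols - 1), format="csc")
        x = x @ D
    return x

y = sparse_diff(identity(4, format="csc"), n=2)
print(y.toarray())
```

The dense matrix is never formed, and n stays a plain parameter, which was the reason given for using np.diff() in the first place.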

2 Answers

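Here is a hacky way to get the sparse matrix you want.

L - the length of the original identity matrix,
n - the parameter of np.diff.

In your question they are:

L = 4
n = 2

My code produces the same y as yours, but without converting between csc and the normal format.

Your code:

import numpy as np
from scipy.sparse import identity, csc_matrix

x = identity(L).toarray()
y = csc_matrix(np.diff(x, n=n))

My code: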

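import numpy as np
from scipy.linalg import pascal
from scipy.sparse import csc_matrix

def get_data(n, L):
    # row n of Pascal's triangle: the binomial coefficients C(n, 0), ..., C(n, n)
    nums = pascal(n + 1, kind='lower')[-1].astype(float)
    # negate every other coefficient so that the last one stays positive:
    # start at index 0 for odd n and at index 1 for even n
    minuses_from = (n + 1) % 2
    nums[minuses_from : : 2] *= -1
    # one copy of the coefficients per output column
    return np.tile(nums, L - n)

data = get_data(n, L)
# column j holds its n + 1 coefficients in rows j, j + 1, ..., j + n
row_ind = (np.arange(n + 1) + np.arange(L - n).reshape(-1, 1)).flatten()
col_ind = np.repeat(np.arange(L - n), n + 1)

y = csc_matrix((data, (row_ind, col_ind)), shape=(L, L - n))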

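I noticed that after applying np.diff to the identity matrix n times, the values in each column are the binomial coefficients with alternating signs. This is my variable data.

Then I just construct the csc_matrix.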
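
The same observation can be sketched with scipy.sparse.diags, which places one coefficient on each diagonal. This is only an illustration under the assumption n < L; the helper name identity_diff is hypothetical, and the coefficient in row j + k of column j is taken to be (-1)^(n-k) * C(n, k).

```python
import numpy as np
from scipy.sparse import diags
from scipy.special import comb

def identity_diff(L, n):
    """Hypothetical helper: sparse equivalent of np.diff(np.eye(L), n=n), assuming n < L."""
    k = np.arange(n + 1)
    # signed binomial coefficients: (-1)^(n - k) * C(n, k)
    coeffs = (-1.0) ** (n - k) * comb(n, k)
    # coefficient k sits on the k-th sub-diagonal, i.e. at entries (j + k, j)
    return diags(coeffs.tolist(), (-k).tolist(), shape=(L, L - n), format="csc")

y = identity_diff(4, 2)
print(y.toarray())
```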

- Vladimir Fokow

- 7shoe · Accepted Answer

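Unfortunately, SciPy does not seem to provide a tool for this kind of sparse matrix manipulation. However, by cleverly manipulating the indices and data of the entries, np.diff(x,n) can be emulated in a straightforward way.

Given a 2D NumPy array (matrix) of dimension MxN, np.diff() multiplies each column (with column index y) by -1 and adds the next column (column index y+1) to it. A difference of order k is just the order-1 difference applied k times in a row. A difference of order 0 simply returns the input matrix.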
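
As a quick dense check of that description (a small sketch with made-up values, not part of the original answer), the order-1 step can be written with plain column slices and then applied to its own result:

```python
import numpy as np

a = np.array([[1., 4., 9., 16.],
              [2., 3., 5., 7.]])

# order 1: next column minus current column
first = a[:, 1:] - a[:, :-1]

# order 2: apply the order-1 step to its own result
second = first[:, 1:] - first[:, :-1]

# each comparison checks the slicing version against np.diff
print(np.array_equal(first, np.diff(a, n=1)))
print(np.array_equal(second, np.diff(a, n=2)))
print(np.array_equal(a, np.diff(a, n=0)))
```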

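The method below exploits this: in each iteration it merges duplicate entries via sum_duplicates(), reduces the number of columns by one, and filters out invalid indices.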

import numpy as np
from scipy.sparse import csc_matrix

def csc_diff(x, n):
    '''Emulates np.diff(x,n) for a sparse matrix by iteratively taking difference of order 1'''
    # use `and`, not `&`: `&` binds tighter than `==` and `>=`, which breaks these checks
    assert isinstance(x, csc_matrix) or (isinstance(x, np.ndarray) and x.ndim == 2), "Input matrix must be a 2D np.ndarray or csc_matrix."
    assert isinstance(n, int) and n >= 0, "Integer n must be larger or equal to 0."

    # order >= number of columns: np.diff returns an array with zero columns
    if n >= x.shape[1]:
        return csc_matrix(([], ([], [])), shape=(x.shape[0], 0))

    if isinstance(x, np.ndarray):
        x = csc_matrix(x)

    # set-up of data/indices via column-wise difference
    if n > 0:
        for k in range(1, n + 1):
            # extract data/indices of the stored entries of the (current) sparse matrix;
            # taking all three from the COO view keeps them aligned even when
            # explicit zeros are stored (x.nonzero() would skip them, x.data would not)
            M, N = x.shape
            coo = x.tocoo()
            idx, idy, dat = coo.row, coo.col, coo.data

            # difference: this column (y) * (-1) + next column (y+1)
            idx = np.concatenate((idx, idx))
            idy = np.concatenate((idy, idy - 1))
            dat = np.concatenate(((-1) * dat, dat))

            # filter valid indices: the result has only N-1 columns
            validInd = (0 <= idy) & (idy < N - 1)

            # x_diff: csc_matrix emulating np.diff(x,1)'s output
            x_diff = csc_matrix((dat[validInd], (idx[validInd], idy[validInd])), shape=(M, N - 1))
            x_diff.sum_duplicates()

            x = x_diff

    return x

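Furthermore, the method returns an empty csc_matrix of dimension Mx0 when the difference order is greater than or equal to the number of columns of the input matrix. This is why the outputs are identical; see: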

csc_diff(x, 2).toarray()
> array([[ 1.,  0.],
         [-2.,  1.],
         [ 1., -2.],
         [ 0.,  1.]])

which is identical to

np.diff(x.toarray(), 2)
> array([[ 1.,  0.],
         [-2.,  1.],
         [ 1., -2.],
         [ 0.,  1.]])

This identity holds for other difference orders as well.

(csc_diff(x, 0).toarray() == np.diff(x.toarray(), 0)).all()
>True

(csc_diff(x, 3).toarray() == np.diff(x.toarray(), 3)).all()
>True

(csc_diff(x, 13).toarray() == np.diff(x.toarray(), 13)).all()
>True
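
The checks above only use the identity matrix. For cross-checking on other input, here is a small independent sketch that relies on csc column slicing instead of index manipulation. It is a different technique from the answer's method; the helper name slice_diff and the random test matrix are made up for illustration, and the sketch assumes n is smaller than the number of columns.

```python
import numpy as np
from scipy.sparse import csc_matrix

def slice_diff(x, n=1):
    """Hypothetical helper: n-th order difference of a csc matrix via column slices."""
    if n >= x.shape[1]:
        # assumption of this sketch: the empty-result case is not handled
        raise ValueError("this sketch requires n < number of columns")
    for _ in range(n):
        # next column minus current column, computed on sparse slices
        x = x[:, 1:] - x[:, :-1]
    return x

# integer-valued test matrix with many zeros, so comparisons are exact
rng = np.random.default_rng(0)
dense = rng.integers(-3, 4, size=(5, 8)).astype(float)
dense[rng.random((5, 8)) < 0.6] = 0.0
x = csc_matrix(dense)

print(np.array_equal(slice_diff(x, 3).toarray(), np.diff(dense, n=3)))
```

A matrix like this produces cancellations between neighbouring columns, which the identity matrix never does, so it exercises more of the code than the checks above.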