Basic linear algebra on Spark matrices

Question

7

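I am trying to perform some basic linear algebra operations (specifically transpose, dot product, and inverse) on a matrix stored as a Spark RowMatrix, as described here (using the Python API). Following the example in the docs (in my case the matrix will have many more rows, hence the need for Spark), suppose I have a matrix like this: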

from pyspark.mllib.linalg.distributed import RowMatrix
# Create an RDD of vectors.
rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]])
# Create a RowMatrix from an RDD of vectors.
mat = RowMatrix(rows)

Given a distributed matrix like this, are there existing routines for matrix transpose and dot product, for example:

dot(mat.T,mat)

or for matrix inversion?

inverse(mat)

I can't seem to find anything about this in the docs. I need either a pointer to the relevant documentation or a way to implement it myself.

- moustachio

Do you have to store the data as a Spark RowMatrix? What you want is easy to do in pandas. - Constantino

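I suppose I could have been clearer when I said "the matrix will have many more rows". The data is too large to fit in memory (so pandas is not an option either). If it were possible, I would just use numpy arrays and matrix operations... - moustachio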

2 Answers

2

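In Spark 1.6 and later, you can perform matrix operations through the BlockMatrix class. In Spark 1.6, only multiplication and addition are available. Spark 2.0 added more operators. As of this writing, you need to implement the inverse yourself, but dot product and transpose are available (see https://github.com/apache/spark/blob/branch-2.0/python/pyspark/mllib/linalg/distributed.py#L811). Here is an example for Spark 1.6: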

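from pyspark import SparkContext
from pyspark.sql import SQLContext
from pyspark.mllib.linalg.distributed import IndexedRow, IndexedRowMatrix, BlockMatrix

sc = SparkContext()
# Pair every row with its index: (row, index)
rows = sc.parallelize([[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]) \
    .zipWithIndex()

# need a SQLContext() to generate an IndexedRowMatrix from RDD
sqlContext = SQLContext(sc)
# IndexedRow takes (index, vector), so swap the zipWithIndex pair
rows = IndexedRowMatrix(
    rows.map(lambda row: IndexedRow(row[1], row[0]))
    ).toBlockMatrix()

# The other operand must be a BlockMatrix whose row count matches
# the column count of `rows` (3 here); substitute your own matrix.
mat_product = rows.multiply(<SOME OTHER BLOCK MATRIX>)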
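
Since the inverse has to be implemented by hand, here is a minimal sketch of one possible route for a tall matrix A with full column rank: the left pseudo-inverse (A^T A)^{-1} A^T. This technique is an assumption on my part and is not part of the Spark API. It runs on plain NumPy so it can be checked locally; the helper name and the sample data are hypothetical. Note that the 4 x 3 matrix from the question has rank 2 (its third column equals twice the second minus the first), so its A^T A is singular and this approach would not apply to it directly.

```python
import numpy as np


def left_pinv_transposed_rows(rows, gram):
    """Yield G^{-1} a_i for every row a_i of A, where G = A^T A.

    Each yielded vector is one column of the left pseudo-inverse
    G^{-1} A^T, so stacking them as rows gives its transpose.
    """
    for row in rows:
        # Solve G x = a_i rather than forming G^{-1} explicitly.
        yield np.linalg.solve(gram, np.asarray(row, dtype=float))


# Hypothetical sample data with full column rank (not from the question).
rows = [[1, 0, 0], [0, 2, 0], [0, 0, 3], [1, 1, 1]]
a = np.array(rows, dtype=float)

# G is only n x n (3 x 3 here), so it fits on a single machine.
gram = a.T @ a

pinv = np.vstack(list(left_pinv_transposed_rows(rows, gram))).T  # shape 3 x 4

# Sanity checks: pinv is a left inverse of A and matches NumPy's pinv.
assert np.allclose(pinv @ a, np.eye(3))
assert np.allclose(pinv, np.linalg.pinv(a))
```

Because each row is processed independently once G is known, the same per-row step could presumably be applied with an RDD `map`, keeping only the small n x n matrix on the driver.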

- Paul Back

- zero323 · Accepted Answer

Currently (Spark 1.6.0), the pyspark.mllib.linalg.distributed API is limited to basic operations such as counting rows/columns and converting between types.

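The Scala API supports a broader set of methods, including multiplication (RowMatrix.multiply, IndexedRowMatrix.multiply), transposition, SVD (IndexedRowMatrix.computeSVD), QR decomposition (RowMatrix.tallSkinnyQR), Gramian matrix computation (computeGramianMatrix), and PCA (RowMatrix.computePrincipalComponents), which can be used to implement more complex linear algebra functions.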
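
To illustrate what a Gramian computation gives you, the product dot(mat.T, mat) from the question can be written as a sum of per-row outer products, which needs only a map and a reduce over the rows. The following is a minimal sketch in plain NumPy, not the Spark implementation; the helper names `row_outer` and `gramian` are hypothetical.

```python
from functools import reduce
import operator

import numpy as np


def row_outer(row):
    """Return the n x n outer product of a single row with itself."""
    v = np.asarray(row, dtype=float)
    return np.outer(v, v)


def gramian(rows):
    """Sum the per-row outer products; the total equals A^T A."""
    return reduce(operator.add, map(row_outer, rows))


rows = [[1, 2, 3], [4, 5, 6], [7, 8, 9], [10, 11, 12]]
gram = gramian(rows)

# Cross-check against the dense product, feasible only for small inputs.
a = np.array(rows, dtype=float)
assert np.allclose(gram, a.T @ a)
```

With an RDD of rows, the analogous call would be something like `rows_rdd.map(row_outer).reduce(operator.add)`. The result is only n x n, so it fits in driver memory whenever the number of columns is modest, however many rows there are.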