I am very new to PySpark. I have a column of SparseVectors in my PySpark dataframe.
rescaledData.select('features').show(5,False)
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|features |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
|(262144,[43953,62425,66522,148962,174441,249180],[3.9219733362813143,3.9219733362813143,1.213923135179104,3.9219733362813143,3.9219733362813143,0.5720692490067093])|
|(262144,[57925,66522,90939,249180],[3.5165082281731497,1.213923135179104,3.9219733362813143,0.5720692490067093]) |
|(262144,[23366,45531,73408,211290],[2.6692103677859462,3.005682604407159,3.5165082281731497,3.228826155721369]) |
|(262144,[30913,81939,99546,137643,162885,249180],[3.228826155721369,3.9219733362813143,3.005682604407159,3.005682604407159,3.228826155721369,1.1441384980134186]) |
|(262144,[108134,152329,249180],[3.9219733362813143,2.6692103677859462,2.8603462450335466]) |
+--------------------------------------------------------------------------------------------------------------------------------------------------------------------+
This dataframe needs to be converted into a matrix in which each row of the matrix corresponds to the SparseVector in that row of the dataframe. For example:
+-----------------+
|features |
+-----------------+
|(7,[1,2],[45,63])|
|(7,[3,5],[85,69])|
|(7,[1,2],[89,56])|
+-----------------+
must be converted to
[[0,45,63,0,0,0,0]
[0,0,0,85,0,69,0]
[0,89,56,0,0,0,0]]
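The conversion above can be sketched in plain NumPy, without a Spark session: each `(size, indices, values)` triplet becomes a dense row of zeros with the values scattered at the given indices (this is exactly what densifying a sparse vector means; the function name `sparse_to_dense` is mine, for illustration).

```python
import numpy as np

def sparse_to_dense(size, indices, values):
    """Densify one (size, indices, values) triplet into a full row."""
    row = np.zeros(size)
    row[list(indices)] = values  # scatter the stored values at their indices
    return row

# The three toy rows from the example dataframe above
rows = [
    (7, [1, 2], [45, 63]),
    (7, [3, 5], [85, 69]),
    (7, [1, 2], [89, 56]),
]
matrix = np.array([sparse_to_dense(*r) for r in rows])
print(matrix)
```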
I have read the link below, which shows there is a function called toArray() that does exactly what I need.
https://mingchen0919.github.io/learning-apache-spark/pyspark-vectors.html
However, I am having trouble using it:
vector_udf = udf(lambda vector: vector.toArray())
rescaledData.withColumn('features_', vector_udf(rescaledData.features)).first()
I need to convert each row into an array, and then convert the PySpark dataframe into a matrix.
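A likely cause of the trouble: `udf` defaults to a StringType return type, and Spark cannot serialize the NumPy array that `toArray()` returns as a column value; the usual fix is to return a plain Python list and declare the return type as `ArrayType(DoubleType())`. Below is a minimal sketch of that conversion function, using a hypothetical stand-in class (so it runs without a Spark session); the real PySpark wiring is shown in the comments.

```python
import numpy as np

class FakeSparseVector:
    """Hypothetical stand-in mimicking pyspark.ml.linalg.SparseVector.toArray()."""
    def __init__(self, size, indices, values):
        self.size, self.indices, self.values = size, indices, values

    def toArray(self):
        arr = np.zeros(self.size)
        arr[list(self.indices)] = self.values
        return arr

def to_array_list(vector):
    # Return a plain Python list of floats -- a raw NumPy array is not a
    # type Spark's udf machinery can serialize into a dataframe column.
    return vector.toArray().tolist()

# In PySpark this function would be wrapped with an explicit return type:
#   from pyspark.sql.functions import udf
#   from pyspark.sql.types import ArrayType, DoubleType
#   vector_udf = udf(to_array_list, ArrayType(DoubleType()))
#   withArrays = rescaledData.withColumn('features_', vector_udf('features'))
# and the matrix built by collecting the array column, e.g.
#   np.array(withArrays.select('features_').rdd.map(lambda r: r[0]).collect())

dense = to_array_list(FakeSparseVector(7, [1, 2], [45, 63]))
print(dense)
```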