Spark 2.2.0 adds correlation support for DataFrames. For more information, see the pull request.

New algorithms in the DataFrame-based API for MLlib:

SPARK-19636: Correlation in the DataFrame-based API (Scala/Java/Python)

However, it is completely unclear how to use this change, or what has changed compared to the previous version.

I was expecting something like:
df_num = spark.read.parquet('/dataframe')
df_num.printSchema()
df_num.show()
df_num.corr(col1='features', col2='fail_mode_meas')
root
|-- features: vector (nullable = true)
|-- fail_mode_meas: double (nullable = true)
+--------------------+--------------+
| features|fail_mode_meas|
+--------------------+--------------+
|[0.0,0.5,0.0,0.0,...| 22.7|
|[0.9,0.0,0.7,0.0,...| 0.1|
|[0.0,5.1,1.0,0.0,...| 2.0|
|[0.0,0.0,0.0,0.0,...| 3.1|
|[0.1,0.0,0.0,1.7,...| 0.0|
...
pyspark.sql.utils.IllegalArgumentException: 'requirement failed: Currently correlation calculation for columns with dataType org.apache.spark.ml.linalg.VectorUDT not supported.'
Can someone explain how to make use of the new Spark 2.2.0 feature for correlation on DataFrames?
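
From skimming the SPARK-19636 pull request, my best guess is that the new functionality lives in pyspark.ml.stat.Correlation rather than in DataFrame.corr, and that it computes a correlation matrix over a single vector column. Below is a rough sketch of what I think the call would look like; the column names reuse my schema above, and 'all_features' is just a name I made up for the assembled column:

# Sketch only: my guess at the new DataFrame-based correlation API from SPARK-19636.
from pyspark.ml.stat import Correlation
from pyspark.ml.feature import VectorAssembler

# 'fail_mode_meas' is a plain double, so it presumably has to be appended to the
# vector column first; Correlation.corr seems to operate on one vector column.
assembler = VectorAssembler(inputCols=['features', 'fail_mode_meas'],
                            outputCol='all_features')
df_vec = assembler.transform(df_num)

# Returns a one-row DataFrame whose single cell is the correlation matrix.
corr_df = Correlation.corr(df_vec, column='all_features', method='pearson')
corr_matrix = corr_df.head()[0]
print(corr_matrix)

Is this the intended usage, and is assembling the scalar column into the vector really required?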