How to use the random forest algorithm in a Spark Pipeline?

Question

Tags: apache-spark, apache-spark-mllib, pipeline, random-forest, apache-spark-ml

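I want to tune a model in Spark using grid search and cross-validation. In Spark, the base model has to be placed in a pipeline. The official pipeline demo uses LogisticRegression as the base model, which client code can instantiate with new. A RandomForest model, however, cannot be instantiated by client code, so it seems that RandomForest cannot be used with the pipeline API. I don't want to reinvent the wheel, so can anybody give some advice? Thanks.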

- bourneli

1 Answer

Content provided by Stack Overflow.

- zero323 · Accepted Answer

However, a RandomForest model cannot be instantiated by client code, so it seems that RandomForest cannot be used with the pipeline API.

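That is true, but you are simply trying to use the wrong class. Instead of mllib.tree.RandomForest you should use ml.classification.RandomForestClassifier. Here is an example based on the one from the MLlib docs.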

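// Written against the Spark 1.x API (sqlContext, mllib.linalg.Vector).
// In Spark 2.0+ the ml package expects org.apache.spark.ml.linalg.Vector instead.
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.mllib.linalg.Vector
import org.apache.spark.mllib.util.MLUtils
import sqlContext.implicits._

// Schema for the DataFrame: the label is kept as a string so StringIndexer can index it.
case class Record(category: String, features: Vector)

// Load the sample data as an RDD[LabeledPoint] and split it 70/30.
val data = MLUtils.loadLibSVMFile(sc, "data/mllib/sample_libsvm_data.txt")
val splits = data.randomSplit(Array(0.7, 0.3))
val (trainData, testData) = (splits(0), splits(1))

val trainDF = trainData.map(lp => Record(lp.label.toString, lp.features)).toDF
val testDF = testData.map(lp => Record(lp.label.toString, lp.features)).toDF

// Map the string categories to a "label" column that carries nominal metadata.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")

val rf = new RandomForestClassifier()
  .setNumTrees(3)
  .setFeatureSubsetStrategy("auto")
  .setImpurity("gini")
  .setMaxDepth(4)
  .setMaxBins(32)

// Chain both stages so they are fitted and applied together.
val pipeline = new Pipeline()
  .setStages(Array(indexer, rf))

val model = pipeline.fit(trainDF)

// Returns a DataFrame with the predictions added.
model.transform(testDF)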

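There is one thing I can't figure out here. As far as I can tell, it should be possible to use the labels extracted from LabeledPoints directly, but for some reason it doesn't work and pipeline.fit raises an IllegalArgumentException: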

RandomForestClassifier was given input with invalid label column label, without the number of classes specified.

Hence the ugly trick with StringIndexer. After applying it we get the required attributes ({"vals":["1.0","0.0"],"type":"nominal","name":"label"}), but some classes in ml seem to work just fine without it.
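
Since the question asks about grid search and cross-validation, here is a minimal sketch of how the same two-stage pipeline could be handed to CrossValidator. It assumes Spark 1.5 or later (for MulticlassClassificationEvaluator) and a spark-shell session. The grid values, the fold count and the tune helper are illustrative assumptions, not something taken from the question or the docs example.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.tuning.{CrossValidator, CrossValidatorModel, ParamGridBuilder}
import org.apache.spark.sql.DataFrame

// Same two stages as in the answer: index the string labels, then fit the forest.
val indexer = new StringIndexer()
  .setInputCol("category")
  .setOutputCol("label")
val rf = new RandomForestClassifier()
val pipeline = new Pipeline().setStages(Array(indexer, rf))

// Hypothetical grid: 2 x 2 = 4 candidate settings; the values are illustrative only.
val paramGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(3, 10))
  .addGrid(rf.maxDepth, Array(4, 8))
  .build()

// Scores each candidate on the held-out fold using the evaluator's default metric.
val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")

// The whole pipeline is the estimator, so the indexer is refitted on every fold.
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(3)

// Hypothetical helper: expects a DataFrame with "category" and "features" columns.
def tune(train: DataFrame): CrossValidatorModel = cv.fit(train)
```

Calling tune(trainDF) with the training DataFrame from the example should return a CrossValidatorModel whose bestModel is the pipeline fitted with the best-scoring parameter combination; calling transform(testDF) on the returned model applies it to the test data.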