Spark random forest binary classifier metrics

Question

scala apache-spark apache-spark-mllib

6

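How can I get model metrics (F score, AUROC, AUPRC, etc.) when training a random forest binary classifier model in Spark MLlib?

The problem is that BinaryClassificationMetrics expects probabilities, whereas the predict method of the random forest classifier returns discrete values of 0 or 1.

See: https://spark.apache.org/docs/latest/mllib-evaluation-metrics.html#binary-classification

The model returned by RandomForest.trainClassifier has no clearThreshold method, which would make it return probabilities instead of discrete 0 or 1 labels.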
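
To make the difficulty concrete, here is a minimal plain-Scala sketch (no Spark involved). The object name `AurocSketch`, the helpers `auroc` and `hardened`, and the six scored examples are hypothetical, chosen only for illustration. It computes AUROC as the fraction of positive/negative pairs that are ranked correctly. It shows that thresholding scores to 0 or 1 first discards ranking information, so the value computed from hard labels differs from the one computed from scores.

```scala
object AurocSketch {
  // Hypothetical (score for class 1, true label) pairs, for illustration only.
  val scored: Seq[(Double, Double)] =
    Seq((0.9, 1.0), (0.6, 1.0), (0.55, 0.0), (0.4, 1.0), (0.2, 0.0), (0.1, 0.0))

  // AUROC as the probability that a random positive example outranks
  // a random negative one; ties count as half a win.
  def auroc(scoresAndLabels: Seq[(Double, Double)]): Double = {
    val pos = scoresAndLabels.collect { case (s, l) if l == 1.0 => s }
    val neg = scoresAndLabels.collect { case (s, l) if l == 0.0 => s }
    require(pos.nonEmpty && neg.nonEmpty, "need at least one example of each class")
    val wins = for (p <- pos; n <- neg) yield
      if (p > n) 1.0 else if (p == n) 0.5 else 0.0
    wins.sum / (pos.size * neg.size)
  }

  // Replace each score by a hard 0/1 label, as a predict method
  // that only returns classes would.
  def hardened(scoresAndLabels: Seq[(Double, Double)], threshold: Double = 0.5): Seq[(Double, Double)] =
    scoresAndLabels.map { case (s, l) => (if (s >= threshold) 1.0 else 0.0, l) }

  def main(args: Array[String]): Unit = {
    println(f"AUROC from scores      = ${auroc(scored)}%.3f")           // 8/9
    println(f"AUROC from hard labels = ${auroc(hardened(scored))}%.3f") // 6/9
  }
}
```

With hard labels the ROC curve has a single operating point, so the area reflects only that one threshold rather than the ranking quality of the model.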

- Răzvan Flavius Panda

2

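Possible duplicate of [Spark 1.5.1, MLLib random forest probability]. - eliasah

@eliasah It is not a duplicate, but the answer there provides a solution to this question. I had already added it to my answer before you commented. - Răzvan Flavius Panda

No problem. That is why using the word "possible" is fine. - eliasah

@eliasah That question is not actually a duplicate, since it does not ask about metrics. The answer there does point to the new ml API, which helps in finding a solution. Please see the updated answer, which includes the Apache documentation example adapted to this question. - Răzvan Flavius Panda

1 Answer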

- Răzvan Flavius Panda · Accepted Answer

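To get probabilities, we need to use the new DataFrame-based ml API instead of the RDD-based mllib API.

Update: Below is the updated example from the Spark documentation, using BinaryClassificationEvaluator and showing the metrics area under the receiver operating characteristic curve (AUROC) and area under the precision-recall curve (AUPRC).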

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.BinaryClassificationEvaluator
import org.apache.spark.ml.feature.{IndexToString, StringIndexer, VectorIndexer}

// Load and parse the data file, converting it to a DataFrame.
val data = sqlContext.read.format("libsvm").load("D:/Sources/spark/data/mllib/sample_libsvm_data.txt")

// Index labels, adding metadata to the label column.
// Fit on whole dataset to include all labels in index.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
  .fit(data)

// Automatically identify categorical features, and index them.
// Set maxCategories so features with > 4 distinct values are treated as continuous.
val featureIndexer = new VectorIndexer()
  .setInputCol("features")
  .setOutputCol("indexedFeatures")
  .setMaxCategories(4)
  .fit(data)

// Split the data into training and test sets (30% held out for testing)
val Array(trainingData, testData) = data.randomSplit(Array(0.7, 0.3))

// Train a RandomForest model.
val rf = new RandomForestClassifier()
  .setLabelCol("indexedLabel")
  .setFeaturesCol("indexedFeatures")
  .setNumTrees(10)

// Convert indexed labels back to original labels.
val labelConverter = new IndexToString()
  .setInputCol("prediction")
  .setOutputCol("predictedLabel")
  .setLabels(labelIndexer.labels)

// Chain indexers and forest in a Pipeline
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, featureIndexer, rf, labelConverter))

// Train model.  This also runs the indexers.
val model = pipeline.fit(trainingData)

// Make predictions.
val predictions = model.transform(testData)

// Select example rows to display.
predictions
  .select("indexedLabel", "rawPrediction", "prediction")
  .show()

// Evaluate the rawPrediction column against the indexed labels.
val binaryClassificationEvaluator = new BinaryClassificationEvaluator()
  .setLabelCol("indexedLabel")
  .setRawPredictionCol("rawPrediction")

// Print one metric by name ("areaUnderROC" or "areaUnderPR").
def printlnMetric(metricName: String): Unit = {
  println(metricName + " = " + binaryClassificationEvaluator.setMetricName(metricName).evaluate(predictions))
}

printlnMetric("areaUnderROC")
printlnMetric("areaUnderPR")
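
The example above prints only areaUnderROC and areaUnderPR, whereas the question also asks for an F score. The following is a hedged sketch and not part of the original answer: once per-row scores for class 1 are available, the F1 measure at a chosen threshold follows from precision and recall. The plain-Scala example (no Spark; `FMeasureSketch`, `f1At`, and the sample data are hypothetical names and values) illustrates the computation on a small in-memory sequence.

```scala
object FMeasureSketch {
  // Hypothetical (score for class 1, true label) pairs, for illustration only.
  val scored: Seq[(Double, Double)] =
    Seq((0.9, 1.0), (0.6, 1.0), (0.55, 0.0), (0.4, 1.0), (0.2, 0.0), (0.1, 0.0))

  // F1 for the positive class (label 1.0), treating every example whose
  // score is at or above the threshold as a positive prediction.
  def f1At(threshold: Double, scoresAndLabels: Seq[(Double, Double)]): Double = {
    val predictedPos = scoresAndLabels.filter { case (s, _) => s >= threshold }
    val truePos = predictedPos.count { case (_, l) => l == 1.0 }
    val actualPos = scoresAndLabels.count { case (_, l) => l == 1.0 }
    if (truePos == 0) 0.0 // no true positives: define F1 as 0 to avoid dividing by zero
    else {
      val precision = truePos.toDouble / predictedPos.size
      val recall = truePos.toDouble / actualPos
      2 * precision * recall / (precision + recall)
    }
  }

  def main(args: Array[String]): Unit = {
    Seq(0.3, 0.5, 0.7).foreach { t =>
      println(f"F1 at threshold $t%.1f = ${f1At(t, scored)}%.3f")
    }
  }
}
```

Sweeping the threshold in this way is only possible with scores or probabilities; with hard 0/1 predictions there is a single F1 value.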