How to cross validate a Random Forest model?


I want to evaluate a random forest being trained on some data. Is there any utility in Apache Spark to do the same, or do I have to perform cross validation manually?

2 Answers


ML provides a CrossValidator class which can be used to perform cross-validation and parameter search. Assuming your data is already preprocessed, you can add cross validation as follows:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.classification.RandomForestClassifier
import org.apache.spark.ml.evaluation.MulticlassClassificationEvaluator

// [label: double, features: vector]
val trainingData: org.apache.spark.sql.DataFrame = ???
val nFolds: Int = ???
val numTrees: Int = ???
val metric: String = ???

val rf = new RandomForestClassifier()
  .setLabelCol("label")
  .setFeaturesCol("features")
  .setNumTrees(numTrees)

val pipeline = new Pipeline().setStages(Array(rf)) 

val paramGrid = new ParamGridBuilder().build() // No parameter search

val evaluator = new MulticlassClassificationEvaluator()
  .setLabelCol("label")
  .setPredictionCol("prediction")
  // "f1" (default), "weightedPrecision", "weightedRecall", "accuracy"
  .setMetricName(metric) 

val cv = new CrossValidator()
  // ml.Pipeline with ml.classification.RandomForestClassifier
  .setEstimator(pipeline)
  // ml.evaluation.MulticlassClassificationEvaluator
  .setEvaluator(evaluator) 
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(nFolds)

val model = cv.fit(trainingData) // trainingData: DataFrame
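
The empty ParamGridBuilder above disables the search. If you do want CrossValidator to search over hyperparameters, a minimal sketch could look like the following; it reuses rf, pipeline, evaluator, nFolds and trainingData from above, and the grid values are arbitrary examples, not recommendations:

// Hedged sketch: an actual parameter search over the same pipeline.
// Grid values are illustrative only.
val searchGrid = new ParamGridBuilder()
  .addGrid(rf.numTrees, Array(20, 50, 100))
  .addGrid(rf.maxDepth, Array(5, 10))
  .build()

val searchCv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(searchGrid)
  .setNumFolds(nFolds)

val searchModel = searchCv.fit(trainingData)

// Average metric per ParamMap, in the same order as searchGrid
searchGrid.zip(searchModel.avgMetrics).foreach { case (params, avg) =>
  println(s"$params -> $avg")
}

// Best pipeline according to the evaluator
val bestModel = searchModel.bestModel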

The same thing using PySpark:

from pyspark.ml import Pipeline
from pyspark.ml.classification import RandomForestClassifier
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
from pyspark.ml.evaluation import MulticlassClassificationEvaluator

trainingData = ... # DataFrame[label: double, features: vector]
numFolds = ... # Integer

rf = RandomForestClassifier(labelCol="label", featuresCol="features")
evaluator = MulticlassClassificationEvaluator() # + other params as in Scala    

pipeline = Pipeline(stages=[rf])
paramGrid = (ParamGridBuilder()
    .addGrid(rf.numTrees, [3, 10])
    .addGrid(...)  # Add other parameters
    .build())

crossval = CrossValidator(
    estimator=pipeline,
    estimatorParamMaps=paramGrid,
    evaluator=evaluator,
    numFolds=numFolds)

model = crossval.fit(trainingData)

Are you sure this works for leave-one-out? The underlying kFold() call doesn't seem to deterministically return two folds of length N-1 and 1. When I run the code above with a RegressionEvaluator and a Lasso model, I get: Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Nothing has been added to this summarizer. - paradiso
No, I'm fairly sure it doesn't. MLUtils.kFold uses BernoulliCellSampler to determine the splits. On the other hand, leave-one-out cross validation in Spark would probably be too expensive to be practical anyway. - zero323
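
(For context, a rough sketch of the manual route mentioned in the question: MLUtils.kFold returns (training, validation) RDD pairs, one per fold. The points variable here is hypothetical.)

// Manual k-fold split, assuming an RDD[LabeledPoint] named `points` (hypothetical)
import org.apache.spark.mllib.regression.LabeledPoint
import org.apache.spark.mllib.util.MLUtils
import org.apache.spark.rdd.RDD

val points: RDD[LabeledPoint] = ???
val folds = MLUtils.kFold(points, 5, 42)
folds.zipWithIndex.foreach { case ((training, validation), i) =>
  println(s"fold $i: training=${training.count()}, validation=${validation.count()}")
}
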
Hi @zero323, when you set a metric in the Evaluator object, e.g. .setMetricName("precision"), my question is: how can I get those metrics computed during the training process? (See this question: http://stackoverflow.com/questions/37778532/how-to-get-precision-recall-using-crossvalidator-for-training-naivebayes-model-u) - dbustosp
Hey @zero323, do I need to split the data into training/test sets when using cross validation? Since CV trains and tests over several folds, it should already give the average training/test accuracy across, say, five folds, right? Or am I completely wrong here? - other15
@zero323 I think you should change "precision" to "accuracy", per https://issues.apache.org/jira/browse/SPARK-15771. - user299791


Building on zero323's great answer, here is a similar example using a random forest regressor:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.tuning.{ParamGridBuilder, CrossValidator}
import org.apache.spark.ml.regression.RandomForestRegressor // CHANGED
import org.apache.spark.ml.evaluation.RegressionEvaluator // CHANGED
import org.apache.spark.ml.feature.{VectorAssembler, VectorIndexer}

val numFolds: Int = ???
val data: org.apache.spark.sql.DataFrame = ???

// Training (80%) and test data (20%)
val Array(train, test) = data.randomSplit(Array(0.8, 0.2))

// Use every column except the label as a raw feature
val featuresCols = data.columns.filter(_ != "events")

val va = new VectorAssembler()
  .setInputCols(featuresCols)
  .setOutputCol("rawFeatures")

val vi = new VectorIndexer()
  .setInputCol("rawFeatures")
  .setOutputCol("features")
  .setMaxCategories(5)

val regressor = new RandomForestRegressor()
  .setLabelCol("events")
  .setFeaturesCol("features")

val metric = "rmse"
val evaluator = new RegressionEvaluator()
  .setLabelCol("events")
  .setPredictionCol("prediction")
  //     "rmse" (default): root mean squared error
  //     "mse": mean squared error
  //     "r2": R2 metric
  //     "mae": mean absolute error 
  .setMetricName(metric) 

// Assemble and index the features, then fit the regressor, in a single pipeline
val pipeline = new Pipeline().setStages(Array(va, vi, regressor))

val paramGrid = new ParamGridBuilder().build() // No parameter search
val cv = new CrossValidator()
  .setEstimator(pipeline)
  .setEvaluator(evaluator)
  .setEstimatorParamMaps(paramGrid)
  .setNumFolds(numFolds)

val model = cv.fit(train) // train: DataFrame
val predictions = model.transform(test)
predictions.show
val rmse = evaluator.evaluate(predictions)
println(rmse)
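
To look at what cross validation actually selected, a small sketch (assuming the pipeline estimator above, so bestModel is a PipelineModel whose last stage is the fitted forest):

import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.regression.RandomForestRegressionModel

// Average metric per ParamMap (a single entry here, since the grid is empty)
println(model.avgMetrics.mkString(", "))

// Best pipeline found by cross validation; its last stage is the fitted forest
val bestPipeline = model.bestModel.asInstanceOf[PipelineModel]
val forest = bestPipeline.stages.last.asInstanceOf[RandomForestRegressionModel]
println(forest.featureImportances)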

Evaluator metric source: https://spark.apache.org/docs/latest/api/scala/#org.apache.spark.ml.evaluation.RegressionEvaluator

