How to extract model hyperparameters from spark.ml in PySpark?

38

I'm tinkering with the cross-validation code from the PySpark documentation, and trying to get PySpark to tell me which model was selected:

from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.mllib.linalg import Vectors
from pyspark.ml.tuning import ParamGridBuilder, CrossValidator

dataset = sqlContext.createDataFrame(
    [(Vectors.dense([0.0]), 0.0),
     (Vectors.dense([0.4]), 1.0),
     (Vectors.dense([0.5]), 0.0),
     (Vectors.dense([0.6]), 1.0),
     (Vectors.dense([1.0]), 1.0)] * 10,
    ["features", "label"])
lr = LogisticRegression()
grid = ParamGridBuilder().addGrid(lr.regParam, [0.1, 0.01, 0.001, 0.0001]).build()
evaluator = BinaryClassificationEvaluator()
cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, evaluator=evaluator)
cvModel = cv.fit(dataset)

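Running this in the PySpark shell, I can get the coefficients of the fitted logistic regression model, but I can't seem to find the value of lr.regParam selected by the cross-validation procedure. Any ideas?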

In [3]: cvModel.bestModel.coefficients
Out[3]: DenseVector([3.1573])

In [4]: cvModel.bestModel.explainParams()
Out[4]: ''

In [5]: cvModel.bestModel.extractParamMap()
Out[5]: {}

In [15]: cvModel.params
Out[15]: []

In [36]: cvModel.bestModel.params
Out[36]: []

4
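Related question for the Spark Scala API: How to extract best parameters from a CrossValidatorModel. - desertnaut
See the PySpark answer here: https://dev59.com/8Zrga4cB1Zd3GeqPsc9z - marilena.oita
Please make sure to mark an answer (wernerchao's below worked for me). - Ross Lewis
I'll take your word for it, although this project is a distant memory for me now... - Paul
8 Answers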

45

I ran into this problem as well. I found out that you need to go through the underlying Java object, although I don't know why. So just do this:

from pyspark.ml.tuning import TrainValidationSplit, ParamGridBuilder, CrossValidator
from pyspark.ml.regression import LinearRegression
from pyspark.ml.evaluation import RegressionEvaluator

evaluator = RegressionEvaluator(metricName="mae")
lr = LinearRegression()
grid = ParamGridBuilder().addGrid(lr.maxIter, [500]) \
                                .addGrid(lr.regParam, [0]) \
                                .addGrid(lr.elasticNetParam, [1]) \
                                .build()
lr_cv = CrossValidator(estimator=lr, estimatorParamMaps=grid, \
                        evaluator=evaluator, numFolds=3)
lrModel = lr_cv.fit(your_training_set_here)
bestModel = lrModel.bestModel

Print out the parameters you want:

>>> print('Best Param (regParam): ', bestModel._java_obj.getRegParam())
0
>>> print('Best Param (MaxIter): ', bestModel._java_obj.getMaxIter())
500
>>> print('Best Param (elasticNetParam): ', bestModel._java_obj.getElasticNetParam())
1

This applies to other methods as well, such as extractParamMap(). They should fix this soon.


6
Nice catch. Better than a fix would be a feature like cvModel.getAllTheBestModelsParametersPlease(). - George Fisher
11
The answer did not work for me. The correct one is: modelOnly.bestModel.stages[-1]._java_obj.parent().getRegParam(). If you are not using a pipeline, drop stages[-1]. - Lynn Chen

11

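This might not be as good as wernerchao's answer (because it isn't convenient for storing the hyperparameters in variables), but you can quickly look at the best hyperparameters of a cross-validation model this way: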

import numpy as np

cvModel.getEstimatorParamMaps()[np.argmax(cvModel.avgMetrics)]
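
One caveat with the one-liner above: np.argmax only picks the right entry when a larger metric is better (areaUnderROC, for example). For error metrics such as mae or rmse the smallest average wins. Here is a minimal sketch of a direction-aware helper. The name best_param_map is made up for illustration, the plain dicts stand in for the {Param: value} maps returned by getEstimatorParamMaps(), and the metric values are invented. In real code the flag would presumably come from evaluator.isLargerBetter().

```python
def best_param_map(param_maps, avg_metrics, larger_is_better=True):
    """Return (param_map, metric) for the grid entry with the best average metric."""
    if len(param_maps) != len(avg_metrics):
        raise ValueError("param_maps and avg_metrics must have the same length")
    pick = max if larger_is_better else min
    best_index = pick(range(len(avg_metrics)), key=avg_metrics.__getitem__)
    return param_maps[best_index], avg_metrics[best_index]


# Stand-in data (assumption): with a fitted CrossValidatorModel you would pass
# cvModel.getEstimatorParamMaps(), cvModel.avgMetrics and evaluator.isLargerBetter().
grid = [{"regParam": 0.1}, {"regParam": 0.01}, {"regParam": 0.001}]
auc_per_entry = [0.71, 0.78, 0.74]  # invented areaUnderROC values
mae_per_entry = [1.9, 1.5, 1.2]     # invented mean absolute errors

best_by_auc = best_param_map(grid, auc_per_entry, larger_is_better=True)
best_by_mae = best_param_map(grid, mae_per_entry, larger_is_better=False)
print(best_by_auc)
print(best_by_mae)
```

Using plain max/min over the indices also avoids the numpy import when all you need is one lookup.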

4
Assuming cvModel3Day is the name of your model, the parameters can be extracted as follows in Spark Scala:
import org.apache.spark.ml.PipelineModel
import org.apache.spark.ml.classification.GBTClassificationModel

// Cast once and reuse: stage 2 of the best pipeline is the fitted GBT classifier.
val gbtModel = cvModel3Day.bestModel.asInstanceOf[PipelineModel].stages(2).asInstanceOf[GBTClassificationModel]

val params = gbtModel.extractParamMap()
val depth = gbtModel.getMaxDepth
val iter = gbtModel.getMaxIter
val bins = gbtModel.getMaxBins
val features = gbtModel.getFeaturesCol
val step = gbtModel.getStepSize
val samplingRate = gbtModel.getSubsamplingRate

3
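I also banged my head against this wall. Unfortunately you can only get specific parameters for specific models. For logistic regression you can access the intercept and the weights, but sadly you can't retrieve regParam. This can be done in the following way: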
best_lr = cv.bestModel

#get weights
best_lr.weights
>>>DenseVector([3.1573])

#or better
best_lr.coefficients
>>>DenseVector([3.1573])

#get intercept
best_lr.intercept
>>>-1.0829958115287153

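As I wrote above, each model has only a few parameters that can be extracted. Getting the relevant model out of a Pipeline (for example cv.bestModel when the cross-validator runs over a Pipeline) can be done with: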

best_pipeline = cv.bestModel
best_pipeline.stages
>>>[Tokenizer_4bc8884ad68b4297fd3c,CountVectorizer_411fbdeb4100c2bfe8ef, PCA_4c538d67e7b8f29ff8d0,LogisticRegression_4db49954edc7033edc76]

Each model is obtained by simple list indexing:

best_lr = best_pipeline.stages[3]

Now the above can be applied.
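
A hard-coded index like stages[3] breaks as soon as a stage is added or removed, so looking the stage up by type may be more robust. This is only a sketch: first_stage_of_type is a made-up helper and the Fake* classes are stand-ins for fitted stages. With a real pipeline you would presumably pass best_pipeline.stages and pyspark.ml.classification.LogisticRegressionModel instead.

```python
def first_stage_of_type(stages, stage_type):
    """Return the first stage that is an instance of stage_type."""
    for stage in stages:
        if isinstance(stage, stage_type):
            return stage
    raise LookupError("no stage of type %s" % stage_type.__name__)


# Hypothetical stand-ins for the fitted stages of a PipelineModel.
class FakeTokenizer:
    pass


class FakePCAModel:
    pass


class FakeLogisticRegressionModel:
    pass


stages = [FakeTokenizer(), FakePCAModel(), FakeLogisticRegressionModel()]
best_lr = first_stage_of_type(stages, FakeLogisticRegressionModel)
print(type(best_lr).__name__)
```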

2
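There are actually two questions here:
  • what are the aspects of the fitted model (such as coefficients and intercept)
  • what are the meta-parameters with which the bestModel was fitted.
Unfortunately, the Python API of the fitted estimators (the models) doesn't allow (easy) direct access to the parameters of the estimator, which makes the latter question hard to answer.
However, there is a workaround using the Java API. For completeness, first a full setup of a cross-validated model: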
%pyspark
from pyspark.ml import Pipeline
from pyspark.ml.classification import LogisticRegression
from pyspark.ml.evaluation import BinaryClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder
logit = LogisticRegression(maxIter=10)
pipeline = Pipeline(stages=[logit])
paramGrid = ParamGridBuilder() \
    .addGrid(logit.regParam, [0, 0.01, 0.05, 0.1, 0.5, 1]) \
    .addGrid(logit.elasticNetParam, [0.0, 0.1, 0.5, 0.8, 1]) \
    .build()
evaluator = BinaryClassificationEvaluator(metricName = 'areaUnderPR')
crossval = CrossValidator(estimator=pipeline,
                          estimatorParamMaps=paramGrid,
                          evaluator=evaluator,
                          numFolds=5)
tuned_model = crossval.fit(train)
model = tuned_model.bestModel

The parameter values can then be fetched with generic methods on the Java object, without explicitly referring to methods like getRegParam():

java_model = model.stages[-1]._java_obj
{param.name: java_model.getOrDefault(java_model.getParam(param.name)) 
    for param in paramGrid[0]}

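This executes the following steps:
  1. Get the fitted logistic regression model, as created by the estimator, from the last stage of the best model: crossval.fit(..).bestModel.stages[-1]
  2. Get the internal Java object from _java_obj
  3. Get all configured names from the paramGrid (which is a list of dictionaries). Only the first row is used, assuming it is an actual grid, i.e. each row contains the same keys. Otherwise you need to collect all names ever used in any row.
  4. Get the corresponding Param<T> parameter identifier from the Java object.
  5. Pass the Param<T> instance to the getOrDefault() function to get the actual value.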
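
Step 3 relies on every row of the grid having the same keys. If the rows differ, the names can be collected across all rows first. A minimal sketch, where FakeParam is a hypothetical stand-in for pyspark.ml.param.Param (only its .name attribute matters here) and tuned_param_names is a made-up helper:

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.ml.param.Param; only .name is used.
FakeParam = namedtuple("FakeParam", ["parent", "name"])


def tuned_param_names(param_grid):
    """Collect, in first-seen order, every parameter name used in any grid row."""
    names = []
    for row in param_grid:
        for param in row:  # iterating a dict yields its keys (the params)
            if param.name not in names:
                names.append(param.name)
    return names


reg = FakeParam("LogisticRegression_x", "regParam")
enet = FakeParam("LogisticRegression_x", "elasticNetParam")
max_iter = FakeParam("LogisticRegression_x", "maxIter")

# A "ragged" grid whose rows do not share the same keys.
ragged_grid = [
    {reg: 0.01, enet: 0.0},
    {reg: 0.1, max_iter: 20},
]
names = tuned_param_names(ragged_grid)
print(names)
```

With the real objects, the dict comprehension from the answer would then iterate over these names instead of paramGrid[0], i.e. call java_model.getOrDefault(java_model.getParam(name)) for each name.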

2
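This took a few minutes to decipher, but I figured it out.
from pyspark.ml.evaluation import MulticlassClassificationEvaluator
from pyspark.ml.tuning import CrossValidator, ParamGridBuilder

# Note: I've already built my pipeline (hashingTF, linearSVC, ...); here I set up the ParamGridBuilder for the validator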
paramGrid = ParamGridBuilder() \
                          .addGrid(hashingTF.numFeatures, [1000]) \
                          .addGrid(linearSVC.regParam, [0.1, 0.01]) \
                          .addGrid(linearSVC.maxIter, [10, 20, 30]) \
                          .build()
crossval = CrossValidator(estimator=pipeline,\
                          estimatorParamMaps=paramGrid,\
                          evaluator=MulticlassClassificationEvaluator(),\
                          numFolds=2)

cvModel = crossval.fit(train)

prediction = cvModel.transform(test)


bestModel = cvModel.bestModel

# applicable to your model: print the list of all stages
for stage in bestModel.stages:
    print(stage)


# get a stage's parameter by indexing the right Transformer, then calling .get<Parameter>()
print(bestModel.stages[3].getNumFeatures())

2

I know this is an old question, but I found a way to do this.
@Pierre Gourseaud gives us a nice way to retrieve the hyperparameters of the best model.

import numpy as np

hyperparams = model_cv.getEstimatorParamMaps()[np.argmax(model_cv.avgMetrics)]
print(hyperparams)
[(Param(parent='ALS_cd65d45ab31c', name='implicitPrefs', doc='whether to use implicit preference'),
  True),
 (Param(parent='ALS_cd65d45ab31c', name='nonnegative', doc='whether to use nonnegative constraint for least squares'),
  True),
 (Param(parent='ALS_cd65d45ab31c', name='coldStartStrategy', doc="strategy for dealing with unknown or new users/items at prediction time. This may be useful in cross-validation or production scenarios, for handling user/item ids the model has not seen in the training data. Supported values: 'nan', 'drop'."),
  'drop'),
 (Param(parent='ALS_cd65d45ab31c', name='rank', doc='rank of the factorization'),
  28),
 (Param(parent='ALS_cd65d45ab31c', name='maxIter', doc='max number of iterations (>= 0).'),
  20),
 (Param(parent='ALS_cd65d45ab31c', name='regParam', doc='regularization parameter (>= 0).'),
  0.01),
 (Param(parent='ALS_cd65d45ab31c', name='alpha', doc='alpha for implicit preference'),
  20.0)]


But that isn't a pretty way to show them, so you can do this:
import re

hyper_list = []

for i in range(len(hyperparams.items())):
    hyper_name = re.search("name='(.+?)'", str([x for x in hyperparams.items()][i])).group(1)
    hyper_value = [x for x in hyperparams.items()][i][1]

    hyper_list.append({hyper_name: hyper_value})

print(hyper_list)
[{'implicitPrefs': True}, {'nonnegative': True}, {'coldStartStrategy': 'drop'}, {'rank': 28}, {'maxIter': 20}, {'regParam': 0.01}, {'alpha': 20.0}]

In my case I trained an ALS model, but it should work in your case too, because I also trained with cross-validation!
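
The regular expression probably isn't needed: each key of the param map is a Param object with a .name attribute, so a dict comprehension gives one flat {name: value} dict directly. A minimal sketch, where FakeParam is a hypothetical stand-in for pyspark.ml.param.Param and params_by_name is a made-up helper. With a real model you would pass the hyperparams map from above.

```python
from collections import namedtuple

# Hypothetical stand-in for pyspark.ml.param.Param (real Params also carry a doc string).
FakeParam = namedtuple("FakeParam", ["parent", "name", "doc"])


def params_by_name(param_map):
    """Turn a {Param: value} map into a plain {name: value} dict."""
    return {param.name: value for param, value in param_map.items()}


hyperparams = {
    FakeParam("ALS_x", "rank", "rank of the factorization"): 28,
    FakeParam("ALS_x", "maxIter", "max number of iterations (>= 0)."): 20,
    FakeParam("ALS_x", "regParam", "regularization parameter (>= 0)."): 0.01,
}
flat = params_by_name(hyperparams)
print(flat)
```

A single dict is also easier to work with than a list of one-entry dicts, since you can write flat["regParam"] directly.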

0
If you only want the parameter names and their values:
{param.name: value for param, value in cvModel.bestModel.extractParamMap().items()}

If you don't mind the descriptions and so on, just use the following:

cvModel.bestModel.extractParamMap()

The output will be

    Out[58]: {'aggregationDepth': 2,
 'elasticNetParam': 0.0,
 'family': 'auto',
 'featuresCol': 'features',
 'fitIntercept': True,
 'labelCol': 'label',
 'maxBlockSizeInMB': 0.0,
 'maxIter': 10,
 'predictionCol': 'prediction',
 'probabilityCol': 'probability',
 'rawPredictionCol': 'rawPrediction',
 'regParam': 0.01,
 'standardization': True,
 'threshold': 0.5,
 'tol': 1e-06}

and

    Out[54]: {Param(parent='LogisticRegression_a6db1af69019', name='aggregationDepth', doc='suggested depth for treeAggregate (>= 2).'): 2,
 Param(parent='LogisticRegression_a6db1af69019', name='elasticNetParam', doc='the ElasticNet mixing parameter, in range [0, 1]. For alpha = 0, the penalty is an L2 penalty. For alpha = 1, it is an L1 penalty.'): 0.0,
 Param(parent='LogisticRegression_a6db1af69019', name='family', doc='The name of family which is a description of the label distribution to be used in the model. Supported options: auto, binomial, multinomial'): 'auto',
 Param(parent='LogisticRegression_a6db1af69019', name='featuresCol', doc='features column name.'): 'features',
 Param(parent='LogisticRegression_a6db1af69019', name='fitIntercept', doc='whether to fit an intercept term.'): True,
 Param(parent='LogisticRegression_a6db1af69019', name='labelCol', doc='label column name.'): 'label',
 Param(parent='LogisticRegression_a6db1af69019', name='maxBlockSizeInMB', doc='maximum memory in MB for stacking input data into blocks. Data is stacked within partitions. If more than remaining data size in a partition then it is adjusted to the data size. Default 0.0 represents choosing optimal value, depends on specific algorithm. Must be >= 0.'): 0.0,
 Param(parent='LogisticRegression_a6db1af69019', name='maxIter', doc='max number of iterations (>= 0).'): 10,
 Param(parent='LogisticRegression_a6db1af69019', name='predictionCol', doc='prediction column name.'): 'prediction',
 Param(parent='LogisticRegression_a6db1af69019', name='probabilityCol', doc='Column name for predicted class conditional probabilities. Note: Not all models output well-calibrated probability estimates! These probabilities should be treated as confidences, not precise probabilities.'): 'probability',
 Param(parent='LogisticRegression_a6db1af69019', name='rawPredictionCol', doc='raw prediction (a.k.a. confidence) column name.'): 'rawPrediction',
 Param(parent='LogisticRegression_a6db1af69019', name='regParam', doc='regularization parameter (>= 0).'): 0.01,
 Param(parent='LogisticRegression_a6db1af69019', name='standardization', doc='whether to standardize the training features before fitting the model.'): True,
 Param(parent='LogisticRegression_a6db1af69019', name='threshold', doc='Threshold in binary classification prediction, in range [0, 1]. If threshold and thresholds are both set, they must match.e.g. if threshold is p, then thresholds must be equal to [1-p, p].'): 0.5,
 Param(parent='LogisticRegression_a6db1af69019', name='tol', doc='the convergence tolerance for iterative algorithms (>= 0).'): 1e-06}
