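I am working on a dataset and fitting a model to it with linear regression. Before finalizing it, I would like to try hyperparameter tuning to get the best model available.
I have been running the data through a pipeline that first converts the string column to numeric indices, then one-hot encodes it, then assembles all the columns into a single vector, and finally scales the features before applying linear regression. I would like to know how to set up the grid to get started with hyperparameter tuning.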
import pyspark.ml.feature as ft
WD_indexer = ft.StringIndexer(inputCol="Wind_Direction", outputCol="WD-num")
WD_encoder = ft.OneHotEncoder(inputCol="WD-num", outputCol='WD-vec')
featuresCreator = ft.VectorAssembler(inputCols=["Dew_Point", "Temperature",
"Pressure", "WD-vec", "Wind_Speed","Hours_Snow","Hours_Rain"], outputCol='features')
from pyspark.ml.feature import StandardScaler
feature_scaler = StandardScaler(inputCol="features",outputCol="sfeatures")
from pyspark.ml.regression import LinearRegression
lr = LinearRegression(featuresCol="sfeatures",labelCol="PM_Reading")
So the pipeline looks like this:
from pyspark.ml import Pipeline
pipeline = Pipeline( stages = [WD_indexer, WD_encoder, featuresCreator, feature_scaler, lr] )
How do I set up the grid for this pipeline?
Thanks