Using sklearn's GridSearchCV with a pipeline while preprocessing only once

Question

python numpy machine-learning scikit-learn grid-search

40

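I'm using scikit-learn to tune a model's hyperparameters. I use a pipeline to chain the preprocessing with the estimator. A simplified version of my problem looks like this: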

import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression


grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)

_ = grid.fit(X=np.random.rand(10, 3),
             y=np.random.randint(2, size=(10,)))

In my case the preprocessing (StandardScaler() in the toy example) is time-consuming, and I'm not tuning any of its parameters.

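So when I execute the example, StandardScaler is executed 12 times: 2 fit/predict * 2 cv * 3 parameters (that count assumes three values of C; with the two values in the grid above, the same product gives 8). But every time StandardScaler is executed for a different value of the parameter C, it returns the same output, so it would be much more efficient to compute it once and then run only the estimator part of the pipeline.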
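The repeated work described above can be made visible with a small counting wrapper. This is a minimal sketch, not part of the original question: `CountingScaler` is a hypothetical helper, and the data is synthetic. It counts only calls to `fit()`; each of those fits is followed by a `transform` call when the held-out fold is scored.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class CountingScaler(StandardScaler):
    """Hypothetical helper: a StandardScaler that counts calls to fit()."""

    fit_calls = 0  # class-level, so the count survives cloning by GridSearchCV

    def fit(self, X, y=None, **kwargs):
        type(self).fit_calls += 1
        return super().fit(X, y, **kwargs)


rng = np.random.RandomState(0)
X = rng.rand(40, 3)
y = np.tile([0, 1], 20)  # balanced labels so every fold sees both classes

grid = GridSearchCV(make_pipeline(CountingScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)
grid.fit(X, y)

# One scaler fit per (fold, candidate) pair: 2 folds * 2 values of C.
print(CountingScaler.fit_calls)
```

The scaler is refit for every value of C even though, within a fold, its input never changes.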

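I could manually split the pipeline into the preprocessing (no hyperparameters tuned) and the estimator. But to apply the preprocessing to the data, I should provide only the training set. So I would have to implement the splits manually and not use GridSearchCV at all.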

Is there a simple/standard way to avoid repeating the preprocessing while using GridSearchCV?

- Marc Garcia

https://scikit-learn.org/stable/modules/compose.html - AnandJ

4 Answers

14

For those who stumbled upon this with a slightly different problem, as I did.

Suppose you have this pipeline:

classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_features=100000, ngram_range=(1, 3))),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=SEED, n_jobs=-1))])

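Then, when specifying parameters, you need to prefix each one with the name you gave the estimator step followed by a double underscore, here 'clf__'. So the parameter grid becomes: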

params={'clf__max_features':[0.3, 0.5, 0.7],
        'clf__min_samples_leaf':[1, 2, 3],
        'clf__max_depth':[None]
        }
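To check which prefixed names a pipeline accepts, you can list its parameters. A minimal sketch, assuming a value for `SEED` (the answer does not show how it is defined):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.pipeline import Pipeline

SEED = 0  # assumed value; the original does not show how SEED is defined

classifier = Pipeline([
    ('vectorizer', CountVectorizer(max_features=100000, ngram_range=(1, 3))),
    ('clf', RandomForestClassifier(n_estimators=10, random_state=SEED, n_jobs=-1))])

# Every tunable parameter is exposed as '<step name>__<parameter name>'.
clf_keys = sorted(k for k in classifier.get_params() if k.startswith('clf__'))
print(clf_keys)

# The same keys work with set_params, which is how a grid search applies each candidate.
classifier.set_params(clf__max_depth=None, clf__min_samples_leaf=2)
```

Any key in `clf_keys` is a valid entry for the parameter grid.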

- Ayan Omarov

4

This is not possible in the current version of scikit-learn (0.18.1). However, a fix has been proposed on the GitHub project: https://github.com/scikit-learn/scikit-learn/issues/8830, https://github.com/scikit-learn/scikit-learn/pull/8322.
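Later scikit-learn releases added a `memory` argument to `Pipeline` and `make_pipeline`, covered under caching transformers in the compose documentation linked in the comment above. The following is a sketch on synthetic data, not a benchmark. Two caveats: the fitted scaler is cached per distinct training input, so it is still fit once per CV fold, and caching adds hashing and disk overhead that only pays off when the transformer is expensive.

```python
import shutil
import tempfile

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(40, 3)
y = np.tile([0, 1], 20)  # balanced labels so every fold sees both classes

param_grid = {'logisticregression__C': [0.1, 10.]}
cache_dir = tempfile.mkdtemp()  # temporary folder for the cached transformers
try:
    cached_pipe = make_pipeline(StandardScaler(), LogisticRegression(),
                                memory=cache_dir)
    cached_grid = GridSearchCV(cached_pipe, param_grid=param_grid,
                               cv=2, refit=False)
    cached_grid.fit(X, y)
finally:
    shutil.rmtree(cache_dir)

plain_grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                          param_grid=param_grid, cv=2, refit=False)
plain_grid.fit(X, y)

# Caching changes how often the scaler is fit, not the scores.
print(cached_grid.cv_results_['mean_test_score'])
print(plain_grid.cv_results_['mean_test_score'])
```

Unlike moving the scaler outside the grid search, this keeps the scaler inside each CV split, so validation folds do not influence the scaling.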

- Victor Deplasse

0

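I'm late to the party, but I bring a new solution/insight using Pipeline():

The sub-pipeline contains your model (regressor/classifier) as its single component
The main pipeline is made of the usual components:
- preprocessing components, e.g. scalers, dimensionality reduction, etc.
- your refitted GridSearchCV(regressor, param), tuned with the desired/best parameters for your model (note: don't forget refit=True), based on @Vivek Kumar's remark ref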

# Build an end-to-end pipeline: the data is fed to a regression model that is
# trained and fitted inside the main pipeline.
# This avoids leaking the test/validation set into the training set.
# X_train and y_train are assumed to be defined already.

# Create and train the sub-pipeline
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

sgd_subpipeline = Pipeline(steps=[#('scaler', MinMaxScaler()), # better to not rescale internally
                                  ('SGD',    SGDRegressor(random_state=0)),
])

# Define the hyperparameter grid
param_grid = {
    'SGD__loss':     ['squared_error', 'epsilon_insensitive', 'squared_epsilon_insensitive', 'huber'],
    'SGD__penalty':  ['l2', 'l1', 'elasticnet'],
    'SGD__alpha':    [0.0001, 0.001, 0.01],
    'SGD__l1_ratio': [0.15, 0.25, 0.5]
}

# Perform grid search
grid_search = GridSearchCV(sgd_subpipeline, param_grid, cv=5, n_jobs=-1, verbose=True, refit=True)
grid_search.fit(X_train, y_train)

# Get the best model
best_sgd_reg = grid_search.best_estimator_

# Print the best hyperparameters
print('=========================================[Best Hyperparameters info]=====================================')
print(grid_search.best_params_)

# summarize best
print('Best score (R^2 by default): %.3f' % grid_search.best_score_)
print('Best Config: %s' % grid_search.best_params_)
print('==========================================================================================================')

# Create the main pipeline by chaining the refitted GridSearchCV sub-pipeline

sgd_pipeline = Pipeline(steps=[('scaler', MinMaxScaler()), # better to rescale externally
                               ('SGD',    grid_search),
])

# Fit the best model on the training data within pipeline (like fit any model/transformer): pipe.fit(traindf[features], traindf[labels]) #X, y

sgd_pipeline.fit(X_train, y_train)

#--------------------------------------------------------------
# Displaying a Pipeline with a Preprocessing Step and Regression
from sklearn import set_config
set_config(display="text")

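Alternatively, you can use TransformedTargetRegressor (especially if you need to undo the scaling of y, as @mloning commented here) and chain this component, which wraps your regression model ref. Notes:

There is no need to set the transformer argument unless scaling is required; see the related posts 1, 2, 3, 4, and its score
Note this remark here about not scaling y, because:

...scaling y actually loses the units....

Here, it is suggested to:

...do the transformation outside the pipeline...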

# Build an end-to-end pipeline: the data is fed to a regression model that is
# trained and fitted inside the main pipeline.
# This avoids leaking the test/validation set into the training set.
# X_train and y_train are assumed to be defined already.
# Create the sub-pipeline
from sklearn.linear_model import SGDRegressor
from sklearn.preprocessing import MinMaxScaler
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

sgd_subpipeline = Pipeline(steps=[#('scaler', MinMaxScaler()), # better to not rescale internally
                                  ('SGD',    SGDRegressor(random_state=0)),
])

# Define the hyperparameter grid
param_grid = {
    'SGD__loss':     ['squared_error', 'epsilon_insensitive', 'squared_epsilon_insensitive', 'huber'],
    'SGD__penalty':  ['l2', 'l1', 'elasticnet'],
    'SGD__alpha':    [0.0001, 0.001, 0.01],
    'SGD__l1_ratio': [0.15, 0.25, 0.5]
}

# Perform grid search
grid_search = GridSearchCV(sgd_subpipeline, param_grid, cv=5, n_jobs=-1, verbose=True, refit=True)
grid_search.fit(X_train, y_train)

# Get the best model
best_sgd_reg = grid_search.best_estimator_

# Print the best hyperparameters
print('=========================================[Best Hyperparameters info]=====================================')
print(grid_search.best_params_)

# summarize best
print('Best score (R^2 by default): %.3f' % grid_search.best_score_)
print('Best Config: %s' % grid_search.best_params_)
print('==========================================================================================================')



# Create the main pipeline using sub-pipeline made of TransformedTargetRegressor component
from sklearn.compose import TransformedTargetRegressor

TTR_sgd_pipeline = Pipeline(steps=[('scaler', MinMaxScaler()), # better to rescale externally
                                   #('SGD', SGDRegressor()),
                                    ('TTR', TransformedTargetRegressor(regressor= grid_search, #SGDRegressor(),
                                                                       #transformer=MinMaxScaler(),
                                                                       #func=np.log,
                                                                       #inverse_func=np.exp,
                                                                       check_inverse=False))
])



# Fit the best model on the training data within pipeline (like fit any model/transformer): pipe.fit(traindf[features], traindf[labels]) #X, y
#best_sgd_pipeline.fit(X_train, y_train)
TTR_sgd_pipeline.fit(X_train, y_train)

#--------------------------------------------------------------
# Displaying a Pipeline with a Preprocessing Step and Regression
from sklearn import set_config
set_config(display="diagram")
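What TransformedTargetRegressor does with y can be seen in isolation on a small case. The sketch below is an illustration under assumptions, not the answer's pipeline: it uses synthetic data, LinearRegression as the inner model, and a log/exp pair as the target transform.

```python
import numpy as np
from sklearn.compose import TransformedTargetRegressor
from sklearn.linear_model import LinearRegression

rng = np.random.RandomState(0)
X = rng.rand(50, 2)
y = np.exp(X @ np.array([1.0, 2.0]))  # synthetic, strictly positive target

ttr = TransformedTargetRegressor(regressor=LinearRegression(),
                                 func=np.log,
                                 inverse_func=np.exp)
ttr.fit(X, y)  # the inner model is trained on log(y)

pred_original_units = ttr.predict(X)         # inverse_func already applied
pred_log_units = ttr.regressor_.predict(X)   # what the inner model predicts
```

Predictions from `ttr.predict` come back in the units of the original y, which is why no manual inverse transform is needed after prediction.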

- Mario

- Vivek Kumar · Accepted Answer

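Update: Ideally, the answer below should not be used, because it leads to data leakage, as discussed in the comments. In this answer, GridSearchCV tunes the hyperparameters on data that has already been preprocessed by StandardScaler, so the scaler has seen every validation fold before the split, which is not correct. In most cases this may not matter much, but algorithms that are very sensitive to scaling will give wrong results.

Essentially, GridSearchCV is also an estimator: it implements the fit() and predict() methods that the pipeline uses.

So instead of: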

grid = GridSearchCV(make_pipeline(StandardScaler(), LogisticRegression()),
                    param_grid={'logisticregression__C': [0.1, 10.]},
                    cv=2,
                    refit=False)

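Do this:

clf = make_pipeline(StandardScaler(),
                    GridSearchCV(LogisticRegression(),
                                 # no step prefix: the search wraps the estimator directly
                                 param_grid={'C': [0.1, 10.]},
                                 cv=2,
                                 refit=True))

clf.fit(X_train, y_train)  # X_train, y_train: your training data
clf.predict(X_test)        # X_test: data to predict on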

This fits StandardScaler() only once, for the single call to clf.fit(), instead of the multiple times you described.
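The single scaler fit can be checked with a counting wrapper. A minimal sketch on synthetic data; `CountingScaler` is a hypothetical helper, not a scikit-learn class. Keep in mind that here the scaler sees the whole training set, including the rows that later serve as validation folds.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


class CountingScaler(StandardScaler):
    """Hypothetical helper: a StandardScaler that counts calls to fit()."""

    fit_calls = 0  # class-level counter shared by all instances

    def fit(self, X, y=None, **kwargs):
        type(self).fit_calls += 1
        return super().fit(X, y, **kwargs)


rng = np.random.RandomState(0)
X = rng.rand(40, 3)
y = np.tile([0, 1], 20)  # balanced labels so every fold sees both classes

clf = make_pipeline(CountingScaler(),
                    GridSearchCV(LogisticRegression(),
                                 param_grid={'C': [0.1, 10.]},
                                 cv=2,
                                 refit=True))
clf.fit(X, y)
predictions = clf.predict(X)  # only transform() runs on the scaler here

# The scaler was fit once, before the grid search started splitting the data.
print(CountingScaler.fit_calls)
```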

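Edit:

Changed refit to True for the case where GridSearchCV is used inside a pipeline. As mentioned in the documentation:

refit : boolean, default=True. Refit the best estimator with the entire dataset. If "False", it is impossible to make predictions using this GridSearchCV instance after fitting.

If refit=False, clf.fit() leaves nothing to predict with, because the GridSearchCV object inside the pipeline keeps no fitted best estimator after fit(). When refit=True, GridSearchCV is refitted with the best-scoring parameter combination on the whole data passed to fit().

So refit=False is appropriate only if you build the pipeline just to look at the scores of the grid search. If you want to call clf.predict(), you must use refit=True, otherwise a Not Fitted error is thrown.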
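The difference between the two refit settings can be tried out directly. This is a sketch on synthetic data; `build` is a hypothetical helper, and the exact exception type raised by predict() depends on the scikit-learn version, so both likely candidates are caught.

```python
import numpy as np
from sklearn.exceptions import NotFittedError
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)
X = rng.rand(40, 3)
y = np.tile([0, 1], 20)  # balanced labels so every fold sees both classes


def build(refit):
    """Hypothetical helper: scaler followed by a grid search over C."""
    return make_pipeline(StandardScaler(),
                         GridSearchCV(LogisticRegression(),
                                      param_grid={'C': [0.1, 10.]},
                                      cv=2,
                                      refit=refit))


search_only = build(refit=False).fit(X, y)
# The cross-validation scores are available even without refitting.
scores = search_only.steps[-1][1].cv_results_['mean_test_score']

try:
    search_only.predict(X)
    predict_failed = False
except (AttributeError, NotFittedError):
    predict_failed = True  # no refitted best estimator to predict with

predictions = build(refit=True).fit(X, y).predict(X)
print(scores, predict_failed, predictions.shape)
```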