Cross-validation with imblearn Pipeline and GridSearchCV


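I am trying to use the Pipeline class from imblearn together with GridSearchCV to find the best parameters for classifying an imbalanced dataset. Following the answer mentioned here, I want to leave the validation set out of resampling and resample only the training set, which imblearn's Pipeline appears to do. However, I get an error when implementing the accepted solution. Please tell me what I am doing wrong. Here is my implementation: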

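from imblearn.over_sampling import SMOTE
from imblearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

def imb_pipeline(clf, X, y, params):
    # Two named steps: 'sampling' (SMOTE) and 'classification' (the classifier)
    model = Pipeline([
        ('sampling', SMOTE()),
        ('classification', clf)
    ])

    # Several metrics are tracked; refit='F1' below picks the best model by F1
    score = {'AUC': 'roc_auc',
             'RECALL': 'recall',
             'PRECISION': 'precision',
             'F1': 'f1'}

    gcv = GridSearchCV(estimator=model, param_grid=params, cv=5, scoring=score,
                       n_jobs=12, refit='F1', return_train_score=True)
    gcv.fit(X, y)

    return gcv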

for param, classifier in zip(params, classifiers):
    print("Working on {}...".format(classifier[0]))
    clf = imb_pipeline(classifier[1], X_scaled, y, param) 
    print("Best parameter for {} is {}".format(classifier[0], clf.best_params_))
    print("Best `F1` for {} is {}".format(classifier[0], clf.best_score_))
    print('-'*50)
    print('\n')

Parameters:

[{'penalty': ('l1', 'l2'), 'C': (0.01, 0.1, 1.0, 10)},
 {'n_neighbors': (10, 15, 25)},
 {'n_estimators': (80, 100, 150, 200), 'min_samples_split': (5, 7, 10, 20)}]

Classifiers:
[('Logistic Regression',
  LogisticRegression(C=1.0, class_weight=None, dual=False, fit_intercept=True,
                     intercept_scaling=1, l1_ratio=None, max_iter=100,
                     multi_class='warn', n_jobs=None, penalty='l2',
                     random_state=None, solver='warn', tol=0.0001, verbose=0,
                     warm_start=False)),
 ('KNearestNeighbors',
  KNeighborsClassifier(algorithm='auto', leaf_size=30, metric='minkowski',
                       metric_params=None, n_jobs=None, n_neighbors=5, p=2,
                       weights='uniform')),
 ('Gradient Boosting Classifier',
  GradientBoostingClassifier(criterion='friedman_mse', init=None,
                             learning_rate=0.1, loss='deviance', max_depth=3,
                             max_features=None, max_leaf_nodes=None,
                             min_impurity_decrease=0.0, min_impurity_split=None,
                             min_samples_leaf=1, min_samples_split=2,
                             min_weight_fraction_leaf=0.0, n_estimators=100,
                             n_iter_no_change=None, presort='auto',
                             random_state=None, subsample=1.0, tol=0.0001,
                             validation_fraction=0.1, verbose=0,
                             warm_start=False))]

Error:
ValueError: Invalid parameter C for estimator Pipeline(memory=None,
         steps=[('sampling',
                 SMOTE(k_neighbors=5, kind='deprecated',
                       m_neighbors='deprecated', n_jobs=1,
                       out_step='deprecated', random_state=None, ratio=None,
                       sampling_strategy='auto', svm_estimator='deprecated')),
                ('classification',
                 LogisticRegression(C=1.0, class_weight=None, dual=False,
                                    fit_intercept=True, intercept_scaling=1,
                                    l1_ratio=None, max_iter=100,
                                    multi_class='warn', n_jobs=None,
                                    penalty='l2', random_state=None,
                                    solver='warn', tol=0.0001, verbose=0,
                                    warm_start=False))],
         verbose=False). Check the list of available parameters with `estimator.get_params().keys()`. """

Did you run estimator.get_params().keys()? - Kalpit
Sorry, I don't follow. What do you mean, and how do I do that? - Krishnang K Dalal
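A minimal sketch of what the comment above suggests, for readers with the same question. It uses scikit-learn's own Pipeline with a StandardScaler as a stand-in for the SMOTE step (an assumption made only to keep the example dependency-light). imblearn's Pipeline extends scikit-learn's, so parameter names should be formed the same way. The step name 'scaling' is illustrative and not from the question.

```python
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in pipeline: 'scaling' replaces the question's 'sampling' step (assumption)
model = Pipeline([
    ("scaling", StandardScaler()),
    ("classification", LogisticRegression()),
])

# Every name listed here is a valid key for GridSearchCV's param_grid
valid_names = sorted(model.get_params().keys())
for name in valid_names:
    print(name)
```

The listing contains names such as classification__C but no bare C, which is why the grid in the question is rejected.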
1 Answer

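Please see this example of how to use parameters in a pipeline: https://scikit-learn.org/stable/auto_examples/compose/plot_compare_reduction.html#sphx-glr-auto-examples-compose-plot-compare-reduction-py
Whenever you use a pipeline, you need to pass parameters in a way that lets the pipeline work out which step each parameter belongs to. It does this using the names you gave the steps when you initialized the Pipeline.
For example, in your code: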
model = Pipeline([
    ('sampling', SMOTE()),
    ('classification', clf)
])

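To pass a parameter p1 to SMOTE, you should use sampling__p1 as the parameter name, not p1.

You used "classification" as the name for clf, so prepend it to the parameters that are meant for clf.

Try: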

[{'classification__penalty': ('l1', 'l2'), 'classification__C': (0.01, 0.1, 1.0, 10)},
 {'classification__n_neighbors': (10, 15, 25)},
 {'classification__n_estimators': (80, 100, 150, 200), 'classification__min_samples_split': (5, 7, 10, 20)}]

Make sure there are two underscores between the step name and the parameter name.
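If the grids are already written without prefixes, as in the question, the step name can be added programmatically instead of by hand. The sketch below assumes the step is named 'classification' as in the question. `prefix_param_grid` is a hypothetical helper name, not part of scikit-learn or imblearn.

```python
def prefix_param_grid(step_name, grid):
    """Return a copy of `grid` whose keys read '<step_name>__<parameter>'."""
    return {f"{step_name}__{key}": values for key, values in grid.items()}


# The unprefixed grids from the question
params = [
    {'penalty': ('l1', 'l2'), 'C': (0.01, 0.1, 1.0, 10)},
    {'n_neighbors': (10, 15, 25)},
    {'n_estimators': (80, 100, 150, 200), 'min_samples_split': (5, 7, 10, 20)},
]

# Route every parameter to the 'classification' step of the pipeline
prefixed = [prefix_param_grid('classification', grid) for grid in params]
for grid in prefixed:
    print(grid)
```

Each dictionary in `prefixed` could then be passed as `param_grid` in place of the original one.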


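By the way, do you know how the Pipeline keeps the KFold validation folds out of resampling when SMOTE is paired with GridSearchCV? - Krishnang K Dalal
@KrishnangKDalal That is how the imblearn pipeline is designed. Resampling happens only when fit is called, not when predict is called. - Vivek Kumar
Got it! Thanks a lot. - Krishnang K Dalal
I have a question: if I use gcv.best_estimator_ after the procedure just described, is the resulting model trained on the oversampled data? - francesco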
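The fit-only resampling described in the comments can be illustrated with a toy sketch in plain Python. This is not imblearn's actual code: `ToyOversampler`, `ToyClassifier`, and `ToyPipeline` are hypothetical classes, and the oversampler simply repeats minority rows rather than synthesizing new ones as SMOTE does.

```python
from collections import Counter
from itertools import cycle, islice


class ToyOversampler:
    """Stand-in for SMOTE: repeats minority rows until classes are balanced."""

    def fit_resample(self, X, y):
        counts = Counter(y)
        target = max(counts.values())
        X_out, y_out = list(X), list(y)
        for label, count in counts.items():
            rows = [row for row, lab in zip(X, y) if lab == label]
            extra = list(islice(cycle(rows), target - count))
            X_out.extend(extra)
            y_out.extend([label] * len(extra))
        return X_out, y_out


class ToyClassifier:
    """Majority-class predictor that records how many rows it was fitted on."""

    def fit(self, X, y):
        self.n_seen_ = len(X)
        self.majority_ = Counter(y).most_common(1)[0][0]
        return self

    def predict(self, X):
        return [self.majority_ for _ in X]


class ToyPipeline:
    def __init__(self, sampler, clf):
        self.sampler = sampler
        self.clf = clf

    def fit(self, X, y):
        # Resampling happens here, so only data passed to fit is oversampled
        X_res, y_res = self.sampler.fit_resample(X, y)
        self.clf.fit(X_res, y_res)
        return self

    def predict(self, X):
        # No resampling: held-out rows reach the classifier unchanged
        return self.clf.predict(X)


X_train = [[i] for i in range(10)]
y_train = [0] * 8 + [1] * 2
pipe = ToyPipeline(ToyOversampler(), ToyClassifier()).fit(X_train, y_train)
print(pipe.clf.n_seen_)                     # rows the classifier saw in fit
print(len(pipe.predict([[1], [2], [3]])))   # one prediction per held-out row
```

Under this design, any call to fit goes through the sampler. GridSearchCV's final refit on the full dataset is also a call to fit, so the model in best_estimator_ would be trained on resampled data as well.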
