Cross-validation with Scikit-Learn GridSearchCV and PredefinedSplit - suspiciously good cross-validation results

Question

python scikit-learn cross-validation grid-search

3

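I'd like to run a grid search with scikit-learn's GridSearchCV and compute the cross-validation error using a predefined development/validation split (1-fold cross-validation).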

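I'm worried that I've done something wrong, because my validation accuracy is suspiciously high. Here is where I think I'm going wrong: I split my training data into development and validation sets, train on the development set, and record the cross-validation score on the validation set. My accuracy might be inflated because I'm actually training on a mix of the development and validation sets and then testing on the validation set. I'm not sure I'm using scikit-learn's PredefinedSplit correctly. Details below. Following this answer, I did the following: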

    import numpy as np
    from sklearn.model_selection import train_test_split, PredefinedSplit
    from sklearn.model_selection import GridSearchCV

    # I split up my data into training and test sets. 
    X_train, X_test, y_train, y_test = train_test_split(
        data[training_features], data[training_response], test_size=0.2, random_state=550)

    # sanity check - dimensions of training and test splits
    print(X_train.shape)
    print(X_test.shape)
    print(y_train.shape)
    print(y_test.shape)

    # dimensions of X_train and y_train are (323430, 26) and (323430, 1) respectively
    # dimensions of X_test and y_test are (80858, 26) and (80858, 1)

    ''' Now, I define indices for a pre-defined split. 
    this is a 323430 dimensional array, where the indices for the development
    set are set to -1, and the indices for the validation set are set to 0.'''

    validation_idx = np.full(y_train.shape[0], -1)  # one entry per training row
    np.random.seed(550)
    validation_idx[np.random.choice(validation_idx.shape[0], 
           int(round(.2*validation_idx.shape[0])), replace = False)] = 0

    # Now, create a list which contains a single tuple of two elements, 
    # which are arrays containing the indices for the development and
    # validation sets, respectively.
    validation_split = list(PredefinedSplit(validation_idx).split())

    # sanity check
    print(len(validation_split[0][0])) # outputs 258744 
    print(len(validation_split[0][0]) / float(validation_idx.shape[0])) # outputs .8
    print(validation_idx.shape[0] == y_train.shape[0]) # True
    print(set(validation_split[0][0]).intersection(set(validation_split[0][1]))) # set([])
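
As a small illustration of the fold encoding used above, here is a minimal sketch on a toy six-element array (the array and the variable names are made up for this example, not taken from the question's data). It shows how `PredefinedSplit` turns `-1`/`0` labels into a single (development, validation) pair of index arrays.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit

# Toy fold assignment (hypothetical data):
# -1 = never used for validation, so it always stays in the development part
#  0 = member of the single validation fold
test_fold = np.array([-1, -1, 0, -1, 0, -1])

ps = PredefinedSplit(test_fold)
splits = list(ps.split())

# With only one distinct non-negative label there is exactly one split.
dev_idx, val_idx = splits[0]
print(ps.get_n_splits())
print(dev_idx)  # positions labelled -1
print(val_idx)  # positions labelled 0
```

The split itself is therefore disjoint by construction; it only describes which rows are scored during the search, not which rows any later refit will see.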

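Now I run a grid search using GridSearchCV. My intention is that, for each parameter combination on the grid, a model is fit on the development set, and the cross-validation score is recorded when the resulting estimator is applied to the validation set.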

    from xgboost import XGBClassifier

    # a vanilla XGBoost model
    model1 = XGBClassifier()

    # create a parameter grid for the number of trees and depth of trees
    n_estimators = range(300, 1100, 100)
    max_depth = [8, 10]
    param_grid = dict(max_depth=max_depth, n_estimators=n_estimators)

    # A grid search. 
    # NOTE: `cv` receives the list of (development, validation) index arrays built from PredefinedSplit above.
    grid_search = GridSearchCV(model1, param_grid,
           scoring='neg_log_loss',
           n_jobs=-1, 
           cv=validation_split,
           verbose=1)

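Now, here's where a red flag went up for me. I use the best estimator found by the grid search to compute the accuracy on the validation set. It's very high: 0.89207865689639176. What's worse, it's almost identical to the accuracy I get if I use the classifier on the development set (on which it was just trained): 0.89295597192591902. But when I use the classifier on the true test set, I get a much lower accuracy, roughly 0.78: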

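    from sklearn.metrics import accuracy_score

    # grid_result2 holds the fitted grid search; the fit call is not shown here
    # accuracy score on the validation set. This yields .89207865
    accuracy_score(y_pred = 
           grid_result2.predict(X_train.iloc[validation_split[0][1]]),
           y_true=y_train.iloc[validation_split[0][1]])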

    # accuracy score when applied to the development set. This yields .8929559
    accuracy_score(y_pred = 
           grid_result2.predict(X_train.iloc[validation_split[0][0]]),
           y_true=y_train.iloc[validation_split[0][0]])

    # finally, the score when applied to the test set. This yields .783 
    accuracy_score(y_pred = grid_result2.predict(X_test), y_true = y_test)

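To me, the near-identical accuracy on the development and validation sets, combined with the significant drop in accuracy on the test set, is a clear sign that I'm accidentally training on the validation data, and so my cross-validated score is not representative of the model's true accuracy.

I can't seem to find where I went wrong, mostly because I don't know what GridSearchCV does under the hood when it receives a PredefinedSplit object as the `cv` argument.

Any ideas where I went wrong? If you need more details or elaboration, please let me know. The code is also in this notebook on GitHub.

Thanks!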

- Timi Bennatan

Please look at the `cv_results_` attribute of GridSearchCV after fitting. You can get the training and test scores for each fold (1 in your case). - Vivek Kumar
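
A minimal sketch of what that inspection could look like, assuming synthetic data from `make_classification` and a `DecisionTreeClassifier` swapped in for XGBoost so the example stays self-contained (both are illustrative choices, not part of the question). With a single predefined fold, the per-fold scores live under the `split0_*` keys; train scores are only recorded when `return_train_score=True` is passed.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data (assumption for this sketch).
X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# Mark 100 random rows as the single validation fold, the rest as development.
rng = np.random.RandomState(0)
test_fold = np.full(len(y), -1)
test_fold[rng.choice(len(y), size=100, replace=False)] = 0

search = GridSearchCV(
    DecisionTreeClassifier(random_state=0),
    param_grid={"max_depth": [2, 8]},
    scoring="accuracy",
    cv=PredefinedSplit(test_fold),
    return_train_score=True,  # needed to get the split0_train_score column
)
search.fit(X, y)

# One row per parameter combination: development score vs. validation score.
results = search.cv_results_
for params, dev_score, val_score in zip(
    results["params"], results["split0_train_score"], results["split0_test_score"]
):
    print(params, round(dev_score, 3), round(val_score, 3))
```

These scores come from models fit on the development rows only, so they are the ones to compare against the held-out test set, rather than predictions from the refitted `best_estimator_`.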

2 Answers

0

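Yes, there is a data leakage problem with the validation data. You need to set `refit=False` for GridSearchCV so that it does not refit on the entire data, which includes both the training and the validation rows.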

- Lodging

- Alvin Thai · Accepted Answer

4

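You need to set `refit=False` (not the default option); otherwise, once the grid search is complete, it will refit the estimator on the whole dataset (ignoring `cv`).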

- Alvin Thai
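
To make the refit behaviour concrete, here is a minimal sketch on synthetic data. It is not the asker's setup: `make_classification`, the `DecisionTreeClassifier` stand-in for XGBoost, and the one-candidate grid are all assumptions chosen so that the effect is easy to see. With the default `refit=True`, the refitted estimator has already seen the validation rows, so scoring it on them is optimistic; with `refit=False`, a fresh copy can be fit on the development rows only.

```python
import numpy as np
from sklearn.base import clone
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV, PredefinedSplit
from sklearn.tree import DecisionTreeClassifier

# Noisy synthetic data (assumption for this sketch): 20% of labels are flipped.
X, y = make_classification(n_samples=600, n_features=10, flip_y=0.2, random_state=0)

rng = np.random.RandomState(0)
test_fold = np.full(len(y), -1)
test_fold[rng.choice(len(y), size=120, replace=False)] = 0
cv = PredefinedSplit(test_fold)
dev_idx, val_idx = next(cv.split())

base = DecisionTreeClassifier(random_state=0)
# A single, deliberately overfitting candidate: a fully grown tree.
grid = {"max_depth": [None]}

# refit=True (default): best_estimator_ is retrained on ALL rows of X,
# validation rows included, so this "validation" accuracy is inflated.
leaky = GridSearchCV(base, grid, scoring="accuracy", cv=cv).fit(X, y)
leaky_val_acc = accuracy_score(y[val_idx], leaky.predict(X[val_idx]))

# refit=False: no final refit happens, so fit a fresh copy on the
# development rows only and score it on the untouched validation rows.
search = GridSearchCV(base, grid, scoring="accuracy", cv=cv, refit=False).fit(X, y)
honest = clone(base).set_params(**search.best_params_).fit(X[dev_idx], y[dev_idx])
honest_val_acc = accuracy_score(y[val_idx], honest.predict(X[val_idx]))

print("refit=True, scored on validation rows: ", leaky_val_acc)
print("dev-only fit, scored on validation rows:", honest_val_acc)
print("best_score_ reported by the search:     ", search.best_score_)
```

In this sketch the dev-only score matches `best_score_`, which is the number the search itself reports for the validation fold. Note that with `refit=False` the search object has no `best_estimator_` and cannot `predict`, which is why the final model is fit by hand.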

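I agree with your answer. I always use the approach Timi describes in this question: split the data into training, validation, and test sets, then run the grid search with 1-fold cross-validation on the validation set, as shown in the question, but with `refit=True` to get the best model trained on training + validation. Finally, when I report my model's performance, I measure it on the held-out test set. Is there anything wrong with my approach? - SKSKSKSK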