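K-fold cross-validation does not, by itself, make a model more accurate. With xgb, for example, there are many hyperparameters to specify, such as subsample and eta. To gauge how a chosen set of parameters performs on unseen data, we use k-fold cross-validation to split the data into several training and test samples and measure the out-of-sample accuracy.
Typically we try several candidate values for each parameter and pick the combination with the lowest average error. You then refit the model with those parameters. This post and its answers discuss the point.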
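To make the splitting step concrete, here is a minimal sketch of how k folds can be built by hand. It uses only NumPy; the helper name `k_fold_indices` is hypothetical and made up for illustration, and this is not how xgboost implements its folds internally.

```python
import numpy as np

def k_fold_indices(n_samples, k, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(n_samples)   # shuffle the row indices once
    folds = np.array_split(shuffled, k)     # k roughly equal chunks
    for i in range(k):
        test_idx = folds[i]                 # fold i is held out
        train_idx = np.concatenate(
            [folds[j] for j in range(k) if j != i])  # the rest is for training
        yield train_idx, test_idx

splits = list(k_fold_indices(n_samples=10, k=5))
for train_idx, test_idx in splits:
    print("test rows:", sorted(test_idx.tolist()), "| train size:", len(train_idx))
```

Every row is held out exactly once, so averaging the test error over the k folds gives an estimate of out-of-sample error for one fixed set of hyperparameters.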
For example, the following is similar to what you did, and it gives the train/test error for only one set of values:
import pandas as pd
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
from sklearn.datasets import make_classification

# Synthetic binary classification data
X, y = make_classification(n_samples=500, class_sep=0.7)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.33, random_state=42)

# xgb.cv expects the training data as a DMatrix
data_dmatrix = xgb.DMatrix(data=X_train, label=y_train)

params = {'objective': 'binary:logistic',
          'eval_metric': 'logloss',
          'eta': 0.01,
          'subsample': 0.1}

# 5-fold CV; returns one row of mean/std log loss per boosting round
xgb_cv = xgb.cv(dtrain=data_dmatrix, params=params, nfold=5,
                metrics='logloss', seed=42)
   train-logloss-mean  train-logloss-std  test-logloss-mean  test-logloss-std
0            0.689600           0.000517           0.689820          0.001009
1            0.686462           0.001612           0.687151          0.002089
2            0.683626           0.001438           0.684667          0.003009
3            0.680450           0.001100           0.681929          0.003604
4            0.678269           0.001399           0.680310          0.002781
5            0.675170           0.001867           0.677254          0.003086
6            0.672349           0.002483           0.674432          0.004349
7            0.668964           0.002484           0.671493          0.004579
8            0.666361           0.002831           0.668978          0.004200
9            0.663682           0.003881           0.666744          0.003598
The last row is the result from the final boosting round, and that is the one we use for evaluation.
Suppose we want to test several values of eta together with subsample, for example:
grid = pd.DataFrame({'eta': [0.01, 0.05, 0.1] * 2,
                     'subsample': np.repeat([0.1, 0.3], 3)})
    eta  subsample
0  0.01        0.1
1  0.05        0.1
2  0.10        0.1
3  0.01        0.3
4  0.05        0.3
5  0.10        0.3
Normally we could use GridSearchCV, but here is a version that uses xgb.cv:
def fit(x):
    # x is one row of the grid, holding eta and subsample
    params = {'objective': 'binary:logistic',
              'eval_metric': 'logloss',
              'eta': x['eta'],
              'subsample': x['subsample']}
    xgb_cv = xgb.cv(dtrain=data_dmatrix, params=params,
                    nfold=5, metrics='logloss', seed=42)
    # keep only the final boosting round
    return xgb_cv.iloc[-1].values

grid[['train-logloss-mean', 'train-logloss-std',
      'test-logloss-mean', 'test-logloss-std']] = grid.apply(fit, axis=1, result_type='expand')
    eta  subsample  train-logloss-mean  train-logloss-std  test-logloss-mean  test-logloss-std
0  0.01        0.1            0.663682           0.003881           0.666744          0.003598
1  0.05        0.1            0.570629           0.012555           0.580309          0.023561
2  0.10        0.1            0.503440           0.017761           0.526891          0.031659
3  0.01        0.3            0.646587           0.002063           0.653741          0.004201
4  0.05        0.3            0.512229           0.008013           0.545113          0.018700
5  0.10        0.3            0.414103           0.012427           0.472379          0.032606
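Rather than reading the best row off by eye, it can be picked programmatically. The sketch below rebuilds just the relevant columns from the numbers printed in the results table above, so it runs on its own; in practice you would apply the same two lines to the real `grid` DataFrame. The name `best_params` is only illustrative.

```python
import pandas as pd

# Values copied from the results table above
grid = pd.DataFrame({
    "eta": [0.01, 0.05, 0.10, 0.01, 0.05, 0.10],
    "subsample": [0.1, 0.1, 0.1, 0.3, 0.3, 0.3],
    "test-logloss-mean": [0.666744, 0.580309, 0.526891,
                          0.653741, 0.545113, 0.472379],
})

# Row with the lowest mean out-of-fold log loss
best = grid.loc[grid["test-logloss-mean"].idxmin()]
best_params = {"eta": float(best["eta"]), "subsample": float(best["subsample"])}
print(best_params)  # {'eta': 0.1, 'subsample': 0.3}
```

Note that the selection is made on the test (held-out) log loss, not the training log loss, since the training loss keeps improving as the model fits the data more closely.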
The table shows that eta = 0.10 and subsample = 0.3 give the lowest test log loss, so all that remains is to refit the model with those parameters:
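# With a logistic objective, XGBRegressor.predict() returns probabilities;
# xgb.XGBClassifier is the usual choice if you want class labels.
xgb_reg = xgb.XGBRegressor(objective='binary:logistic',
                           eval_metric='logloss',
                           n_estimators=10,  # match the 10 rounds (xgb.cv default) evaluated above
                           eta=0.1,
                           subsample=0.3)
xgb_reg.fit(X_train, y_train)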