我正在使用StratifiedKFold,所以我的代码看起来像这样:
def train_model(X,y,X_test,folds,model):
scores=[]
for fold_n, (train_index, valid_index) in enumerate(folds.split(X, y)):
X_train,X_valid = X[train_index],X[valid_index]
y_train,y_valid = y[train_index],y[valid_index]
model.fit(X_train,y_train)
y_pred_valid = model.predict(X_valid).reshape(-1,)
scores.append(roc_auc_score(y_valid, y_pred_valid))
print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))
folds = StratifiedKFold(10,shuffle=True,random_state=0)
lr = LogisticRegression(class_weight='balanced',penalty='l1',C=0.1,solver='liblinear')
train_model(X_train,y_train,X_test,repeted_folds,lr)
现在,在训练模型之前,我想要对数据进行标准化。那么,哪种方法是正确的?
1)
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_test = scaler.transform(X_test)
在调用train_model函数之前,需要先执行以下操作:
2)
在函数内部进行标准化处理,代码如下:
def train_model(X,y,X_test,folds,model):
scores=[]
for fold_n, (train_index, valid_index) in enumerate(folds.split(X, y)):
X_train,X_valid = X[train_index],X[valid_index]
y_train,y_valid = y[train_index],y[valid_index]
scaler = StandardScaler()
X_train = scaler.fit_transform(X_train)
X_vaid = scaler.transform(X_valid)
X_test = scaler.transform(X_test)
model.fit(X_train,y_train)
y_pred_valid = model.predict(X_valid).reshape(-1,)
scores.append(roc_auc_score(y_valid, y_pred_valid))
print('CV mean score: {0:.4f}, std: {1:.4f}.'.format(np.mean(scores), np.std(scores)))
根据我的了解,在第二个选项中,我不会泄露数据。如果我不使用pipeline,哪种方法是正确的?如果我想使用交叉验证,该如何使用pipeline?
cv
参数的情况,例如使用交叉验证分割器。 - Horace