What is the difference between the Optuna score and the cross-validation score?

The accuracy score I get from Optuna differs from the score returned by cross_val_score. Why does this happen, and which score should I trust? I used the hyperparameters obtained from Optuna in the cross_val_score call.
import optuna
import lightgbm as lgb
from sklearn.model_selection import cross_val_score, KFold

def objective_lgb(trial):
    # Search space for the LightGBM hyperparameters
    num_leaves = trial.suggest_int("num_leaves", 2, 1000)
    max_depth = trial.suggest_int("max_depth", 2, 100)
    learning_rate = trial.suggest_float('learning_rate', 0.001, 1)
    n_estimators = trial.suggest_int('n_estimators', 100, 2000)
    min_child_samples = trial.suggest_int('min_child_samples', 3, 1000)
    subsample = trial.suggest_float('subsample', 0.000001, 1)
    colsample_bytree = trial.suggest_float('colsample_bytree', 0.00000001, 1)
    reg_alpha = trial.suggest_float('reg_alpha', 0, 400)
    reg_lambda = trial.suggest_float("reg_lambda", 0, 400)
    importance_type = trial.suggest_categorical('importance_type', ["split", "gain"])

    lgb_clf = lgb.LGBMClassifier(random_state=1,
                         objective="multiclass",
                         num_class = 3, 
                         importance_type=importance_type,
                         num_leaves=num_leaves,
                         max_depth=max_depth,
                         learning_rate=learning_rate,
                         n_estimators=n_estimators,
                         min_child_samples=min_child_samples,
                         subsample=subsample,
                         colsample_bytree=colsample_bytree,
                         reg_alpha=reg_alpha,
                         reg_lambda=reg_lambda
                         )
    score = cross_val_score(lgb_clf, train_x, train_y, n_jobs=-1, cv=KFold(n_splits=10,  shuffle=True, random_state=1), scoring='accuracy')
    mean_score = score.mean()
    return mean_score
lgb_study = optuna.create_study(direction="maximize")
lgb_study.optimize(objective_lgb, n_trials=1500)

lgb_trial = lgb_study.best_trial
print("accuracy:", lgb_trial.value)
print()
print("Best params:", lgb_trial.params)
=========================================================
def light_check(x, params):
    model = lgb.LGBMClassifier()
    scores = cross_val_score(model, x, y, cv=KFold(n_splits=10, shuffle=True, random_state=1), n_jobs=-1)
    mean = scores.mean()
    return scores, mean
light_check(x,{'num_leaves': 230, 'max_depth': 53, 'learning_rate': 0.04037430031226232, 'n_estimators': 1143, 'min_child_samples': 381, 'subsample': 0.12985990464862135, 'colsample_bytree': 0.8914118949904919, 'reg_alpha': 31.869348047391053, 'reg_lambda': 17.45653692887209, 'importance_type': 'split'})
2 Answers

From what I can see, you use train_x and train_y in the Optuna call, but pass x and y to light_check. Assuming you did the split somewhere in code that isn't shown, the dataset Optuna works on is smaller, so you get different numbers.
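For comparison, here is a minimal sketch (my addition, not the answerer's code) of a check run on the same data: it assumes the train_x, train_y and lgb_study objects from the question, reuses the same CV splitter, and actually applies the tuned hyperparameters, so the two numbers are comparable.

import lightgbm as lgb
from sklearn.model_selection import cross_val_score, KFold

def light_check_same_data(x, y, params):
    # Apply the tuned hyperparameters instead of LightGBM's defaults
    model = lgb.LGBMClassifier(random_state=1, objective="multiclass",
                               num_class=3, **params)
    # Same splitter (same seed and shuffling) as inside objective_lgb
    cv = KFold(n_splits=10, shuffle=True, random_state=1)
    scores = cross_val_score(model, x, y, cv=cv, scoring='accuracy', n_jobs=-1)
    return scores, scores.mean()

scores, mean = light_check_same_data(train_x, train_y, lgb_study.best_trial.params)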

I ran the cross-validation with train_x and train_y, but the score was 0.5121249... - Ten

Optuna reports, as the accuracy score, whatever value you return from the objective function, which in your question corresponds to mean_score. Also, during cross-validation you must feed the model the training data, which you did correctly there. In the light_check function, however, you mistakenly feed the model all of the data.
For the final evaluation of the model, the correct approach is to hold out a portion of the data at the start and evaluate on it as a test set. Validation data is used purely for model validation, while test data is used for model evaluation.
For a better understanding, please visit the following address, where I demonstrate how to set hyperparameters with Optuna. It will give you a fuller picture of model analysis and evaluation.

https://www.kaggle.com/code/amir9473/tuning-hyperparameters-ml-classification-acu-94
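For illustration, a minimal sketch of the evaluation flow described above (my example, not the code from the linked notebook); X, y as the full dataset and best_params as the dictionary found by Optuna are assumptions here.

import lightgbm as lgb
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# 1) Hold out a test set before any tuning
train_x, test_x, train_y, test_y = train_test_split(X, y, test_size=0.2,
                                                    random_state=1, stratify=y)

# 2) Tune with Optuna / cross_val_score on train_x, train_y only (as in the question)

# 3) Fit the final model with the best hyperparameters on the full training split
final_model = lgb.LGBMClassifier(random_state=1, objective="multiclass",
                                 num_class=3, **best_params)
final_model.fit(train_x, train_y)

# 4) Report the test accuracy once, on data the model has never seen
print("test accuracy:", accuracy_score(test_y, final_model.predict(test_x)))

The cross-validated score on the training split is the model-selection signal; the single test-set number is the final estimate you report.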


Please include a code snippet in your answer. - KT12
