Is log loss the same in XGBoost and sklearn?


I am using XGBoost on a new dataset. Here is my code:

import xgboost as xgb
import pandas as pd
import numpy as np

train = pd.read_csv("train_users_processed_onehot.csv")
labels = train["Buy"].map({"Y":1, "N":0})

features = train.drop("Buy", axis=1)
data_dmat = xgb.DMatrix(data=features, label=labels)

params={"max_depth":5, "min_child_weight":2, "eta": 0.1, "subsamples":0.9, "colsample_bytree":0.8, "objective" : "binary:logistic", "eval_metric": "logloss", "seed": 2333}
rounds = 6000

result = xgb.cv(params=params, dtrain=data_dmat, num_boost_round=rounds,
                early_stopping_rounds=50, as_pandas=True, seed=2333)
print(result)

The result is (intermediate rows omitted):
         test-logloss-mean  test-logloss-std  train-logloss-mean  
0             0.683354          0.000058            0.683206  
165           0.622318          0.000661            0.607680   
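
Because early_stopping_rounds is set, the DataFrame returned by xgb.cv should stop at the best iteration, so the lowest test logloss sits in the last row. A minimal sketch for reading it off, assuming the column names shown above:

# With early stopping, the history ends at the best boosting round.
best_round = result["test-logloss-mean"].idxmin()
print(best_round, result["test-logloss-mean"].min())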

But when I tune parameters with GridSearchCV, the result is very different from what I expected. More specifically, here is my code:

import xgboost as xgb
from sklearn.model_selection import GridSearchCV
from xgboost.sklearn import XGBClassifier
import numpy as np
import pandas as pd

train_dataframe = pd.read_csv("train_users_processed_onehot.csv")
train_labels = train_dataframe["Buy"].map({"Y":1, "N":0})
train_features = train_dataframe.drop("Buy", axis=1)

params = {"max_depth": [5], "min_child_weight": [2]}

estimator = XGBClassifier(learning_rate=0.1, n_estimators=170, max_depth=2,
                          min_child_weight=4, objective="binary:logistic",
                          subsample=0.9, colsample_bytree=0.8, seed=2333)

gsearch1 = GridSearchCV(estimator, param_grid=params, n_jobs=4, verbose=1,
                        scoring="neg_log_loss")
gsearch1.fit(train_features.values, train_labels.values)

print(pd.DataFrame(gsearch1.cv_results_))
print(gsearch1.best_params_)
print(-gsearch1.best_score_)  # neg_log_loss is negated, so flip the sign back

And I got:

   mean_fit_time  mean_score_time  mean_test_score  mean_train_score  
0       87.71497         0.209772        -3.134132         -0.567306 

Clearly, 3.134132 is very different from 0.622318. Why is this?

Thanks!


I have done some more research on this problem; please see https://dev59.com/n53ha4cB1Zd3GeqPS1Yw. - DarkZero
1 Answer


You are passing different parameters to the two of them:

  • max_depth: 5 vs 2
  • eta: 0.1 vs 0.3 (default)
  • min_child_weight: 2 vs 4

The parameters you pass to sklearn are more conservative (so the model is less likely to overfit), and the algorithm therefore does not try to fit the data as hard. In turn you get a lower score - which is exactly what is expected.
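
One way to test this explanation is to pass identical settings to both APIs and compare the scores directly. A minimal sketch reusing the names from the question; fixing n_estimators=170 with no early stopping, and cv=3 to mirror xgb.cv's default nfold=3, are assumptions made so that both sides train and evaluate the same way:

from sklearn.model_selection import cross_val_score

# Mirror the xgb.cv parameters in the sklearn wrapper:
# eta maps to learning_rate, num_boost_round maps to n_estimators.
clf = XGBClassifier(max_depth=5, min_child_weight=2, learning_rate=0.1,
                    n_estimators=170, subsample=0.9, colsample_bytree=0.8,
                    objective="binary:logistic", seed=2333)

scores = cross_val_score(clf, train_features.values, train_labels.values,
                         scoring="neg_log_loss", cv=3)
print(-scores.mean())  # comparable to the test-logloss-mean from xgb.cv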


As far as I know, the parameters specified in param_grid override those specified in the XGBClassifier, and learning_rate in XGBClassifier is exactly eta in XGBoost, so I guess I am passing the same parameters to both. - DarkZero
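
If in doubt, the merged parameters can be inspected after the search; a quick check using the standard sklearn API (best_estimator_ is available because GridSearchCV refits on the best setting by default):

# Confirm which values the winning fit actually used.
fitted = gsearch1.best_estimator_
print(fitted.get_params()["max_depth"], fitted.get_params()["min_child_weight"])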
