Hyperparameter tuning in random forest

Question

python machine-learning scikit-learn random-forest grid-search

I am trying to predict house prices (medv) on the Boston dataset with the random forest algorithm, using sklearn's RandomForestRegressor. In total I ran 3 iterations, as follows:

Iteration 1: model with default hyperparameters

#1. import the class/model
from sklearn.ensemble import RandomForestRegressor
#2. Instantiate the estimator
RFReg = RandomForestRegressor(random_state = 1, n_jobs = -1) 
#3. Fit the model with data aka model training
RFReg.fit(X_train, y_train)

#4. Predict the response for a new observation
y_pred = RFReg.predict(X_test)


y_pred_train = RFReg.predict(X_train)

Results of iteration 1

{'RMSE Test': 2.9850839211419435, 'RMSE Train': 1.2291604936401441}

Iteration 2: I used RandomizedSearchCV to find the best hyperparameter values

from sklearn.ensemble import RandomForestRegressor
RFReg = RandomForestRegressor(n_estimators = 500, random_state = 1, n_jobs = -1) 

import numpy as np

param_grid = {
    'max_features': ["auto", "sqrt", "log2"],
    'min_samples_split': np.linspace(0.1, 1.0, 10),
    'max_depth': [x for x in range(1, 20)]
}

from sklearn.model_selection import RandomizedSearchCV
CV_rfc = RandomizedSearchCV(estimator=RFReg, param_distributions =param_grid, n_jobs = -1, cv= 10, n_iter = 50)
CV_rfc.fit(X_train, y_train)

The best hyperparameters I got were:

CV_rfc.best_params_
#{'min_samples_split': 0.1, 'max_features': 'auto', 'max_depth': 18}
CV_rfc.best_score_
#0.8021713812777814

So I trained a new model with these best hyperparameters.

#1. import the class/model
from sklearn.ensemble import RandomForestRegressor
#2. Instantiate the estimator
RFReg = RandomForestRegressor(n_estimators = 500, random_state = 1, n_jobs = -1, min_samples_split = 0.1, max_features = 'auto', max_depth = 18) 
#3. Fit the model with data aka model training
RFReg.fit(X_train, y_train)

#4. Predict the response for a new observation
y_pred = RFReg.predict(X_test)


y_pred_train = RFReg.predict(X_train)

Results of iteration 2

{'RMSE Test': 3.2836794902147926, 'RMSE Train': 2.71230367772569}

Iteration 3: I used GridSearchCV to find the best hyperparameter values

from sklearn.ensemble import RandomForestRegressor
RFReg = RandomForestRegressor(n_estimators = 500, random_state = 1, n_jobs = -1) 

param_grid = {
    'max_features': ["auto", "sqrt", "log2"],
    'min_samples_split': np.linspace(0.1, 1.0, 10),
    'max_depth': [x for x in range(1, 20)]
}

from sklearn.model_selection import GridSearchCV
CV_rfc = GridSearchCV(estimator=RFReg, param_grid=param_grid, cv= 10, n_jobs = -1)
CV_rfc.fit(X_train, y_train)

The best hyperparameters I got were:

CV_rfc.best_params_
#{'max_depth': 12, 'max_features': 'auto', 'min_samples_split': 0.1}
CV_rfc.best_score_
#0.8021820114800677

Results of iteration 3

{'RMSE Test': 3.283690568225705, 'RMSE Train': 2.712331014201783}

My function for evaluating RMSE:

import numpy as np
from sklearn.metrics import mean_squared_error

def model_evaluate(y_train, y_test, y_pred, y_pred_train):
    # RMSE on the test set
    rmse_test = np.sqrt(mean_squared_error(y_test, y_pred))
    # RMSE on the training set
    rmse_train = np.sqrt(mean_squared_error(y_train, y_pred_train))

    metrics = {
              'RMSE Test': rmse_test,
              'RMSE Train': rmse_train}

    return metrics

After these three iterations I have the following questions:

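1. Why are the results of the tuned model worse than those of the model with default parameters, even though I used RandomizedSearchCV and GridSearchCV? Ideally, a model tuned with cross-validation should give better results.
2. I know that cross-validation only evaluates the combinations of values present in param_grid. There could be good values that are not in my param_grid. How do I deal with that situation?
3. How do I decide which ranges of values to try for max_features, min_samples_split, max_depth, or any other hyperparameter, in order to improve the accuracy of a machine learning model? (So that I at least get a tuned model that is better than the model with default hyperparameters.)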

- Rookie_123

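Your questions #2 and #3 have no "hard" scientific answer (not even broad guidelines); they are part of the experience (some of which you may already have gained yourself, and which in practice can also translate to "always try the default parameters"). Arguably, the answer to your question #1 may lie in your question #2... - desertnaut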

1 Answer

- hellpanderr · Accepted Answer

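Why are the results of the tuned model worse than those of the model with default parameters, even though I used RandomizedSearchCV and GridSearchCV? Ideally, a model tuned with cross-validation should give good results.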

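Your second question partly answers your first one, but I tried to reproduce your results on the Boston dataset. I got {'test_rmse': 3.987, 'train_rmse': 1.442} with default parameters, {'test_rmse': 3.98, 'train_rmse': 3.426} with the parameters "tuned" by random search, and {'test_rmse': 3.993, 'train_rmse': 3.481} with grid search. Then I used hyperopt with the following parameter space: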

from hyperopt import hp

{'max_depth': hp.choice('max_depth', range(1, 100)),
 'max_features': hp.choice('max_features', range(1, x_train.shape[1])),
 'min_samples_split': hp.uniform('min_samples_split', 0.1, 1)}

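After about 200 trials the result looked like this (plot not preserved), so I widened the interval for 'min_samples_split' to (0.01, 1). That gave me the best result, {'test_rmse': 3.278, 'train_rmse': 1.716}, with min_samples_split equal to 0.01. According to the docs, when min_samples_split is a float, the minimum number of samples required for a split is ceil(min_samples_split * n_samples). In our case np.ceil(0.1 * len(x_train)) = 34, which may be a bit large for a dataset as small as this one.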
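To make the ceil(min_samples_split * n_samples) arithmetic concrete, here is a minimal sketch. The helper name `effective_min_samples_split` and the training set size of 339 rows are assumptions made for illustration; 339 is simply one size that is consistent with the value 34 quoted above, not a figure taken from the original post.

```python
import math


def effective_min_samples_split(fraction, n_samples):
    """Hypothetical helper: sample count implied by a float min_samples_split."""
    return math.ceil(fraction * n_samples)


# Assumption: a training set size consistent with the 34 quoted in the answer.
n_train = 339

for fraction in (0.1, 0.01):
    count = effective_min_samples_split(fraction, n_train)
    print(f"min_samples_split={fraction} -> at least {count} samples per split")
```

Under that assumption, the lower bound of 0.1 forbids splitting any node with fewer than 34 samples, while 0.01 lowers the threshold to 4, which lets the trees grow much deeper on a small dataset.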

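I know that cross-validation only evaluates the combinations of values present in param_grid. There could be good values that are not in my param_grid. How do I deal with that situation?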

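How do I decide which ranges of values to try for max_features, min_samples_split, max_depth, or any other hyperparameter, in order to improve the accuracy of a machine learning model? (So that I at least get a tuned model that is better than the model with default hyperparameters.)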

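You can't know this in advance, so you have to research each algorithm to see what kind of parameter spaces are usually searched (a good source is Kaggle, e.g. google "kaggle kernel random forest"), merge them, take your dataset's characteristics into account, and optimize over them with some kind of Bayesian optimization algorithm (there are multiple existing libraries for this) that tries to select the best new parameter values.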
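As a rough illustration of searching a wider space and always comparing against the defaults, here is a minimal sketch. It uses scikit-learn's RandomizedSearchCV with scipy.stats distributions, which is plain random search and not the Bayesian optimization recommended above (hyperopt would be one option for that). The synthetic data from make_regression, the chosen ranges, and the small n_estimators, n_iter, and cv values are all assumptions picked to keep the example fast, not recommendations.

```python
from scipy.stats import randint, uniform
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, cross_val_score

# Synthetic stand-in data (assumption: not the Boston data from the question).
X, y = make_regression(n_samples=200, n_features=13, noise=10.0, random_state=1)

# Baseline: cross-validated R^2 of the default hyperparameters.
baseline = RandomForestRegressor(n_estimators=50, random_state=1)
baseline_score = cross_val_score(baseline, X, y, cv=3).mean()

# Distributions instead of fixed lists, so values between grid points can be drawn.
param_distributions = {
    "max_depth": randint(1, 30),                  # integers 1..29
    "max_features": randint(1, X.shape[1] + 1),   # integers 1..13
    "min_samples_split": uniform(0.01, 0.49),     # floats in [0.01, 0.5]
}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(n_estimators=50, random_state=1),
    param_distributions=param_distributions,
    n_iter=10,
    cv=3,
    random_state=1,
)
search.fit(X, y)

print("default CV R^2:", round(baseline_score, 3))
print("tuned CV R^2:  ", round(search.best_score_, 3))
print("best params:   ", search.best_params_)
```

If the tuned score does not beat the baseline, that suggests the search space excludes the region where the defaults live, which is the situation described in the question.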