Tuning a RandomForestRegressor with GridSearchCV to find the best parameters

I want to improve the parameters of this GridSearchCV for a Random Forest Regressor.

def Grid_Search_CV_RFR(X_train, y_train):
    from sklearn.model_selection import GridSearchCV
    from sklearn.ensemble import RandomForestRegressor

    estimator = RandomForestRegressor()
    param_grid = {
        "n_estimators": [10, 20, 30],
        # "auto" was removed in scikit-learn 1.3; on newer versions use 1.0 (all features) instead
        "max_features": ["auto", "sqrt", "log2"],
        "min_samples_split": [2, 4, 8],
        "bootstrap": [True, False],
    }

    # 5-fold cross-validation over every parameter combination, using all CPU cores
    grid = GridSearchCV(estimator, param_grid, n_jobs=-1, cv=5)

    grid.fit(X_train, y_train)

    return grid.best_score_, grid.best_params_

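def RFR(X_train, X_test, y_train, y_test, best_params):
    from sklearn.ensemble import RandomForestRegressor
    # Assumption: r2 in the original post is sklearn.metrics.r2_score under an alias
    from sklearn.metrics import r2_score as r2

    # Refit a forest on the training data using the parameters chosen by the grid search
    estimator = RandomForestRegressor(n_jobs=-1).set_params(**best_params)
    estimator.fit(X_train, y_train)
    y_predict = estimator.predict(X_test)
    print("R2 score:", r2(y_test, y_predict))
    return y_test, y_predict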

def splitter_v2(tab, y_indicator):
    from sklearn.model_selection import train_test_split
    # Assign X and y, removing the y column from X
    # (correlacion is the asker's own helper, not shown in the post)
    X = correlacion(tab, y_indicator)
    y = tab[:, y_indicator]
    # Split X and y into train and test sets
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
    return X_train, X_test, y_train, y_test

I ran these functions 5 times, as follows:

for i in range(5):
    print("Loop: ", i)
    print("--------------")
    X_train, X_test, y_train, y_test = splitter_v2(tabla, 1)
    best_score, best_params = Grid_Search_CV_RFR(X_train, y_train)
    y_test, y_predict = RFR(X_train, X_test, y_train, y_test, best_params)
    print("Best Score:", best_score)
    print("Best params:", best_params)

These are the results:

Loop:  0
--------------
R2 score: 0.900071279487
Best Score: 0.61802821072
Best params: {'max_features': 'log2', 'min_samples_split': 2, 'bootstrap': False, 'n_estimators': 10}
Loop:  1
--------------
R2 score: 0.993462885564
Best Score: 0.671309726329
Best params: {'max_features': 'log2', 'min_samples_split': 4, 'bootstrap': False, 'n_estimators': 10}
Loop:  2
--------------
R2 score: -0.181378339338
Best Score: -30.9012120698
Best params: {'max_features': 'log2', 'min_samples_split': 4, 'bootstrap': True, 'n_estimators': 20}
Loop:  3
--------------
R2 score: 0.750116663033
Best Score: 0.71472985391
Best params: {'max_features': 'log2', 'min_samples_split': 4, 'bootstrap': False, 'n_estimators': 30}
Loop:  4
--------------
R2 score: 0.692075744759
Best Score: 0.715012972471
Best params: {'max_features': 'sqrt', 'min_samples_split': 2, 'bootstrap': True, 'n_estimators': 30}

Why do I get different R2 scores on each run? Is it because I chose cv=5? Or is it because I did not explicitly set random_state=0 in the RandomForestRegressor?


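Random forests are somewhat "random", which is why results may differ. Why do they differ so much? Maybe the data is rubbish, or there are too few trees for the data at hand. - MB-F
Should I increase the values for n_estimators? Maybe [10,20,30,40,50]? Thanks for your help! - ambigus9
Rather try [100,1000,10000]. - MB-F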
Add the random_state parameter and then try again. - Vivek Kumar
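
Following up on the comments above, here is a minimal sketch of what pinning the random sources could look like. It is an illustration under stated assumptions, not the asker's code: the synthetic data from make_regression, the SEED constant, the run_once helper and the [50, 100] grid are all made up for the example. Note that cv=5 on a regressor uses unshuffled K-fold splits, so in the question the run-to-run variation most likely comes from train_test_split and from the forest itself, neither of which has a random_state.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import GridSearchCV, KFold, train_test_split

SEED = 0  # hypothetical constant; any fixed integer makes the run repeatable


def run_once(X, y, seed=SEED):
    """Split, grid-search and evaluate with every random source pinned to `seed`."""
    # 1) The train/test split is random unless random_state is given.
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=seed
    )
    # Illustrative grid only; the comments suggest trying far more trees than 10-30.
    param_grid = {
        "n_estimators": [50, 100],
        "min_samples_split": [2, 4, 8],
    }
    grid = GridSearchCV(
        # 2) The forest draws bootstrap samples and feature subsets at random.
        RandomForestRegressor(random_state=seed),
        param_grid,
        # 3) Shuffled folds are optional; if used, they need a seed as well.
        cv=KFold(n_splits=5, shuffle=True, random_state=seed),
    )
    grid.fit(X_train, y_train)
    # With the default refit=True, grid.predict uses the best model refit on X_train.
    test_r2 = r2_score(y_test, grid.predict(X_test))
    return grid.best_score_, grid.best_params_, test_r2


# Synthetic stand-in for the asker's data (assumption for the sketch).
X, y = make_regression(n_samples=200, n_features=8, noise=10.0, random_state=SEED)
first = run_once(X, y)
second = run_once(X, y)
print(first == second)  # the two runs use identical seeds, so they should match
```

Fixing the seeds only makes a run repeatable. It does not make the scores less sensitive to which rows land in the test set, so a spread as wide as the one in the question may still point to a small or noisy dataset.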
2 Answers

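from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.metrics import accuracy_score, f1_score, recall_score
from sklearn.pipeline import Pipeline

# models, X_train, X_test, y_train and y_test are assumed to be defined already
for model in models:
    m = str(model)
    print(m)
    # Our pipeline
    text_clf = Pipeline([('vect', CountVectorizer()),
                         ('tfidf', TfidfTransformer()),
                         ('clf', model),
    ])
    # Training
    text_clf = text_clf.fit(X_train.to_numpy(), y_train)
    # Prediction
    pred = text_clf.predict(X_test)
    # Metrics (scikit-learn expects y_true first, then y_pred)
    print('accuracy_score', accuracy_score(y_test, pred))
    print('recall_score', recall_score(y_test, pred, average="macro"))
    print('f1_score', f1_score(y_test, pred, average="macro"))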

# lr
C = [1, 10, 25, 50, 100, 150]
solver = ['newton-cg', 'sag', 'saga', 'lbfgs']

# rfc
n_estimators = [50, 100, 200, 300, 500]
max_features = ["auto", "sqrt", "log2"]
max_depth = [3, 6]

# Knc
n_neighbors = [5, 10, 15, 20]
p = [1, 2]
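
The answer does not say how these candidate lists are meant to be used. One plausible reading, sketched below, is to feed them to GridSearchCV over the same pipeline, prefixing each parameter with the step name `clf__`. Everything beyond the lists themselves is an assumption: that lr, rfc and Knc stand for LogisticRegression, RandomForestClassifier and KNeighborsClassifier, the PARAM_GRIDS name, the tune_text_classifier helper, the f1_macro scoring and the toy corpus. "auto" is left out of max_features because recent scikit-learn versions no longer accept it.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

# Candidate values copied from the answer; the "clf__" prefix routes each
# parameter to the pipeline step named "clf".
PARAM_GRIDS = {
    "lr": {
        "clf__C": [1, 10, 25, 50, 100, 150],
        "clf__solver": ["newton-cg", "sag", "saga", "lbfgs"],
    },
    "rfc": {
        "clf__n_estimators": [50, 100, 200, 300, 500],
        "clf__max_features": ["sqrt", "log2"],
        "clf__max_depth": [3, 6],
    },
    "knc": {
        "clf__n_neighbors": [5, 10, 15, 20],
        "clf__p": [1, 2],
    },
}


def tune_text_classifier(model, param_grid, texts, labels, cv=3):
    """Grid-search one classifier inside the answer's text pipeline (hypothetical helper)."""
    pipe = Pipeline([
        ("vect", CountVectorizer()),
        ("tfidf", TfidfTransformer()),
        ("clf", model),
    ])
    search = GridSearchCV(pipe, param_grid, scoring="f1_macro", cv=cv)
    search.fit(texts, labels)
    return search.best_params_, search.best_score_


# Toy corpus, made up for the sketch: 6 positive and 6 negative snippets.
texts = [
    "great product works well", "excellent quality very happy",
    "love it great value", "works perfectly highly recommend",
    "very good and reliable", "happy with this purchase",
    "terrible product broke quickly", "poor quality very disappointed",
    "hate it waste of money", "stopped working do not recommend",
    "very bad and unreliable", "unhappy with this purchase",
]
labels = [1] * 6 + [0] * 6

# A reduced grid keeps the demo fast; pass PARAM_GRIDS["lr"] for the full search.
best_params, best_score = tune_text_classifier(
    LogisticRegression(max_iter=1000), {"clf__C": [1, 10]}, texts, labels
)
print(best_params, best_score)
```

On real data the full grids can be expensive (the rfc grid alone is 20 combinations per fold), and the knc grid needs at least 20 training samples per fold for n_neighbors=20 to be valid.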

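Please add some description to your answer. - William Baker Morrison
Code-only answers are not very helpful. Please describe what you did, as @WilliamBakerMorrison suggests. That way the OP will find your answer easier to understand :) - Diego Ramirez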

from sklearn.ensemble import RandomForestRegressor


def adj_r2(r2, n, p):
    # Adjusted R^2 for n samples and p predictors
    return 1 - ((1 - r2) * (n - 1) / (n - p - 1))


# Xtrain, Xtest (pandas DataFrames), ytrain and ytest are assumed to be defined already.
# Brute-force search over random_state (i), max_depth (j) and n_estimators (k).
for i in range(45, 120):
    for j in range(2, 16):
        for k in range(10, 30):
            rf = RandomForestRegressor(n_estimators=k, random_state=i, max_depth=j)

            rf.fit(Xtrain, ytrain)

            train_adj_r2 = adj_r2(rf.score(Xtrain, ytrain), len(Xtrain), len(Xtrain.columns))
            test_adj_r2 = adj_r2(rf.score(Xtest, ytest), len(Xtest), len(Xtest.columns))
            # Report only fits whose train and test scores agree closely and are above 0.80
            if abs(train_adj_r2 - test_adj_r2) < .01 and train_adj_r2 > .80:
                print(k, i, j)
                print('************** adj R2 Train: {} **************'.format(train_adj_r2))
                print('************** adj R2 Test: {} **************'.format(test_adj_r2))
                print('**************')
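
For reference, the adj_r2 helper in this answer matches the usual adjusted R² formula, written out below. Here n is the number of samples and p the number of predictors; the code takes p from the number of DataFrame columns, which assumes Xtrain and Xtest are pandas DataFrames.

```latex
R^2_{\mathrm{adj}} = 1 - \left(1 - R^2\right)\frac{n - 1}{n - p - 1}
```

The formula requires n > p + 1. Under that condition the fraction is at least 1, so the adjusted value never exceeds the plain R², and the penalty grows as p approaches n.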

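Your answer could be improved with additional supporting information. Please edit to add further details, such as citations or documentation, so that others can confirm that your answer is correct. You can find more information on how to write good answers in the help center. - Community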

Content provided by Stack Overflow.