Recursive feature elimination on random forest using scikit-learn

I am trying to perform recursive feature elimination using scikit-learn and a random forest classifier, with OOB ROC as the method of scoring each subset created during the recursion.

However, when I try to use the RFECV method, I get an error saying AttributeError: 'RandomForestClassifier' object has no attribute 'coef_'

Random forests don't have coefficients per se, but they do have rankings by Gini score. So, I'm wondering how to get around this problem.

Please note that I want to use a method that will explicitly tell me which features from my pandas DataFrame were selected, since I am using recursive feature selection to try to minimize the amount of data I feed into the final classifier.

Here's some example code:

from sklearn import datasets
import pandas as pd
from pandas import Series
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

iris = datasets.load_iris()
x=pd.DataFrame(iris.data, columns=['var1','var2','var3', 'var4'])
y=pd.Series(iris.target, name='target')
rf = RandomForestClassifier(n_estimators=500, min_samples_leaf=5, n_jobs=-1)
rfecv = RFECV(estimator=rf, step=1, cv=10, scoring='ROC', verbose=2)
selector=rfecv.fit(x, y)

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 336, in fit
    ranking_ = rfe.fit(X_train, y_train).ranking_
  File "/Users/bbalin/anaconda/lib/python2.7/site-packages/sklearn/feature_selection/rfe.py", line 148, in fit
    if estimator.coef_.ndim > 1:
AttributeError: 'RandomForestClassifier' object has no attribute 'coef_'

An alternative is to use the feature_importances_ attribute after calling predict or predict_proba; this returns an array of percentages in the order the features were passed. See the online example. - EdChum
I saw that; what I'd like to know is whether there's something that lets me do 10-fold cross-validation and identify the optimal subset of features. - Bryan
I've had to do something similar, but I did it manually by sorting the feature importances and then trimming by 1, 3 or 5 features at a time. I didn't use your approach, so I don't know if it can be done. - EdChum
I'll post my code tomorrow morning; it's on my work PC, so around 8am BST. - EdChum
Exactly. Computationally speaking, running the classifier on all the features I have (30,000+) is very slow, so I have to do a reduction. - Bryan
4 Answers

This is what I did to adapt RandomForestClassifier to work with RFECV:

class RandomForestClassifierWithCoef(RandomForestClassifier):
    def fit(self, *args, **kwargs):
        super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
        # expose feature importances under the name RFECV looks for
        self.coef_ = self.feature_importances_
        return self  # follow the scikit-learn convention of fit returning self

Just use this class in place of RandomForestClassifier if you use 'accuracy' or 'f1' scoring. For 'roc_auc', RFECV complains that the multiclass format is not supported. Changing it to two-class classification with the code below makes 'roc_auc' scoring work. (Using Python 3.4.1 and scikit-learn 0.15.1)

y=(pd.Series(iris.target, name='target')==2).astype(int)

Plugging it into your code:

from sklearn import datasets
import pandas as pd
from pandas import Series
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

class RandomForestClassifierWithCoef(RandomForestClassifier):
    def fit(self, *args, **kwargs):
        super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
        self.coef_ = self.feature_importances_
        return self

iris = datasets.load_iris()
x=pd.DataFrame(iris.data, columns=['var1','var2','var3', 'var4'])
y=(pd.Series(iris.target, name='target')==2).astype(int)
rf = RandomForestClassifierWithCoef(n_estimators=500, min_samples_leaf=5, n_jobs=-1)
rfecv = RFECV(estimator=rf, step=1, cv=2, scoring='roc_auc', verbose=2)
selector=rfecv.fit(x, y)
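Since the question specifically asks which DataFrame columns end up selected: after fitting, RFECV exposes a boolean mask in `support_` that can be applied to `x.columns`. Here is a minimal self-contained sketch of that step; the `n_estimators=100`, `random_state=0` and `cv=2` settings are my own choices to keep it quick, not the original poster's:

```python
import pandas as pd
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

class RandomForestClassifierWithCoef(RandomForestClassifier):
    def fit(self, *args, **kwargs):
        super(RandomForestClassifierWithCoef, self).fit(*args, **kwargs)
        # expose feature importances under the name RFECV looks for
        self.coef_ = self.feature_importances_
        return self

iris = load_iris()
x = pd.DataFrame(iris.data, columns=['var1', 'var2', 'var3', 'var4'])
y = (pd.Series(iris.target, name='target') == 2).astype(int)

rf = RandomForestClassifierWithCoef(n_estimators=100, min_samples_leaf=5,
                                    n_jobs=-1, random_state=0)
rfecv = RFECV(estimator=rf, step=1, cv=2, scoring='roc_auc').fit(x, y)

selected = x.columns[rfecv.support_].tolist()  # names of the kept columns
reduced = x[selected]                          # trimmed DataFrame for the final classifier
print(selected)
```

`reduced` is then the minimal DataFrame you feed to the final classifier.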

Why wasn't a train/test split done here? i.e. X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, stratify=y, random_state=42) - mikelowry
The train/test splitting is done by RFECV. - A.P.

I submitted a request to add coef_ so that RFECV can be used with RandomForestClassifier. The change has since been implemented and will ship in version 0.17. If you want to use it right away, you can pull the latest development version. Details here: https://github.com/scikit-learn/scikit-learn/issues/4945
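On releases that include that change, the coef_ wrapper subclass should no longer be necessary: RFECV can rank a plain RandomForestClassifier by its feature_importances_ directly. A minimal sketch under that assumption (parameter values here are mine, chosen only to keep the run fast):

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import RFECV

X, y = load_iris(return_X_y=True)
rf = RandomForestClassifier(n_estimators=100, random_state=0, n_jobs=-1)

# No coef_ shim needed: RFECV falls back to feature_importances_
rfecv = RFECV(estimator=rf, step=1, cv=5, scoring='accuracy').fit(X, y)
print(rfecv.n_features_, rfecv.ranking_)
```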

Here is my code; I've tidied it up a bit to make it relevant to your task:

features_to_use = fea_cols #  this is a list of features
# empty dataframe
trim_5_df = DataFrame(columns=features_to_use)
run=1
# this will remove the 5 worst features determined by their feature importance computed by the RF classifier
while len(features_to_use)>6:
    print('number of features:%d' % (len(features_to_use)))
    # build the classifier
    clf = RandomForestClassifier(n_estimators=1000, random_state=0, n_jobs=-1)
    # train the classifier
    clf.fit(train[features_to_use], train['OpenStatusMod'].values)
    print('classifier score: %f\n' % clf.score(train[features_to_use], train['OpenStatusMod'].values))
    # predict the class and print the classification report, f1 micro, f1 macro score
    pred = clf.predict(test[features_to_use])
    print(classification_report(test['OpenStatusMod'].values, pred, target_names=status_labels))
    print('micro score: ')
    print(metrics.precision_recall_fscore_support(test['OpenStatusMod'].values, pred, average='micro'))
    print('macro score:\n')
    print(metrics.precision_recall_fscore_support(test['OpenStatusMod'].values, pred, average='macro'))
    # predict the class probabilities
    probs = clf.predict_proba(test[features_to_use])
    # rescale the priors
    new_probs = kf.cap_and_update_priors(priors, probs, private_priors, 0.001)
    # calculate logloss with the rescaled probabilities
    print('log loss: %f\n' % log_loss(test['OpenStatusMod'].values, new_probs))
    row={}
    if hasattr(clf, "feature_importances_"):
        # sort the features by importance
        sorted_idx = np.argsort(clf.feature_importances_)
        # reverse the order so it is descending
        sorted_idx = sorted_idx[::-1]
        # add to dataframe
        row['num_features'] = len(features_to_use)
        row['features_used'] = ','.join(features_to_use)
        # trim the worst 5
        sorted_idx = sorted_idx[: -5]
        # swap the features list with the trimmed features
        temp = features_to_use
        features_to_use=[]
        for feat in sorted_idx:
            features_to_use.append(temp[feat])
        # add the logloss performance
        row['logloss']=[log_loss(test['OpenStatusMod'].values, new_probs)]
    print('')
    # add the row to the dataframe
    trim_5_df = trim_5_df.append(DataFrame(row))
    run += 1

Here I first make a list of features to train on and predict with, then using the feature importances I trim the worst 5 and repeat. During each run I add a row recording the prediction performance so that I can do some analysis later.
The original code was much bigger, as I had different classifiers and datasets to analyse, but I hope you get the picture from the above. What I noticed was that for random forest the number of features I removed on each run affected the performance, so trimming by 1, 3 and 5 features at a time resulted in a different set of best features.
I found that using a GradientBoostingClassifier was more predictable and repeatable, in the sense that the final set of best features was the same whether I trimmed 1, 3 or 5 features at a time.
I hope I'm not teaching you to suck eggs here, as you probably know more than me, but my approach to ablative analysis was to use a fast classifier to get a rough idea of the best sets of features, then use a better-performing classifier, and then start hyperparameter tuning; again doing coarse-grained comparisons first and then fine-grained ones once I got a feel for what the best params were.
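The loop above depends on the answerer's own data and helpers (train, test, OpenStatusMod, kf.cap_and_update_priors), so it won't run as posted. The core "trim the worst k features per round" idea can be sketched self-contained; everything below (iris in place of the real dataset, k=1, the tree counts) is my substitution, not the original setup:

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

iris = load_iris()
X, y = iris.data, iris.target
features = list(iris.feature_names)
k = 1  # features to drop per round (the answer tried 1, 3 and 5)

history = []
while len(features) > 2:
    cols = [iris.feature_names.index(f) for f in features]
    clf = RandomForestClassifier(n_estimators=200, random_state=0, n_jobs=-1)
    clf.fit(X[:, cols], y)
    # record (number of features used, training score) for later analysis
    history.append((len(features), clf.score(X[:, cols], y)))
    # argsort ascending: the first k entries are the least important features
    order = np.argsort(clf.feature_importances_)
    features = [features[i] for i in sorted(order[k:])]
print(history)
```

As in the answer, each round refits the forest on the surviving columns and logs the performance so the feature-count/score trade-off can be inspected afterwards.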

Here is what I ginned up. It's a pretty simple solution, and relies on a custom accuracy metric (called weightedAccuracy) since I'm classifying a highly unbalanced dataset. But it should be easy to make it more extensible if desired.

from sklearn import datasets
import pandas
from sklearn.ensemble import RandomForestClassifier
from sklearn import cross_validation
from sklearn.metrics import confusion_matrix


def get_enhanced_confusion_matrix(actuals, predictions, labels):
    """"enhances confusion_matrix by adding sensivity and specificity metrics"""
    cm = confusion_matrix(actuals, predictions, labels = labels)
    sensitivity = float(cm[1][1]) / float(cm[1][0]+cm[1][1])
    specificity = float(cm[0][0]) / float(cm[0][0]+cm[0][1])
    weightedAccuracy = (sensitivity * 0.9) + (specificity * 0.1)
    return cm, sensitivity, specificity, weightedAccuracy

iris = datasets.load_iris()
x=pandas.DataFrame(iris.data, columns=['var1','var2','var3', 'var4'])
y=pandas.Series(iris.target, name='target')

response, _  = pandas.factorize(y)

xTrain, xTest, yTrain, yTest = cross_validation.train_test_split(x, response, test_size = .25, random_state = 36583)
print "building the first forest"
rf = RandomForestClassifier(n_estimators = 500, min_samples_split = 2, n_jobs = -1, verbose = 1)
rf.fit(xTrain, yTrain)
importances = pandas.DataFrame({'name':x.columns,'imp':rf.feature_importances_
                                }).sort(['imp'], ascending = False).reset_index(drop = True)

cm, sensitivity, specificity, weightedAccuracy = get_enhanced_confusion_matrix(yTest, rf.predict(xTest), [0,1])
numFeatures = len(x.columns)

rfeMatrix = pandas.DataFrame({'numFeatures':[numFeatures], 
                              'weightedAccuracy':[weightedAccuracy], 
                              'sensitivity':[sensitivity], 
                              'specificity':[specificity]})

print "running RFE on  %d features"%numFeatures

for i in range(1,numFeatures,1):
    varsUsed = importances['name'][0:i]
    print "now using %d of %s features"%(len(varsUsed), numFeatures)
    xTrain, xTest, yTrain, yTest = cross_validation.train_test_split(x[varsUsed], response, test_size = .25)
    rf = RandomForestClassifier(n_estimators = 500, min_samples_split = 2,
                                n_jobs = -1, verbose = 1)
    rf.fit(xTrain, yTrain)
    cm, sensitivity, specificity, weightedAccuracy = get_enhanced_confusion_matrix(yTest, rf.predict(xTest), [0,1])
    print("\n"+str(cm))
    print('the sensitivity is %d percent'%(sensitivity * 100))
    print('the specificity is %d percent'%(specificity * 100))
    print('the weighted accuracy is %d percent'%(weightedAccuracy * 100))
    rfeMatrix = rfeMatrix.append(
                                pandas.DataFrame({'numFeatures':[len(varsUsed)], 
                                'weightedAccuracy':[weightedAccuracy], 
                                'sensitivity':[sensitivity], 
                                'specificity':[specificity]}), ignore_index = True)    
print("\n"+str(rfeMatrix))    
maxAccuracy = rfeMatrix.weightedAccuracy.max()
maxAccuracyFeatures = min(rfeMatrix.numFeatures[rfeMatrix.weightedAccuracy == maxAccuracy])
featuresUsed = importances['name'][0:maxAccuracyFeatures].tolist()

print "the final features used are %s"%featuresUsed
