Interpreting feature importance values from a random forest classifier

Question

python numpy machine-learning statistics scikit-learn

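I'm a beginner in machine learning and I'm having trouble interpreting the results of my first program. Here is the setup:

I have a dataset of book reviews. Each book can be tagged with any number of qualifiers from a set of roughly 1,600. People reviewing the books can also tag themselves with these qualifiers to indicate that they like reading things with that tag.

The dataset has one column per qualifier. For each review, a 1 is recorded if the given qualifier is used to tag both the book and the reviewer. If there is no "match" for a particular qualifier in a given review, a 0 is recorded.

There is also a "Score" column that holds an integer from 1 to 5 for each review (the review's "star rating"). My goal is to determine which features matter most for getting a high score.

Here is my current code (https://gist.github.com/souldeux/99f71087c712c48e50b7):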

import numpy as np
import matplotlib.pyplot as plt
from sklearn.ensemble import RandomForestClassifier

def determine_feature_importance(df):
    # Determines the importance of individual features within a dataframe
    # Grab header for all feature values excluding score & ids
    features_list = df.columns.values[4:]
    print("Features List: \n", features_list)

    # Set X equal to all feature values, excluding Score & ID fields
    X = df.values[:, 4:]

    # Set y equal to all Score values (integers 1-5, so cast to int labels)
    y = df.values[:, 0].astype(int)

    # Fit a random forest with near-default parameters to determine feature importance
    print('\nCreating Random Forest Classifier...\n')
    forest = RandomForestClassifier(oob_score=True, n_estimators=10000)
    print('\nFitting Random Forest Classifier...\n')
    forest.fit(X, y)
    feature_importance = forest.feature_importances_
    print(feature_importance)

    # Make importances relative to maximum importance
    print("\nMaximum feature importance is currently: ", feature_importance.max())
    feature_importance = 100.0 * (feature_importance / feature_importance.max())
    print("\nNormalized feature importance: \n", feature_importance)
    print("\nNormalized maximum feature importance: \n", feature_importance.max())
    print("\nTo do: set fi_threshold == max?")
    print("\nTesting: setting fi_threshold == 1")
    fi_threshold = 1

    # Get indices of all features over fi_threshold
    important_idx = np.where(feature_importance > fi_threshold)[0]
    print("\nRetrieved important_idx: ", important_idx)

    # Create a list of all feature names above fi_threshold
    important_features = features_list[important_idx]
    print("\n", important_features.shape[0], "Important features (>", fi_threshold, "% of max importance):\n", important_features)

    # Get sorted indices of important features (descending importance)
    sorted_idx = np.argsort(feature_importance[important_idx])[::-1]
    print("\nFeatures sorted by importance (DESC):\n", important_features[sorted_idx])

    # Generate plot (reverse the order so the most important bar is drawn on top)
    pos = np.arange(sorted_idx.shape[0]) + .5
    plt.subplot(1, 2, 2)
    plt.barh(pos, feature_importance[important_idx][sorted_idx[::-1]], align='center')
    plt.yticks(pos, important_features[sorted_idx[::-1]])
    plt.xlabel('Relative importance')
    plt.ylabel('Variable importance')
    plt.draw()
    plt.show()

    # Reduce X to the important features, ordered by importance (not used further here)
    X = X[:, important_idx][:, sorted_idx]


    return "Feature importance determined"

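I'm successfully generating a plot, but I'm not sure what it means. As I understand it, it shows how strongly each feature influences the score variable. What confuses me is how to tell whether that influence is positive or negative.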

- souldeux

2 Answers

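A random forest can measure the relative importance of any feature in a classification task.

Usually, we measure the loss that would result from losing the true values of that feature: the values of one feature at a time are shuffled, and the resulting drop in prediction accuracy is measured. (This describes permutation importance; the feature_importances_ attribute in scikit-learn is instead computed from the impurity decrease each feature contributes across the trees.)

Because this is done for every new decision tree that is built, and a random forest consists of many trees, the values are reliable.

See this page.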
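
To make the shuffling procedure described above concrete, here is a minimal sketch that computes shuffle-based importances with `sklearn.inspection.permutation_importance` and prints them next to the impurity-based values scikit-learn stores in `feature_importances_`. The synthetic data, the sample sizes, and all variable names are assumptions for illustration, not the asker's dataset.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (an assumption): 3 binary features,
# and only feature 0 determines the label.
rng = np.random.RandomState(0)
X = rng.randint(0, 2, size=(600, 3))
y = X[:, 0]  # label copies feature 0; features 1 and 2 are pure noise

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
forest = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)

# Impurity-based importances: what feature_importances_ reports; they sum to 1.
mdi = forest.feature_importances_

# Permutation importances: mean drop in held-out accuracy when one column is shuffled.
perm = permutation_importance(forest, X_test, y_test, n_repeats=10, random_state=0)

for i in range(X.shape[1]):
    print(f"feature {i}: impurity={mdi[i]:.3f}  permutation={perm.importances_mean[i]:.3f}")
```

Both measures should single out feature 0 here, but neither one says in which direction the feature pushes the prediction.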

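The higher a number returned by forest.feature_importances_, the more important the corresponding feature is in this classification task.

However, that does not help in your case. I'd suggest trying a multinomial Naive Bayes classifier and inspecting feature_log_prob_ after training. That way you can see the probability of each feature given the class, P(x_i|y).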
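
A minimal sketch of that suggestion, assuming a tiny made-up review table (the tag names and scores below are hypothetical): fit `MultinomialNB` and exponentiate `feature_log_prob_` to read off P(x_i|y) per score class.

```python
import numpy as np
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in data (an assumption): rows are reviews, columns are 0/1 tag matches.
tag_names = np.array(["dragons", "romance", "war"])  # hypothetical tags
X = np.array([
    [1, 0, 0],
    [1, 0, 1],
    [1, 1, 0],
    [0, 1, 0],
    [0, 1, 1],
    [0, 0, 1],
])
y = np.array([5, 5, 4, 2, 1, 1])  # star ratings

nb = MultinomialNB().fit(X, y)

# feature_log_prob_ has shape (n_classes, n_features); exponentiate to get P(x_i | y).
probs = np.exp(nb.feature_log_prob_)
for score, row in zip(nb.classes_, probs):
    print(score, dict(zip(tag_names, row.round(2))))
```

Note that each row of `probs` is normalized across features, so a value is the smoothed share of that tag among all tag occurrences in the class, not the fraction of reviews in the class that carry the tag.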

- attollos

- lejlot · Accepted Answer

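In short: you can't. Decision trees (the building blocks of a random forest) do not work this way. With a linear model there is a fairly simple distinction between "positive" and "negative" features, because the only effect a feature can have on the final result is to be added (with a weight). An ensemble of decision trees, however, can learn arbitrarily complex rules for each feature, for example "if the book has a red cover and more than 100 pages, then it gets a high score if it contains dragons" but "if the book has a blue cover and more than 100 pages, then it gets a low score if it contains dragons", and so on.

Feature importance only tells you which features contributed to the decision, not how they contributed, because sometimes a feature works one way and sometimes the other.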
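
The red-cover/blue-cover example can be reproduced on synthetic data. In the sketch below (all column names and the data-generating rule are assumptions for illustration), "dragons" raises the score for one cover colour and lowers it for the other, so the forest ranks it as important even though its average effect is close to zero.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

# Synthetic data (an assumption): "dragons" raises the score for red covers
# and lowers it for blue covers, so its effect has no single sign.
rng = np.random.RandomState(0)
red_cover = rng.randint(0, 2, size=2000)
dragons = rng.randint(0, 2, size=2000)
noise = rng.randint(0, 2, size=2000)
high_score = (red_cover == dragons).astype(int)  # red+dragons or blue+no dragons

X = np.column_stack([red_cover, dragons, noise])
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, high_score)
importances = forest.feature_importances_

# Marginal effect of dragons: difference in high-score rate with vs. without it.
marginal = high_score[dragons == 1].mean() - high_score[dragons == 0].mean()

print("importances (red_cover, dragons, noise):", importances.round(3))
print("marginal effect of dragons:", round(float(marginal), 3))
```

The importance of `dragons` should come out far above that of `noise`, while the marginal difference stays near zero: the feature matters, but "positive or negative" is not a meaningful question for it in this setup.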

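What can you do? You can make an extreme simplification: assume you only care about a feature in total isolation from all the others. Then, once you know which features are important, you can count how many times each of them occurs in each class (in your case, each score). This gives you the distribution

P(gets score X|has feature Y)

which shows you, more or less, whether the feature has a positive or negative impact after marginalization.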
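
As a rough sketch of that counting step, assuming a pandas DataFrame with one score column and one 0/1 column per tag (the table contents, tag names, and the helper `score_distribution` are hypothetical):

```python
import pandas as pd

# Toy stand-in for the review table (an assumption): a score column plus 0/1 tag columns.
df = pd.DataFrame({
    "score":   [5, 5, 4, 2, 1, 1, 3, 5],
    "dragons": [1, 1, 1, 0, 0, 0, 0, 1],  # hypothetical tag
    "romance": [0, 0, 1, 1, 1, 0, 1, 0],  # hypothetical tag
})

def score_distribution(frame, feature):
    """Return P(score = X | feature == 1) as a Series indexed by score."""
    with_feature = frame.loc[frame[feature] == 1, "score"]
    return with_feature.value_counts(normalize=True).sort_index()

for tag in ["dragons", "romance"]:
    dist = score_distribution(df, tag)
    # Compare the conditional mean score with the overall mean to read off a direction.
    shift = (dist.index.to_numpy() * dist.to_numpy()).sum() - df["score"].mean()
    print(tag, dist.to_dict(), f"mean shift = {shift:+.2f}")
```

A positive shift suggests the tag goes along with higher scores on average and a negative shift with lower ones. This is only the marginal picture and ignores interactions between tags.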