Extracting feature importances with feature names from a Sklearn pipeline

Question

python python-3.x scikit-learn pipeline random-forest

5

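I would like to know how to extract feature importances, together with the feature names, from a scikit-learn Random Forest that is used in a pipeline classifier with preprocessing. The existing question "How to extract feature importances from an Sklearn pipeline" only covers extracting the importances themselves. From the brief research I have done, this does not seem to be possible in scikit-learn, but I hope I am wrong. I also found a package called ELI5 (https://eli5.readthedocs.io/en/latest/overview.html) that is supposed to solve this problem for scikit-learn, but it did not solve mine, because the feature names it outputs are x1, x2, etc. rather than the actual feature names. As a workaround, I did all of the preprocessing outside the pipeline, but I would love to know how to do it inside the pipeline. If there is any useful code I can provide, please let me know in the comments.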

- Python Developer

I guess it depends on what you mean by preprocessing... Could you be more specific? - MaximeKan

According to the documentation, a feature_names option is available for some functions. Hope this helps. https://eli5.readthedocs.io/en/latest/_modules/eli5/explain.html?highlight=feature%20names - TavoGLC

Show the code you are using and converting into a pipeline. - Vivek Kumar

1 Answer

- Diego Fernández · Accepted Answer

Here is an example that uses XGBoost to get the feature importances:

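import numpy as np
import pandas as pd
from sklearn import preprocessing
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from xgboost import XGBClassifier

# X, X_train and y_train are assumed to be defined already (pandas DataFrames / labels).

# Numerical features: impute with the median, then scale
num_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', preprocessing.RobustScaler())])

# Categorical features: impute with the most frequent value, then one-hot encode
# (sparse_output needs scikit-learn >= 1.2; older versions call this argument sparse)
cat_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ('onehot', preprocessing.OneHotEncoder(categories='auto',
                                           sparse_output=False,
                                           handle_unknown='ignore'))])

numerical_columns = X.columns[X.dtypes != 'category'].tolist()
categorical_columns = X.columns[X.dtypes == 'category'].tolist()

pipeline_procesado = ColumnTransformer(transformers=[
    ('numerical_preprocessing', num_transformer, numerical_columns),
    ('categorical_preprocessing', cat_transformer, categorical_columns)],
    remainder='passthrough',
    verbose=True)

# Create the classifier
classifier = XGBClassifier()

# Create the overall model as a single pipeline
pipeline = Pipeline([("transform_inputs", pipeline_procesado),
                     ("classifier", classifier)])

pipeline.fit(X_train, y_train)

# Names of the one-hot encoded columns, read from the fitted encoder
# (get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
onehot_columns = (pipeline.named_steps['transform_inputs']
                  .named_transformers_['categorical_preprocessing']
                  .named_steps['onehot']
                  .get_feature_names_out(categorical_columns))

# The ColumnTransformer outputs the numerical columns first, then the one-hot columns
feature_names = numerical_columns + list(onehot_columns)

# You can get the values transformed by the already fitted preprocessing step
X_values = pipeline.named_steps['transform_inputs'].transform(X_train)

df_from_array_pipeline = pd.DataFrame(X_values, columns=feature_names)

feature_importance = pd.Series(
    data=pipeline.named_steps['classifier'].feature_importances_,
    index=np.array(feature_names))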