Sklearn Pipeline: Get feature names after OneHotEncode in ColumnTransformer

87

I want to get the feature names after I fit the pipeline.

categorical_features = ['brand', 'category_name', 'sub_category']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])
    
numeric_features = ['num1', 'num2', 'num3', 'num4']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

Then

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor', GradientBoostingRegressor())])

After fitting with a pandas DataFrame, I can get the feature importances from

clf.steps[1][1].feature_importances_

I then tried clf.steps[0][1].get_feature_names(), but got an error:

AttributeError: Transformer num (type Pipeline) does not provide get_feature_names.

How can I get the feature names from this?

5 Answers

87
You can access the feature names with the following snippet:


clf.named_steps['preprocessor'].transformers_[1][1]\
   .named_steps['onehot'].get_feature_names(categorical_features)

With sklearn >= 0.21, we can make it even simpler:

clf['preprocessor'].transformers_[1][1]\
    ['onehot'].get_feature_names(categorical_features)

Reproducible example:

import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LinearRegression

df = pd.DataFrame({'brand': ['aaaa', 'asdfasdf', 'sadfds', 'NaN'],
                   'category': ['asdf', 'asfa', 'asdfas', 'as'],
                   'num1': [1, 1, 0, 0],
                   'target': [0.2, 0.11, 1.34, 1.123]})

numeric_features = ['num1']
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())])

categorical_features = ['brand', 'category']
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore'))])

preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)])

clf = Pipeline(steps=[('preprocessor', preprocessor),
                      ('regressor',  LinearRegression())])
# Use the columns= keyword: the positional axis argument was removed in pandas 2.0
clf.fit(df.drop(columns='target'), df['target'])

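# On scikit-learn >= 1.0 call get_feature_names_out instead;
# get_feature_names was deprecated in 1.0 and removed in 1.2.
clf.named_steps['preprocessor'].transformers_[1][1]\
   .named_steps['onehot'].get_feature_names(categorical_features)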

# ['brand_NaN' 'brand_aaaa' 'brand_asdfasdf' 'brand_sadfds' 'category_as'
#  'category_asdf' 'category_asdfas' 'category_asfa']
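
To line these names up with a model's feature_importances_, one option is to concatenate the names in the order the transformers were declared (numeric columns first, then the one-hot columns). The following is a minimal sketch under that assumption. The toy data mirrors the example above, GradientBoostingRegressor stands in for the regressor from the question, and the getattr fallback is only there to cover both get_feature_names_out (scikit-learn >= 1.0) and the older get_feature_names. It also assumes the numeric pipeline keeps one output column per input column, which holds for an imputer followed by a scaler unless a column is entirely missing.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

df = pd.DataFrame({'brand': ['aaaa', 'asdfasdf', 'sadfds', 'NaN'],
                   'category': ['asdf', 'asfa', 'asdfas', 'as'],
                   'num1': [1, 1, 0, 0],
                   'target': [0.2, 0.11, 1.34, 1.123]})

numeric_features = ['num1']
categorical_features = ['brand', 'category']

preprocessor = ColumnTransformer(transformers=[
    ('num', Pipeline([('imputer', SimpleImputer(strategy='median')),
                      ('scaler', StandardScaler())]), numeric_features),
    ('cat', Pipeline([('imputer', SimpleImputer(strategy='constant',
                                                fill_value='missing')),
                      ('onehot', OneHotEncoder(handle_unknown='ignore'))]),
     categorical_features)])

clf = Pipeline([('preprocessor', preprocessor),
                ('regressor', GradientBoostingRegressor(random_state=0))])
clf.fit(df.drop(columns='target'), df['target'])

# Fetch the fitted encoder by name rather than by position.
onehot = (clf.named_steps['preprocessor']
             .named_transformers_['cat'].named_steps['onehot'])

# Prefer the newer method, fall back to the old one on scikit-learn < 1.0.
get_names = (getattr(onehot, 'get_feature_names_out', None)
             or onehot.get_feature_names)

# Same order as the transformers list: 'num' first, then 'cat'.
all_names = numeric_features + list(get_names(categorical_features))

importances = pd.Series(clf.named_steps['regressor'].feature_importances_,
                        index=all_names).sort_values(ascending=False)
print(importances)
```

Looking the encoder up through named_transformers_['cat'] avoids depending on the numeric index of the transformer, which changes if the transformers list is reordered.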

12
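How do you correctly match the feature importances with all the feature names (numeric + categorical)? Especially when using OHE(handle_unknown='ignore'). - Paul
@Paul In my case, I combined df.columns with feature_names, removed categorical_features from the list of names, and then combined the result with feature_importances_. - ResidentSleeper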
27
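Exactly, but how do you make sure they are combined in the right order, so that they match the feature importance vector? It does not seem straightforward; an elegant code snippet would be appreciated. - Paul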
5
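The order of combination will be the same as the order of the pipeline steps, so we can work out the exact order of the features. The answer at https://dev59.com/H1MH5IYBdhLWcg3w7Vzx#57534118 might be useful to you. - Venkatachalam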
2
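So StandardScaler() has no get_feature_names(). Do we need to combine the names of the numeric fields and the one-hot encoded fields afterwards? Is there another API that gives the complete list of feature names? - Ozkan Serttas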

37

Scikit-Learn 1.0 now has new functionality for keeping track of feature names.

import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# SimpleImputer does not have get_feature_names_out, so we need to add it
# manually. This should be fixed in Scikit-Learn 1.0.1: all transformers will
# have this method.
SimpleImputer.get_feature_names_out = (lambda self, names=None:
                                       self.feature_names_in_)

num_pipeline = make_pipeline(SimpleImputer(), StandardScaler())
transformer = make_column_transformer(
    (num_pipeline, ["age", "height"]),
    (OneHotEncoder(), ["city"]))
pipeline = make_pipeline(transformer, LinearRegression())



df = pd.DataFrame({"city": ["Rabat", "Tokyo", "Paris", "Auckland"],
                   "age": [32, 65, 18, 24],
                   "height": [172, 163, 169, 190],
                   "weight": [65, 62, 54, 95]},
                  index=["Alice", "Bunji", "Cécile", "Dave"])



pipeline.fit(df, df["weight"])


## get pipeline feature names
pipeline[:-1].get_feature_names_out()


## specify feature names as your columns
pd.DataFrame(pipeline[:-1].transform(df),
             columns=pipeline[:-1].get_feature_names_out(),
             index=df.index)

1
For me this results in "Estimator encoder does not provide get_feature_names_out". Did you mean to call pipeline[:-1].get_feature_names_out()? - Andi Anderle
2
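@AndiAnderle get_feature_names_out is not implemented on all estimators, see https://github.com/scikit-learn/scikit-learn/issues/21308. I am using pipeline[:-1] to select only the column transformer step. - ZAKARYA ROUZKI
That is exactly what I did (pipeline[0].get_feature_names_out()). pipeline[0] is my ColumnTransformer, which contains an OrdinalEncoder and a SimpleImputer. It still shows the error above. - Andi Anderle
Did you solve this? If so, please share, I am interested: I am trying to do the same thing. I am using an OrdinalEncoder as well as a pipeline with an imputer and an ordinal encoder, and I need to keep track of the feature names after fitting. - Just trying
The SimpleImputer class has no get_feature_names_out method unless you are using a nightly build of sklearn. - chris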

10

Edit: actually, Peter's comment is answered in the ColumnTransformer documentation:

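The order of the columns in the transformed feature matrix follows the order of how the columns are specified in the transformers list. Columns of the original feature matrix that are not specified are dropped from the resulting transformed feature matrix, unless specified in the passthrough keyword. Those columns specified with passthrough are added at the right to the output of the transformers.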


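To complete Venkatachalam's answer with what Paul asked in his comment: the order of the feature names returned by the ColumnTransformer.get_feature_names() method depends on the order in which the steps variable is declared when the ColumnTransformer is instantiated.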

I could not find any documentation, so I just played with the toy example below, which let me understand the logic.

import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.preprocessing import RobustScaler

class testEstimator(BaseEstimator,TransformerMixin):
    def __init__(self,string):
        self.string = string

    def fit(self,X,y=None):
        return self

    def transform(self,X):
        return np.full(X.shape, self.string).reshape(-1,1)

    def get_feature_names(self):
        return self.string

transformers = [('first_transformer',testEstimator('A'),1), ('second_transformer',testEstimator('B'),0)]
column_transformer = ColumnTransformer(transformers)
steps = [('scaler',RobustScaler()), ('transformer', column_transformer)]
pipeline = Pipeline(steps)

dt_test = np.zeros((1000,2))
pipeline.fit_transform(dt_test)

for name,step in pipeline.named_steps.items():
    if hasattr(step, 'get_feature_names'):
        print(step.get_feature_names())
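
To make the example more representative, I added a RobustScaler and nested the ColumnTransformer in a Pipeline. By the way, this is my version of Venkatachalam's way of getting the feature names by looping over the steps. You can unpack the names into a slightly more usable variable with a list comprehension:

[i for k, v in pipeline.named_steps.items() if hasattr(v,'get_feature_names') for i in v.get_feature_names()]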

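Play around with dt_test and the estimators to get a feel for how the feature names are built and how they are concatenated in get_feature_names(). Here is another example, with a transformer that outputs 2 columns from its input column: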

class testEstimator3(BaseEstimator,TransformerMixin):
    def __init__(self,string):
        self.string = string

    def fit(self,X,y=None):
        self.unique = np.unique(X)[0]
        return self

    def transform(self,X):
        return np.concatenate((X.reshape(-1,1), np.full(X.shape,self.string).reshape(-1,1)), axis = 1)

    def get_feature_names(self):
        return list((self.unique,self.string))

dt_test2 = np.concatenate((np.full((1000,1),'A'),np.full((1000,1),'B')), axis = 1)

transformers = [('first_transformer',testEstimator3('A'),1), ('second_transformer',testEstimator3('B'),0)]
column_transformer = ColumnTransformer(transformers)
steps = [('transformer', column_transformer)]
pipeline = Pipeline(steps)

pipeline.fit_transform(dt_test2)
for step in pipeline.steps:
    if hasattr(step[1], 'get_feature_names'):
        print(step[1].get_feature_names())

5
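If you are looking for how to access the column names after successive pipelines, the last one being a ColumnTransformer, you can do so by following this example.

In full_pipeline there are two pipelines, gender and relevent_experience:
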
full_pipeline = ColumnTransformer([
    ("gender", gender_encoder, ["gender"]),
    ("relevent_experience", relevent_experience_encoder, ["relevent_experience"]),
])

The gender pipeline looks like this:

gender_encoder = Pipeline([
    ('imputer', SimpleImputer(strategy='most_frequent')),
    ("cat", OneHotEncoder())
])

After fitting full_pipeline, you can access the column names with the following snippet:

full_pipeline.transformers_[0][1][1].get_feature_names_out() 

In my case the output was: array(['x0_Female', 'x0_Male', 'x0_Other'], dtype=object)


2
This does not work for me; I get AttributeError: 'ColumnTransformer' object has no attribute 'transformers_'. - Maths12

1

You were very close to the right answer. After building your pipeline:

clf = Pipeline(steps=[('preprocessor', preprocessor),
                  ('regressor', DecisionTreeRegressor())])

fit clf with your features and target variables, like so:

clf.fit(features, target)

Then you should be able to access the feature names of the OneHotEncoder:

clf.named_steps['preprocessor'].transformers_[1][1].named_steps['onehot'].get_feature_names_out()
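
The fit call matters because transformers_ (with the trailing underscore) only exists on a fitted ColumnTransformer, and it holds fitted clones rather than the objects you passed in. A likely cause of an AttributeError mentioning transformers_ is therefore calling it before fit. The sketch below illustrates the difference on invented data:

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({"brand": ["aaaa", "asdfasdf", "sadfds"]})

ct = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["brand"])])

# Before fit: only the declared `transformers` list exists.
before = hasattr(ct, "transformers_")

ct.fit(df)

# After fit: `transformers_` holds fitted clones of the declared estimators.
after = hasattr(ct, "transformers_")

declared_encoder = ct.transformers[0][1]   # the object passed in, unfitted
fitted_encoder = ct.transformers_[0][1]    # the clone that was fitted

print(before, after)
print(hasattr(declared_encoder, "categories_"),
      hasattr(fitted_encoder, "categories_"))
```

This is also why feature names have to be read from transformers_ or named_transformers_ rather than from the encoder variable that was used to build the ColumnTransformer.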
