Getting feature names from an sklearn pipeline

7
I want to match the output np array with the features so that I can build a new pandas DataFrame.
Here is my pipeline:
import numpy as np
from sklearn.compose import make_column_transformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

# continuous_cols and categorical_cols are lists of column names defined elsewhere

# Categorical pipeline
categorical_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(missing_values=np.nan, strategy='most_frequent')),
        ('Ordinal encoding', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
    ]
)
# Continuous pipeline
continuous_preprocessing = Pipeline(
    [
        ('Imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
        ('Scaling', StandardScaler()),
    ]
)
# Creating preprocessing pipeline
preprocessing = make_column_transformer(
    (continuous_preprocessing, continuous_cols),
    (categorical_preprocessing, categorical_cols),
)
# Final pipeline
pipeline = Pipeline(
    [('Preprocessing', preprocessing)]
)

This is how I call it:

X_train = pipeline.fit_transform(X_train)
X_val = pipeline.transform(X_val)
X_test = pipeline.transform(X_test)

When I try to get the feature names, this is what I get:

pipeline['Preprocessing'].transformers_[1][1]['Ordinal encoding'].get_feature_names()

Output:

AttributeError: 'OrdinalEncoder' object has no attribute 'get_feature_names'

There is a similar Stack Overflow question: Sklearn Pipeline: Get feature names after OneHotEncode In ColumnTransformer


Which version of sklearn are you using? - user17242583
https://dev59.com/ZrPma4cB1Zd3GeqPjwjj#55524004 - Scott Boston
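@ScottBoston I'm not using a count vectorizer; I don't even have text data. I know some sklearn methods have a get feature names, but how can I be sure that the columns and the column names of the resulting dataset will match? - Kevin
@richardec The latest version. - Kevin
1 Answer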

6
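As of today, some transformers expose a .get_feature_names_out() method and others do not. This causes problems, for instance, whenever you want to build a well-formatted DataFrame from the np.array returned by a Pipeline or ColumnTransformer instance. (As far as I know, .get_feature_names() was deprecated in recent versions in favor of .get_feature_names_out().)
As for the transformers you are using, StandardScaler belongs to the first category, which exposes the method, while SimpleImputer and OrdinalEncoder both belong to the second. The docs list the exposed methods in each class's "Methods" section. As said, this causes problems when doing something like pd.DataFrame(pipeline.fit_transform(X_train), columns=pipeline.get_feature_names_out()) on your pipeline. It would fail in the same way on your categorical_preprocessing and continuous_preprocessing pipelines, because in both cases at least one transformer lacks the method, and on the preprocessing ColumnTransformer instance.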
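There is ongoing work in sklearn to add the .get_feature_names_out() method to all estimators. It is tracked in GitHub issue #21308, which branches into many PRs, each one dealing with a specific module. For instance, issue #21079 covers the preprocessing module and will enrich OrdinalEncoder among others, and issue #21078 covers the impute module and will enrich SimpleImputer. I guess the changes will be made available in a new release once all the referenced PRs are merged.
In the meantime, in my opinion you should go with a custom solution that fits your needs. Here is a simple example, which does not necessarily match your case but is meant to show one (possible) way of proceeding: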
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import OrdinalEncoder, StandardScaler
from sklearn.compose import make_column_transformer, make_column_selector

X = pd.DataFrame({'city': ['London', 'London', 'Paris', 'Sallisaw', ''],
                  'title': ['His Last Bow', 'How Watson Learned the Trick', 'A Moveable Feast', 'The Grapes of Wrath', 'The Jungle'],
                  'expert_rating': [5, 3, 4, 5, np.nan],  # np.NaN was removed in NumPy 2.0
                  'user_rating': [4, 5, 4, np.nan, 3]})
X


num_cols = X.select_dtypes(include=np.number).columns.tolist()
cat_cols = X.select_dtypes(exclude=np.number).columns.tolist()

# Categorical pipeline
categorical_preprocessing = Pipeline(
[
    ('Imputation', SimpleImputer(missing_values='', strategy='most_frequent')),
    ('Ordinal encoding', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
]
)
# Continuous pipeline
continuous_preprocessing = Pipeline(
[
    ('Imputation', SimpleImputer(missing_values=np.nan, strategy='mean')),
    ('Scaling', StandardScaler())
]
)
# Creating preprocessing pipeline
preprocessing = make_column_transformer(
    (continuous_preprocessing, num_cols),
    (categorical_preprocessing, cat_cols),
)

# Final pipeline
pipeline = Pipeline(
    [('Preprocessing', preprocessing)]
)

X_trans = pipeline.fit_transform(X)

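# The ColumnTransformer emits its output columns in the order the transformers
# were listed (numeric first, then categorical), and every step used here maps
# one input column to exactly one output column, so the names line up.
pd.DataFrame(X_trans, columns=num_cols + cat_cols)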



1
Thanks, I did something similar and simply assigned the column names to the newly created df. - Kevin
