Scikit-learn pipeline - How to apply different transformations to different columns

Question


24

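I am fairly new to pipelines in sklearn and have run into this problem: my dataset has a mix of text and numeric columns, i.e. some columns contain only text and the rest contain integers (or floats).

I would like to know whether it is possible to build a pipeline in which I can, for example, call LabelEncoder() on the text features and MinMaxScaler() on the numeric columns. Most of the examples I have seen online apply LabelEncoder() to the entire dataset rather than to selected columns. Is this possible? If so, any pointers would be greatly appreciated.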

- Javiar Sandra

3 Answers

20

Since v0.20, you can use ColumnTransformer to accomplish this.

- zachguo

3

Please give an example. - lightbox142
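
For the case described in the question, a minimal sketch of the ColumnTransformer approach could look like the following. The toy DataFrame and its column names are assumptions made for illustration. OrdinalEncoder is used in place of LabelEncoder, because LabelEncoder is designed for a 1-D target rather than for feature columns.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import MinMaxScaler, OrdinalEncoder

# Hypothetical toy data: two text columns and two numeric columns.
df = pd.DataFrame({
    "name": ["ann", "bob", "cat", "ann"],
    "fruit": ["apple", "pear", "apple", "kiwi"],
    "height": [150.0, 180.0, 165.0, 150.0],
    "age": [20, 40, 30, 20],
})

# Each entry is (step name, transformer, columns the transformer is applied to).
preprocess = ColumnTransformer(transformers=[
    ("text", OrdinalEncoder(), ["name", "fruit"]),   # categories -> integer codes
    ("num", MinMaxScaler(), ["height", "age"]),      # rescale each column to [0, 1]
])

# The output columns follow the order of the transformers list.
X = preprocess.fit_transform(df)
print(X.shape)
```

Columns that are not listed in any transformer are dropped by default; passing remainder="passthrough" to ColumnTransformer keeps them unchanged.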

10

Here is an example that should help you understand ColumnTransformer:

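from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler
# KerasClassifier is a Keras scikit-learn wrapper; MyW2VTransformer, create_model
# and COUNTIES_OF_INTEREST are not defined in this snippet (see the linked code).

# FOREGOING TRANSFORMATIONS ON 'data' ...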
# filter data
data = data[data['county'].isin(COUNTIES_OF_INTEREST)]

# define the feature encoding of the data
impute_and_one_hot_encode = Pipeline([
        ('impute', SimpleImputer(strategy='most_frequent')),
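        # note: scikit-learn 1.2 renamed `sparse` to `sparse_output`; newer releases need sparse_output=False
        ('encode', OneHotEncoder(sparse=False, handle_unknown='ignore'))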
    ])

featurisation = ColumnTransformer(transformers=[
    ("impute_and_one_hot_encode", impute_and_one_hot_encode, ['smoker', 'county', 'race']),
    ('word2vec', MyW2VTransformer(min_count=2), ['last_name']),
    ('numeric', StandardScaler(), ['num_children', 'income'])
])

# define the training pipeline for the model
neural_net = KerasClassifier(build_fn=create_model, epochs=10, batch_size=1, verbose=0, input_dim=109)
pipeline = Pipeline([
    ('features', featurisation),
    ('learner', neural_net)])

# train-test split
train_data, test_data = train_test_split(data, random_state=0)
# model training
model = pipeline.fit(train_data, train_data['label'])

You can find the full code here: https://github.com/stefan-grafberger/mlinspect/blob/19ca0d6ae8672249891835190c9e2d9d3c14f28f/example_pipelines/healthcare/healthcare.py

- LC117
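
The snippet above depends on helpers that are not shown here, so a self-contained variant may be easier to experiment with. The sketch below keeps the same structure (a ColumnTransformer feeding a learner inside a Pipeline) but makes several assumptions: the toy data is invented, the word2vec step is left out, and LogisticRegression stands in for the Keras network. The train/test split is skipped because the toy data is so small.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical toy data reusing some column names from the example above.
data = pd.DataFrame({
    "smoker": ["yes", "no", np.nan, "no", "yes", "no"],
    "county": ["a", "b", "a", "b", "a", "b"],
    "num_children": [0, 2, 1, 3, 0, 2],
    "income": [30.0, 55.0, 42.0, 61.0, 28.0, 50.0],
    "label": [1, 0, 1, 0, 1, 0],
})

# Categorical columns: fill missing values, then one-hot encode.
impute_and_one_hot_encode = Pipeline([
    ("impute", SimpleImputer(strategy="most_frequent")),
    ("encode", OneHotEncoder(handle_unknown="ignore")),
])

# Only the listed columns are used; 'label' is dropped by default.
featurisation = ColumnTransformer(transformers=[
    ("categorical", impute_and_one_hot_encode, ["smoker", "county"]),
    ("numeric", StandardScaler(), ["num_children", "income"]),
])

# LogisticRegression is a stand-in for the Keras classifier of the original.
pipeline = Pipeline([
    ("features", featurisation),
    ("learner", LogisticRegression()),
])

model = pipeline.fit(data, data["label"])
predictions = model.predict(data)
print(predictions)
```

Because the preprocessing lives inside the pipeline, calling model.predict on new rows applies the same imputation, encoding, and scaling that were learned during fit.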


- maxymoo · Accepted Answer

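I usually do this with a FeatureUnion, using a FunctionTransformer to pull out the relevant columns.

Important notes:

You have to define your functions with def, because, annoyingly, you can't use lambda or partial in a FunctionTransformer if you want to pickle your model.
You need to initialize FunctionTransformer with validate=False.

Something like this: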

from sklearn.pipeline import make_union, make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler, OrdinalEncoder

def get_text_cols(df):
    return df[['name', 'fruit']]

def get_num_cols(df):
    return df[['height', 'age']]

# OrdinalEncoder replaces the LabelEncoder of the original snippet: LabelEncoder
# encodes a 1-D target via fit(y), so it fails when a Pipeline calls fit(X, y).
vec = make_union(
    make_pipeline(FunctionTransformer(get_text_cols, validate=False), OrdinalEncoder()),
    make_pipeline(FunctionTransformer(get_num_cols, validate=False), MinMaxScaler()),
)
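
The pickling note above can be checked with a small experiment. This is only a sketch: the toy DataFrame and the can_pickle helper are hypothetical, introduced here for illustration, and only the numeric branch is shown. When run as a script, the pipeline built from a named function should pickle, while the one built from a lambda should not.

```python
import pickle

import pandas as pd
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import FunctionTransformer, MinMaxScaler


def get_num_cols(df):
    # Named module-level function: pickle stores it by reference.
    return df[["height", "age"]]


def can_pickle(obj):
    """Hypothetical helper: return True if pickle.dumps succeeds on obj."""
    try:
        pickle.dumps(obj)
        return True
    except (pickle.PicklingError, AttributeError, TypeError):
        return False


# Hypothetical toy data.
df = pd.DataFrame({"height": [150.0, 180.0, 165.0], "age": [20, 40, 30]})

with_def = make_pipeline(
    FunctionTransformer(get_num_cols, validate=False), MinMaxScaler()
)
with_lambda = make_pipeline(
    FunctionTransformer(lambda d: d[["height", "age"]], validate=False), MinMaxScaler()
)

# Both pipelines fit and transform fine; they differ only when serialized.
scaled = with_def.fit_transform(df)
with_lambda.fit(df)

print("def-based pipeline pickles:", can_pickle(with_def))
print("lambda-based pipeline pickles:", can_pickle(with_lambda))
```

A lambda has no importable name, so pickle cannot look it up again when loading, which is why the named function is needed for models that will be saved.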