scikit-learn pipeline

Question

python, scikit-learn, pipeline, feature-selection

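Each sample in my (iid) dataset looks like this:
x = [a_1,a_2...a_N,b_1,b_2...b_M]
I also have the label of each sample (this is supervised learning). The a features are very sparse (a bag-of-words representation), while the b features are dense (integers, about 45 of them).

I am using scikit-learn and want to use GridSearchCV with a pipeline.

The question: is it possible to use one CountVectorizer on the type a features and another CountVectorizer on the type b features?

What I want can be thought of as: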

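from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline

pipeline = Pipeline([
    ('vect1', CountVectorizer()),  # should work only on features [0, N-1]
    ('vect2', CountVectorizer()),  # should work only on features [N, N+M-1]
    ('clf', SGDClassifier()),      # should use all features to classify
])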

parameters = {
    'vect1__max_df': (0.5, 0.75, 1.0),       # type a features only
    'vect1__ngram_range': ((1, 1), (1, 2)),  # type a features only
    'vect2__max_df': (0.5, 0.75, 1.0),       # type b features only
    'vect2__ngram_range': ((1, 1), (1, 2)),  # type b features only
    'clf__alpha': (0.00001, 0.000001),
    'clf__penalty': ('l2', 'elasticnet'),
    'clf__max_iter': (10, 50, 80),  # called n_iter in older scikit-learn releases
}

grid_search = GridSearchCV(pipeline, parameters, n_jobs=-1, verbose=1)
grid_search.fit(X, y)

Is this possible?

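@Andreas Mueller presented a nice idea. However, I also want to keep the original, non-selected features, so I cannot know the column indices for each stage of the pipeline up front (before the pipeline starts).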

For example, if I set CountVectorizer(max_df=0.75), it may drop some terms, and the original column indices will change.
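The effect described above can be seen in a minimal sketch. The four-document corpus and the variable names are made up for illustration and do not come from the question's data; the sketch only shows that lowering max_df can remove a term and shift the column index of the remaining ones.

```python
from sklearn.feature_extraction.text import CountVectorizer

# Toy corpus (illustrative only): "apple" appears in every document.
docs = ["apple banana", "apple cherry", "apple banana cherry", "apple date"]

# Default max_df=1.0 keeps every term.
full = CountVectorizer().fit(docs)

# max_df=0.75 drops terms that occur in more than 75% of the documents.
pruned = CountVectorizer(max_df=0.75).fit(docs)

# vocabulary_ maps each kept term to its output column index.
print(sorted(full.vocabulary_.items(), key=lambda kv: kv[1]))
print(sorted(pruned.vocabulary_.items(), key=lambda kv: kv[1]))
```

Because the fitted vocabulary depends on max_df, column positions after the vectorizer cannot be fixed before fitting.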

Thanks

- omerbp

1 Answer

Content provided by Stack Overflow.

- Andreas Mueller · Accepted Answer

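Unfortunately, this is currently not as nice as it could be. You need to use FeatureUnion to concatenate the two kinds of features, and the transformer in each branch needs to select its features and transform them. One way to do that is to build a pipeline from a transformer that selects the columns (you need to write that yourself) and the CountVectorizer. There is an example that does something similar here. That example actually separates the features into different values in a dictionary, but you don't need to do that. Also have a look at the related issue on selecting columns, which contains code for the transformer you need.

With the current code, it would look something like this: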

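from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline, make_union

# FeatureSelector is the column-selecting transformer you write yourself
make_pipeline(
    make_union(
        make_pipeline(FeatureSelector(some_columns), CountVectorizer()),
        make_pipeline(FeatureSelector(other_columns), CountVectorizer())),
    SGDClassifier())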
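As a rough illustration of the column-selecting transformer mentioned above, here is a minimal sketch. The FeatureSelector implementation is hypothetical (it is not part of scikit-learn), and the toy data is made up. Because CountVectorizer expects raw text, the sketch assumes each selected column holds one text field per sample, which may not match the numeric features in the question.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.linear_model import SGDClassifier
from sklearn.pipeline import make_pipeline, make_union


class FeatureSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: returns one column of a 2-D array as a 1-D array."""

    def __init__(self, column):
        self.column = column

    def fit(self, X, y=None):
        return self  # nothing to learn

    def transform(self, X):
        return np.asarray(X, dtype=object)[:, self.column]


# Toy data (assumption): two raw-text fields per sample.
X = np.array([
    ["red apple", "sweet fruit"],
    ["green apple", "sour fruit"],
    ["red car", "fast vehicle"],
    ["green car", "slow vehicle"],
], dtype=object)
y = np.array([0, 0, 1, 1])

model = make_pipeline(
    make_union(
        make_pipeline(FeatureSelector(0), CountVectorizer()),
        make_pipeline(FeatureSelector(1), CountVectorizer()),
    ),
    SGDClassifier(random_state=0),
)
model.fit(X, y)

# The union concatenates the two vectorizers' outputs column-wise.
features = model.named_steps["featureunion"].transform(X)
print(features.shape)

# Nested parameter names that a GridSearchCV parameter grid could refer to.
print([name for name in model.get_params() if name.endswith("max_df")])
```

Since each branch carries its own vectorizer, a parameter grid can tune them separately through the nested names printed at the end, and neither branch depends on the other's output column indices.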