Getting feature names from FeatureUnion + Pipeline

Question

python-3.x scikit-learn nlp feature-extraction

19

I am using FeatureUnion to combine features found in an event's title and description:

union = FeatureUnion(
    transformer_list=[
        # Pipeline for pulling features from the event's title
        ('title', Pipeline([
            ('selector', TextSelector(key='title')),
            ('count', CountVectorizer(stop_words='english')),
        ])),

        # Pipeline for standard bag-of-words model for description
        ('description', Pipeline([
            ('selector', TextSelector(key='description_snippet')),
            ('count', TfidfVectorizer(stop_words='english')),
        ])),
    ],

    transformer_weights={
        'title': 1.0,
        'description': 0.2,
    },
)

However, calling union.get_feature_names() raises an error: "Transformer title (type Pipeline) does not provide get_feature_names." I would like to see some of the features generated by my different vectorizers. How can I do this?

- Huey

Do you get any error when calling union.get_feature_names()? - Vivek Kumar

1

This is the error message: "Transformer title (type Pipeline) does not provide get_feature_names." - Huey

You might want to look at this answer to a similar question: https://dev59.com/aF4b5IYBdhLWcg3wqDWj#58359509 - Guillaume Chevalier

2 Answers

5

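You can reach the individual vectorizers as nested steps of the union, like this (thanks to edesz):

pipevect = dict(pipeline.named_steps['union'].transformer_list).get('title').named_steps['count']

This gives you the fitted vectorizer instance (the CountVectorizer for the 'title' branch in the question; use 'description' to get the TfidfVectorizer), which you can pass to another function:

Show_most_informative_features(pipevect,
       pipeline.named_steps['classifier'], n=MostIF)
# or print the feature names directly
# (on scikit-learn >= 1.0, use get_feature_names_out() instead)
print(pipevect.get_feature_names())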
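
A minimal, self-contained sketch of this lookup may make it easier to try out. It rests on several assumptions: the TextSelector below is a hypothetical stand-in for the one in the question (which is not shown), the toy events are made up, and the helper vectorizer_feature_names is introduced here only to work across scikit-learn releases that expose either get_feature_names or get_feature_names_out.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class TextSelector(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for the question's TextSelector: picks one field."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


def vectorizer_feature_names(vectorizer):
    """Illustrative helper: return feature names on old and new releases."""
    if hasattr(vectorizer, "get_feature_names_out"):  # scikit-learn >= 1.0
        return list(vectorizer.get_feature_names_out())
    return list(vectorizer.get_feature_names())  # older releases


# Toy data, made up for illustration, keyed like the question's fields.
events = {
    "title": ["Jazz concert downtown", "Python data meetup"],
    "description_snippet": [
        "Live jazz music all night",
        "Talks about data pipelines",
    ],
}

union = FeatureUnion(
    transformer_list=[
        ("title", Pipeline([
            ("selector", TextSelector(key="title")),
            ("count", CountVectorizer(stop_words="english")),
        ])),
        ("description", Pipeline([
            ("selector", TextSelector(key="description_snippet")),
            ("count", TfidfVectorizer(stop_words="english")),
        ])),
    ],
    transformer_weights={"title": 1.0, "description": 0.2},
)
union.fit(events)

# Look up each nested pipeline by name, then its vectorizer step.
branches = dict(union.transformer_list)
title_names = vectorizer_feature_names(branches["title"].named_steps["count"])
description_names = vectorizer_feature_names(
    branches["description"].named_steps["count"]
)
print(title_names)
print(description_names)
```

Here the union is fitted directly; if it sits inside an outer pipeline under the step name 'union', as in the line above, it would be reached through pipeline.named_steps['union'] first.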

- Max Kleiner


- hamel · Accepted Answer

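This is because you are using a custom transformer called TextSelector. Did you implement get_feature_names in TextSelector?

You will have to implement this method in your custom transformer if you want this to work.

Here is a concrete example: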

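# Note: load_boston and get_feature_names were both removed in scikit-learn 1.2
# (get_feature_names_out replaces the latter), so this targets older releases.
from sklearn.datasets import load_boston
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.base import BaseEstimator, TransformerMixin
import pandas as pd

dat = load_boston()
X = pd.DataFrame(dat['data'], columns=dat['feature_names'])
y = dat['target']

# define first custom transformer
class first_transform(TransformerMixin, BaseEstimator):
    def fit(self, df, y=None):
        # remember the column names seen during fit
        self.columns_ = df.columns.tolist()
        return self

    def transform(self, df):
        return df

    def get_feature_names(self):
        return self.columns_


# define second custom transformer
class second_transform(TransformerMixin, BaseEstimator):
    def fit(self, df, y=None):
        self.columns_ = df.columns.tolist()
        return self

    def transform(self, df):
        return df

    def get_feature_names(self):
        return self.columns_


pipe = Pipeline([
    ('features', FeatureUnion([
        ('custom_transform_first', first_transform()),
        ('custom_transform_second', second_transform()),
    ])),
])
pipe.fit(X, y)  # the transformers only know their columns after fitting

>>> pipe.named_steps['features'].get_feature_names()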
['custom_transform_first__CRIM',
 'custom_transform_first__ZN',
 'custom_transform_first__INDUS',
 'custom_transform_first__CHAS',
 'custom_transform_first__NOX',
 'custom_transform_first__RM',
 'custom_transform_first__AGE',
 'custom_transform_first__DIS',
 'custom_transform_first__RAD',
 'custom_transform_first__TAX',
 'custom_transform_first__PTRATIO',
 'custom_transform_first__B',
 'custom_transform_first__LSTAT',
 'custom_transform_second__CRIM',
 'custom_transform_second__ZN',
 'custom_transform_second__INDUS',
 'custom_transform_second__CHAS',
 'custom_transform_second__NOX',
 'custom_transform_second__RM',
 'custom_transform_second__AGE',
 'custom_transform_second__DIS',
 'custom_transform_second__RAD',
 'custom_transform_second__TAX',
 'custom_transform_second__PTRATIO',
 'custom_transform_second__B',
 'custom_transform_second__LSTAT']

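Keep in mind that FeatureUnion concatenates the lists returned by each transformer's get_feature_names, prefixing every name with the transformer's name. That is why you get an error when one or more of your transformers do not have this method.

However, I can see that this alone will not solve your problem, because Pipeline objects have no get_feature_names method and you have nested pipelines (pipelines inside the FeatureUnion). So you have two options: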

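1. Subclass Pipeline and add a get_feature_names method yourself that returns the feature names from the last transformer in the chain.
2. Extract the feature names from each transformer yourself, which requires pulling those transformers out of the pipeline and calling get_feature_names on each of them.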
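
The first option could look roughly like the sketch below. Everything named here is an assumption made for illustration: LastStepNamesPipeline and union_feature_names are hypothetical names, TextSelector is a stand-in for the question's class, and the toy events are made up. The subclass defines both get_feature_names and get_feature_names_out so that FeatureUnion can find whichever one the installed scikit-learn release looks for.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline


class TextSelector(TransformerMixin, BaseEstimator):
    """Hypothetical stand-in for the question's TextSelector: picks one field."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


class LastStepNamesPipeline(Pipeline):
    """Pipeline that reports the feature names of its final step only."""

    def _last_step_names(self):
        last = self.steps[-1][1]
        if hasattr(last, "get_feature_names_out"):  # scikit-learn >= 1.0
            return list(last.get_feature_names_out())
        return list(last.get_feature_names())  # older releases

    def get_feature_names(self):
        return self._last_step_names()

    def get_feature_names_out(self, input_features=None):
        # input_features is ignored: the names come from the last step.
        return np.asarray(self._last_step_names(), dtype=object)


def union_feature_names(union):
    """Illustrative helper: call whichever method this release provides."""
    if hasattr(union, "get_feature_names_out"):
        return list(union.get_feature_names_out())
    return list(union.get_feature_names())


# Toy data, made up for illustration.
events = {
    "title": ["Jazz concert downtown", "Python data meetup"],
    "description_snippet": [
        "Live jazz music all night",
        "Talks about data pipelines",
    ],
}

union = FeatureUnion([
    ("title", LastStepNamesPipeline([
        ("selector", TextSelector(key="title")),
        ("count", CountVectorizer(stop_words="english")),
    ])),
    ("description", LastStepNamesPipeline([
        ("selector", TextSelector(key="description_snippet")),
        ("count", TfidfVectorizer(stop_words="english")),
    ])),
])
union.fit(events)

names = union_feature_names(union)
print(names)  # each name is prefixed with its branch, e.g. "title__..."
```

Delegating to the last step only works when that step is the one that defines the output columns, as a vectorizer does here; a pipeline ending in a feature selector or a scaler would need different handling.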

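Also, keep in mind that many built-in sklearn transformers do not operate on DataFrames but pass numpy arrays from one transformer to the next, so be careful if you chain many transformers together. Still, I think this gives you enough information to understand what is going on.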
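
As a small illustration of that caveat, the sketch below passes a DataFrame through StandardScaler under scikit-learn's default output configuration. The column names and values are made up; the point is only that the result is a plain numpy array, so the column names have to be reattached by hand if they are needed later.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy frame with made-up values.
df = pd.DataFrame({"RM": [6.5, 5.9, 7.1], "AGE": [65.2, 78.9, 45.8]})

scaled = StandardScaler().fit_transform(df)
print(type(scaled).__name__)        # a numpy array, not a DataFrame
print(hasattr(scaled, "columns"))   # the column names are gone

# Reattach the names manually if later steps rely on them.
restored = pd.DataFrame(scaled, columns=df.columns, index=df.index)
print(restored.columns.tolist())
```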

One more thing: have a look at sklearn-pandas. I have not used it myself, but it might provide a solution for you.