How do I fit different inputs into a single sklearn Pipeline?

I am using sklearn's Pipeline to classify text.

In this example Pipeline, I have a TfidfVectorizer and some custom features wrapped in a FeatureUnion, plus a classifier, as the Pipeline steps. I then fit the training data and predict:

from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC

X = ['I am a sentence', 'an example']
Y = [1, 2]
X_dev = ['another sentence']

# load custom features and FeatureUnion with Vectorizer
features = []
measure_features = MeasureFeatures() # this class includes my custom features
features.append(('measure_features', measure_features))

countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features= 4000)
features.append(('ngram', countVecWord))

all_features = FeatureUnion(features)

# classifier
LinearSVC1 = LinearSVC(tol=1e-4,  C = 0.10000000000000001)

pipeline = Pipeline(
    [('all', all_features ),
    ('clf', LinearSVC1),
    ])

pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

# etc.

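The code above works fine, but there is a twist: I want to run part-of-speech (POS) tagging on the text and use a different vectorizer on the tagged text.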
X = ['I am a sentence', 'an example']
X_tagged = do_tagging(X) 
# X_tagged = ['PP AUX DET NN', 'DET NN']
Y = [1, 2]
X_dev = ['another sentence']
X_dev_tagged = do_tagging(X_dev)

# load custom features and FeatureUnion with Vectorizer
features = []
measure_features = MeasureFeatures() # this class includes my custom features
features.append(('measure_features', measure_features))

countVecWord = TfidfVectorizer(ngram_range=(1, 3), max_features= 4000)
# new POS Vectorizer
countVecPOS = TfidfVectorizer(ngram_range=(1, 4), max_features= 2000)

features.append(('ngram', countVecWord))
features.append(('pos_ngram', countVecPOS))

all_features = FeatureUnion(features)

# classifier
LinearSVC1 = LinearSVC(tol=1e-4,  C = 0.10000000000000001)

pipeline = Pipeline(
    [('all', all_features ),
    ('clf', LinearSVC1),
    ])

# how do I fit both X and X_tagged here
# how can the different vectorizers get either X or X_tagged?
pipeline.fit(X, Y)
y_pred = pipeline.predict(X_dev)

# etc.

How do I properly fit this kind of data? How can the two vectorizers tell raw text from POS text? What are my options?

I also have custom features; some of them need the raw text and others need the POS text.

EDIT: added MeasureFeatures()

from sklearn.base import BaseEstimator
from sklearn.preprocessing import StandardScaler
import numpy as np

class MeasureFeatures(BaseEstimator):

    def __init__(self):
        pass

    def get_feature_names(self):
        return np.array(['type_token', 'count_nouns'])

    def fit(self, documents, y=None):
        return self

    def transform(self, x_dataset):


        X_type_token = list()
        X_count_nouns = list()

        for sentence in x_dataset:

            # takes raw text and calculates type token ratio
            X_type_token.append(type_token_ratio(sentence))

            # takes pos tag text and counts number of noun pos tags (NN, NNS etc.)
            X_count_nouns.append(count_nouns(sentence))

        X = np.array([X_type_token, X_count_nouns]).T

        print(X)
        print(X.shape)

        if not hasattr(self, 'scalar'):
            self.scalar = StandardScaler().fit(X)
        return self.scalar.transform(X)

This feature transformer needs to receive either the tagged text, for the count_nouns() function, or the raw text, for the type_token_ratio() function.

1 Answer

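I think you need a FeatureUnion over two transformers (TfidfTransformer and POSTransformer). Of course, you need to define POSTransformer yourself.
Maybe this article can help you: article
Your pipeline could look like this.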
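from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC

pipeline = Pipeline([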
  ('features', FeatureUnion([
    ('ngram_tf_idf', Pipeline([
      ('counts_ngram', CountVectorizer()),
      ('tf_idf_ngram', TfidfTransformer())
    ])),
    ('pos_tf_idf', Pipeline([
      ('pos', POSTransformer()),          
      ('counts_pos', CountVectorizer()),
      ('tf_idf_pos', TfidfTransformer())
    ])),
    ('measure_features', MeasureFeatures())
  ])),
  ('classifier', LinearSVC())
])

This assumes that MeasureFeatures and POSTransformer are both Transformers that conform to the sklearn API.
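
Since POSTransformer is left undefined above, here is a minimal sketch of what it might look like. It assumes tagging is a stateless, per-sentence operation; `toy_tagger` is a made-up lookup standing in for the asker's `do_tagging()` so the snippet runs on its own, and a real tagger would be plugged in through the `tagger` parameter (also an assumption, not part of the original answer).

```python
from sklearn.base import BaseEstimator, TransformerMixin


def toy_tagger(sentence):
    """Hypothetical stand-in for do_tagging(): looks up a tag per word."""
    lookup = {'i': 'PP', 'am': 'AUX', 'a': 'DET', 'an': 'DET'}
    # Unknown words default to 'NN' purely for illustration.
    return ' '.join(lookup.get(word.lower(), 'NN') for word in sentence.split())


class POSTransformer(BaseEstimator, TransformerMixin):
    """Turns raw sentences into strings of POS tags; learns nothing in fit."""

    def __init__(self, tagger=toy_tagger):
        self.tagger = tagger

    def fit(self, X, y=None):
        # Nothing to learn: tagging does not depend on the training data.
        return self

    def transform(self, X):
        # One tag string per input sentence, ready for CountVectorizer.
        return [self.tagger(sentence) for sentence in X]


tagged = POSTransformer().fit_transform(['I am a sentence', 'an example'])
print(tagged)  # ['PP AUX DET NN', 'DET NN']
```

Because the tagging happens inside the pipeline, only the raw text has to be passed to fit() and predict(); the POS branch derives its own input.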

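I added MeasureFeatures() in my latest edit. Basically, it needs the raw text for one set of features and the POS tag set for another. Would it help to use two MeasureFeature classes, one for raw-text features and another for POS-tag features? - Ivan Bilan
I don't see your workflow. Have a look at the one I proposed, the link, and this example (http://scikit-learn.org/stable/auto_examples/hetero_feature_union.html). After that you just have to think through your workflow and what happens to your data. - dooms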
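
For reference, a rough sketch in the spirit of the linked heterogeneous-data example, assuming the tagging is done up front and both views are passed together as a dict of parallel lists. The `ItemSelector` class, the dict keys 'raw' and 'pos', and the tags used for the dev sentence are illustrative assumptions, not something taken from the question.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.svm import LinearSVC


class ItemSelector(BaseEstimator, TransformerMixin):
    """Picks one entry (one view of the data) out of a dict of parallel lists."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return X[self.key]


# Both views travel together; in practice 'pos' would come from do_tagging().
train = {'raw': ['I am a sentence', 'an example'],
         'pos': ['PP AUX DET NN', 'DET NN']}
dev = {'raw': ['another sentence'], 'pos': ['DET NN']}  # made-up tags
Y = [1, 2]

pipeline = Pipeline([
    ('all', FeatureUnion([
        # Word n-grams see only the raw text.
        ('ngram', Pipeline([
            ('select', ItemSelector('raw')),
            ('tfidf', TfidfVectorizer(ngram_range=(1, 3), max_features=4000)),
        ])),
        # POS n-grams see only the tagged text.
        ('pos_ngram', Pipeline([
            ('select', ItemSelector('pos')),
            ('tfidf', TfidfVectorizer(ngram_range=(1, 4), max_features=2000)),
        ])),
    ])),
    ('clf', LinearSVC(tol=1e-4, C=0.1)),
])

pipeline.fit(train, Y)
y_pred = pipeline.predict(dev)
```

Under the same assumption, MeasureFeatures could be split into a raw-text transformer and a POS-text transformer, each placed behind its own ItemSelector as a further branch of the FeatureUnion.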
