TF-IDF vectorizer to extract n-grams

Question

7

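How can I use the TF-IDF vectorizer from the scikit-learn library to extract unigrams and bigrams from tweets? I want to use the output to train a classifier.

Here is the code from scikit-learn: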

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

- ECub Devs

2 Answers

3

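According to the documentation (https://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfVectorizer.html), you need to specify the n-gram range when initializing the TfidfVectorizer, e.g. TfidfVectorizer(ngram_range=(min_n, max_n)). The ngram_range parameter gives the lower and upper bounds of n for the n-grams to extract: (1, 1) means unigrams only, (1, 2) means unigrams and bigrams, and (2, 2) means bigrams only.

So the answer is vectorizer = TfidfVectorizer(ngram_range=(1,2)).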

- brain pinky

- yatu · Accepted Answer

TfidfVectorizer has an ngram_range parameter that determines the range of n-grams to include as features in the final matrix. In your case you want (1,2), which covers unigrams through bigrams:

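import pandas as pd

# Extract unigrams and bigrams; toarray() gives a plain ndarray for display
vectorizer = TfidfVectorizer(ngram_range=(1, 2))
X = vectorizer.fit_transform(corpus).toarray()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X, columns=vectorizer.get_feature_names_out())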

        and  and this  document  document is     first  first document  \
0  0.000000  0.000000  0.314532     0.000000  0.388510        0.388510   
1  0.000000  0.000000  0.455513     0.356824  0.000000        0.000000   
2  0.357007  0.357007  0.000000     0.000000  0.000000        0.000000   
3  0.000000  0.000000  0.282940     0.000000  0.349487        0.349487   

         is    is the   is this       one  ...       the  the first  \
0  0.257151  0.314532  0.000000  0.000000  ...  0.257151   0.388510   
1  0.186206  0.227756  0.000000  0.000000  ...  0.186206   0.000000   
2  0.186301  0.227873  0.000000  0.357007  ...  0.186301   0.000000   
3  0.231322  0.000000  0.443279  0.000000  ...  0.231322   0.349487   
...
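
Since the question mentions training a classifier on these features, here is a minimal sketch of one way to do that. The tweets and labels are invented for illustration, and LogisticRegression is only an assumed choice of classifier; the thread does not prescribe one.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: 1 = positive, 0 = negative (not from the original thread).
tweets = [
    'what a great day at the beach',
    'I love this new phone',
    'great service and friendly staff',
    'this is the worst movie ever',
    'I hate waiting in traffic',
    'terrible service and rude staff',
]
labels = [1, 1, 1, 0, 0, 0]

# The pipeline keeps the sparse TF-IDF matrix; there is no need to densify
# it before passing it to the classifier.
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),  # unigrams and bigrams
    LogisticRegression(),
)
model.fit(tweets, labels)

predictions = model.predict(['what a great phone'])
print(predictions)
```

Wrapping the vectorizer and classifier in one pipeline means new tweets are transformed with the vocabulary learned during fit, so training and prediction stay consistent.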