TF-IDF vectorizer to extract n-grams

7

How can I use the TF-IDF vectorizer from the scikit-learn library to extract unigrams and bigrams from tweets? I want to use the output to train a classifier.

This is the code from scikit-learn:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(corpus)

2 Answers

4

TfidfVectorizer has an ngram_range parameter that determines the range of n-grams you want as features in the final matrix. In your case, you want (1,2), which covers unigrams through bigrams:

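import pandas as pd

vectorizer = TfidfVectorizer(ngram_range=(1,2))
X = vectorizer.fit_transform(corpus).todense()

# get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X, columns=vectorizer.get_feature_names_out())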

        and  and this  document  document is     first  first document  \
0  0.000000  0.000000  0.314532     0.000000  0.388510        0.388510   
1  0.000000  0.000000  0.455513     0.356824  0.000000        0.000000   
2  0.357007  0.357007  0.000000     0.000000  0.000000        0.000000   
3  0.000000  0.000000  0.282940     0.000000  0.349487        0.349487   

         is    is the   is this       one  ...       the  the first  \
0  0.257151  0.314532  0.000000  0.000000  ...  0.257151   0.388510   
1  0.186206  0.227756  0.000000  0.000000  ...  0.186206   0.000000   
2  0.186301  0.227873  0.000000  0.357007  ...  0.186301   0.000000   
3  0.231322  0.000000  0.443279  0.000000  ...  0.231322   0.349487   
...
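
Since the question asks about training a classifier on these features, here is a minimal sketch of one way to do that, assuming a scikit-learn Pipeline with LogisticRegression. The classifier choice, the variable names `tweets`, `labels`, `model`, and the label values are all assumptions made for illustration; none of them come from the question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy stand-in for tweets; the labels below are invented for illustration only.
tweets = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]
labels = [0, 1, 1, 0]

# The pipeline fits the vectorizer and the classifier together, so the
# unigram and bigram vocabulary learned in fit() is reused in predict().
model = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),
    LogisticRegression(),
)
model.fit(tweets, labels)

# Predict on raw text; the pipeline applies the same TF-IDF transform first.
predictions = model.predict(tweets)
```

Passing the sparse matrix straight to the classifier this way avoids the todense() call, which is only needed for displaying the features in a DataFrame.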

Can I change the n-grams from words to characters? - ECub Devs
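
On the comment above: TfidfVectorizer also takes an analyzer parameter, and setting it to 'char' makes ngram_range count characters instead of words. A minimal sketch, assuming character 2-grams and 3-grams (the range and the names `char_vectorizer`, `X_char`, `char_features` are illustrative choices, not from the thread):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = [
    'This is the first document.',
    'This document is the second document.',
    'And this is the third one.',
    'Is this the first document?',
]

# analyzer='char' builds n-grams from characters; (2, 3) is an assumed range.
char_vectorizer = TfidfVectorizer(analyzer='char', ngram_range=(2, 3))
X_char = char_vectorizer.fit_transform(corpus)

# vocabulary_ maps each character n-gram to its column index.
char_features = sorted(char_vectorizer.vocabulary_)
```

The 'char_wb' analyzer is a variant that only builds character n-grams inside word boundaries, which may suit short texts such as tweets.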

3
