How do I select the top 1000 words using TF-IDF vectors?

Question

python-3.x · scikit-learn · tf-idf · sklearn-pandas · tfidfvectorizer

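I have a file containing 5000 reviews, held in sample_data. I applied a TF-IDF vectorizer to sample_data using a unigram range. Now I want to get the top 1000 words in sample_data, meaning the words with the highest tf-idf values. How can I get those top 1000 words?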

from sklearn.feature_extraction.text import TfidfVectorizer
tf_idf_vect = TfidfVectorizer(ngram_range=(1,1))
tf_idf_vect.fit(sample_data)
final_tf_idf = tf_idf_vect.transform(sample_data)

- merkle

1 Answer

- Vivek Kumar · Accepted Answer

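TF-IDF values depend on the individual documents. You can use the max_features parameter of TfidfVectorizer to get the top 1000 terms based on their counts (term frequency) across the corpus: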

max_features : int or None, default=None

If not None, build a vocabulary that only consider the top
max_features ordered by term frequency across the corpus.

Just do:

tf_idf_vect = TfidfVectorizer(ngram_range=(1,1), max_features=1000)
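
As a rough illustration of what max_features does, here is a minimal sketch on a toy corpus. The three reviews are a hypothetical stand-in for sample_data, and max_features=3 plays the role of 1000:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical stand-in for sample_data (the real one holds 5000 reviews).
sample_data = [
    "good product good price",
    "bad product price",
    "good service good product",
]

# Corpus-wide counts: good=4, product=3, price=2, bad=1, service=1.
# max_features=3 keeps only the three most frequent terms.
tf_idf_vect = TfidfVectorizer(ngram_range=(1, 1), max_features=3)
final_tf_idf = tf_idf_vect.fit_transform(sample_data)

top_terms = sorted(tf_idf_vect.vocabulary_)
print(top_terms)           # the kept vocabulary
print(final_tf_idf.shape)  # (number of documents, number of kept terms)
```

Note that the ranking here is by raw term count over the whole corpus, not by tf-idf value.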

After fitting on the documents, you can also get the global term weights ('idf') from tf_idf_vect through the idf_ attribute:

idf_ : array, shape = [n_features], or None

  The learned idf vector (global term weights) when use_idf is set to True, None otherwise.

After calling tf_idf_vect.fit(sample_data), do the following:

idf = tf_idf_vect.idf_

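Then pick the top 1000 from these and refit the data using only the selected features. You cannot get a top 1000 by "tf-idf" itself, though, because tf-idf is the product of a word's tf in a single document and its (global) idf over the vocabulary. If a word appears twice in one document, its tf-idf there (before normalization) is twice as high as in another document where the same word appears only once. So how would you compare the different values of the same term? I hope that makes it clear.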