Adding words to the stop word list in sklearn's TfidfVectorizer

Question

python scikit-learn classification stop-words text-classification

29

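I want to add a few more stop words to TfidfVectorizer. Following this solution, I built a stop word list that contains both the English stop words and the ones I specified. But TfidfVectorizer still does not honor my list: I can still see those words in my feature list. Here is my code: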

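from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# my_words: the asker's own list of extra stop words (defined elsewhere)
my_stop_words = text.ENGLISH_STOP_WORDS.union(my_words)

vectorizer = TfidfVectorizer(analyzer='word', max_df=0.95, lowercase=True, stop_words=set(my_stop_words), max_features=15000)
# corpus: the asker's documents; the original post named this variable `text`, which would shadow the imported module
X = vectorizer.fit_transform(corpus)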

I also tried passing stop_words=my_stop_words to TfidfVectorizer, but it still does not work. Please help.

- ac11

I ran your code here and got the expected result. Can you provide more details? - Gurupad Hegde

3

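I don't see any of the stop words in the feature list, so the reported behavior is as expected. The problem is the method used to filter these hashes. If you pass random strings as stop words to the vectorizer, it will not intelligently filter out similar strings. Stop words are exact, hard-coded strings to be filtered. Alternatively, you can use a regex to filter out all unwanted URLs before passing the text to the vectorizer, which may solve your URL problem. - Gurupad Hegde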
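A minimal sketch of the regex pre-filtering suggested in this comment, using only the standard library. The pattern, the `strip_urls` helper, and the sample documents are assumptions made for illustration; they are not taken from the asker's data.

```python
import re

# Hypothetical pattern: matches http(s):// links and bare www. links
# up to the next whitespace character.
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")


def strip_urls(document):
    """Remove URL-like substrings, then collapse the leftover whitespace."""
    without_urls = URL_PATTERN.sub(" ", document)
    return " ".join(without_urls.split())


# Made-up documents standing in for the asker's corpus.
docs = [
    "read more at https://example.com/a1b2c3 about apples",
    "see www.example.org/x9y8z7 for the book",
]
cleaned_docs = [strip_urls(doc) for doc in docs]
print(cleaned_docs)
```

The cleaned list can then be passed to `vectorizer.fit_transform(...)` in place of the raw documents, so URL fragments never reach the tokenizer and do not need to be listed as stop words one by one.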

@ac11, this didn't work for me. Which version of sklearn are you using? - Radu Gheorghiu

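Hey... this was a course project I did last November. I have even uninstalled sklearn since then, and I don't know of any other way to check the version. Sorry. - ac11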

Possible duplicate of "Adding words to scikit-learn's CountVectorizer's stop list" (https://dev59.com/OWAf5IYBdhLWcg3wsUXT) - Vivek Kumar

3 Answers

5

The answer is here: https://dev59.com/OWAf5IYBdhLWcg3wsUXT#24386751 Although sklearn.feature_extraction.text.ENGLISH_STOP_WORDS is a frozenset, you can make a copy of it, add your own words, and pass that variable to the stop_words argument as a list.
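A short sketch of what this answer describes, assuming a reasonably recent scikit-learn release. The extra words and the two sample documents are made up for illustration.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

# Hypothetical custom stop words.
extra_words = ["myword1", "myword2"]

# union() returns a new frozenset; the built-in set is left untouched.
combined = ENGLISH_STOP_WORDS.union(extra_words)

# sorted() turns the frozenset into a plain list for the stop_words argument.
stop_word_list = sorted(combined)

vectorizer = TfidfVectorizer(stop_words=stop_word_list)
vectorizer.fit(["myword1 likes the green apple", "myword2 reads a long novel"])

# Neither the custom words nor the built-in ones should appear as features.
print("myword1" in vectorizer.vocabulary_, "the" in vectorizer.vocabulary_)
```

Sorting is optional; `list(combined)` works as well. A sorted list just makes the stop word list easier to inspect.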

- yanhan

0

You can also use a list if you want to use it with scikit-learn:

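from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

# Requires the NLTK stop word corpus: nltk.download('stopwords')
stop = list(stopwords.words('english'))
stop.extend('myword1 myword2 myword3'.split())

# Deduplicate via set(), then convert back to a list, which recent scikit-learn releases require for stop_words
vectorizer = TfidfVectorizer(analyzer='word', stop_words=list(set(stop)))
# corpus: your iterable of documents, defined elsewhere
vectors = vectorizer.fit_transform(corpus)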
...

The only downside of this approach is that the list may contain duplicates, so I convert it to a set before passing it to TfidfVectorizer.
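If you would rather keep a list and preserve the original word order while dropping duplicates, one option is `dict.fromkeys`. This is only a sketch: `base_words` is a hypothetical stand-in for `stopwords.words('english')` so that the snippet runs without NLTK.

```python
# Hypothetical stand-in for stopwords.words('english').
base_words = ["the", "is", "a"]

# Custom words, deliberately containing duplicates.
extra_words = "myword1 the myword2 myword1".split()

stop = base_words + extra_words

# dict keys are unique and keep insertion order, so this removes duplicates
# while leaving the first occurrence of each word in place.
unique_stop = list(dict.fromkeys(stop))
print(unique_stop)  # ['the', 'is', 'a', 'myword1', 'myword2']
```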

- user2589273

- Pedram · Accepted Answer

Here is how you can do it:

from sklearn.feature_extraction import text
from sklearn.feature_extraction.text import TfidfVectorizer

# union() returns a frozenset; recent scikit-learn releases require a list for stop_words
my_stop_words = list(text.ENGLISH_STOP_WORDS.union(["book"]))

vectorizer = TfidfVectorizer(ngram_range=(1, 1), stop_words=my_stop_words)

X = vectorizer.fit_transform(["This is a green apple.", "This is a machine learning book."])

# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
idf_values = dict(zip(vectorizer.get_feature_names_out(), vectorizer.idf_))

# printing the tfidf vectors
print(X)

# printing the vocabulary
print(vectorizer.vocabulary_)

In this example, I create the tfidf vectors for two sample documents:

"This is a green apple."
"This is a machine learning book."

By default, this, is, a, and an are all in the ENGLISH_STOP_WORDS list. In addition, I added book to the stop word list. Here is the output:

(0, 1)  0.707106781187
(0, 0)  0.707106781187
(1, 3)  0.707106781187
(1, 2)  0.707106781187
{'green': 1, 'machine': 3, 'learning': 2, 'apple': 0}

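As we can see, the word book is also removed from the feature list because we listed it as a stop word. So TfidfVectorizer did accept the manually added word as a stop word and ignored it when creating the vectors.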
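To check this programmatically rather than by reading the printed vocabulary, something like the following sketch may help. It assumes a recent scikit-learn release and reuses the two sample documents above; the variable names are illustrative only.

```python
from sklearn.feature_extraction.text import ENGLISH_STOP_WORDS, TfidfVectorizer

custom_stop_words = list(ENGLISH_STOP_WORDS.union(["book"]))
vectorizer = TfidfVectorizer(stop_words=custom_stop_words)
vectorizer.fit(["This is a green apple.", "This is a machine learning book."])

# get_stop_words() returns the stop word collection the vectorizer actually uses.
effective = vectorizer.get_stop_words()

# Any word present in both the vocabulary and the stop list would be a leak.
leaked = sorted(set(vectorizer.vocabulary_) & set(effective))

print("book" in effective)
print(leaked)
```

If `leaked` is empty but unwanted features still show up, those features are probably not exact matches for any stop word (for example, URL fragments or hashes), since stop words are compared against whole tokens after lowercasing.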