Adding words to the NLTK stopword list

Question

23

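I have some code that removes stop words from my data set. The stock stop list does not seem to remove most of the words I want removed, so I would like to add words to this stop list so that they are removed as well.

The code I am using to remove stop words is: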

word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]

I am unsure of the correct syntax for adding words and cannot seem to find it anywhere. Any help is appreciated. Thanks.

- Alex

10 Answers

7

import nltk
stopwords = nltk.corpus.stopwords.words('english')
new_words=('re','name', 'user', 'ct')
for i in new_words:
    stopwords.append(i)
print(stopwords)

- user2110417

3

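On my Ubuntu machine, I searched for "stopwords" from the root directory with ctrl + F. It gave me a folder. I went into that folder, which contained several files. I opened the "english" file, which had only 128 words in it, added my words to it, saved, and was done.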

- Sankalp

2

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords

# add new words to the list
new_stopwords = ["new", "custom", "words", "add", "to", "list"]
stopwrd = stopwords.words('english')
stopwrd.extend(new_stopwords)

- Kiran

2

I was also looking for a solution to this. After some trial and error I managed to add words to the stop word list. I hope this helps.

from nltk.corpus import stopwords

def removeStopWords(text):
    # select English stopwords
    cachedStopWords = set(stopwords.words("english"))
    # add custom words
    cachedStopWords.update(('and','I','A','And','So','arnt','This','When','It','many','Many','so','cant','Yes','yes','No','no','These','these'))
    # remove stop words
    new_str = ' '.join([word for word in text.split() if word not in cachedStopWords])
    return new_str
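
The custom words above list both capitalizations of several entries ('and'/'And', 'many'/'Many'). One possible alternative, sketched here under the assumption that case does not matter for your data, is to keep only lowercase entries and lowercase each token before the lookup. The function name and the small `custom` set are illustrative, not part of NLTK:

```python
def remove_stop_words_ci(text, stopset):
    """Drop every token whose lowercase form is in stopset (all-lowercase entries)."""
    return " ".join(word for word in text.split() if word.lower() not in stopset)

# Small illustrative set; in practice you would start from
# set(stopwords.words("english")) and update it with your own words.
custom = {"and", "so", "this", "many", "yes", "no", "these"}

print(remove_stop_words_ci("So many apples And These pears", custom))  # apples pears
```

The kept words retain their original casing; only the comparison is case-insensitive.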

- Aubrey_lab

2

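The English stop word list is a file at nltk/corpus/stopwords/english.txt (I guess it would be here... I don't have nltk installed on this machine. The best thing would be to search for 'english.txt' in the nltk repo).

You can just add your new stop words to this file.

If your stop word list grows to a few hundred words, try looking at bloom filters.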

- Rafi

Is there a good English stop word list out there? The one in nltk seems rather poor. - fabrizioM

1

@fabrizioM Here is the stop word list I used at my previous company. http://fs1.position2.com/bm/txt/stopwords.txt - Rafi

@Rafi This list is much better than NLTK's! Thanks! - tumultous_rooster

2

I always do stopset = set(nltk.corpus.stopwords.words('english')) at the top of any module that needs it. Then it is easy to add more words to the set, and membership checks are faster.
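
A minimal sketch of that pattern, applied to the filter from the question. To keep it runnable without the NLTK data, `base_stopwords` is a short stand-in for `nltk.corpus.stopwords.words('english')`; the sample `word_list` is made up for illustration:

```python
# Stand-in for nltk.corpus.stopwords.words('english'); the real list is much longer.
base_stopwords = ["i", "me", "my", "the", "a", "and"]

# Build the set once, at module level, so each lookup is a hash check
# instead of a scan through a list.
stopset = set(base_stopwords)

# Add a single word, or several at once.
stopset.add("user")
stopset.update(["re", "name", "ct"])

word_list = ["the ", "user", " name", "alex", "and", "nltk"]
word_list2 = [w.strip() for w in word_list if w.strip() not in stopset]
print(word_list2)  # ['alex', 'nltk']
```

Note that sets use add and update where lists use append and extend.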

- Jacob

1

I use this code to add new stop words to the nltk stop word list in Python.

from nltk.corpus import stopwords
#...#
stop_words = set(stopwords.words("english"))

#add words that aren't in the NLTK stopwords list
new_stopwords = ['apple','mango','banana']
new_stopwords_list = stop_words.union(new_stopwords)

print(new_stopwords_list)

- Jayantha

0

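I found (Python 3.7, Jupyter notebook on Windows 10, corporate firewall) that creating a list and using the 'append' command results in the entire stop word list being added as a single element of the original list.

This turns 'stopwords' into a list of lists.

Snijesh's answer works well, as does Jayantha's answer.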
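
The nesting described here is what list.append does whenever its argument is itself a list; list.extend adds the elements one by one instead. A small illustration with plain lists (the short `base` list stands in for the real stop word list):

```python
base = ["i", "me", "my"]      # stand-in for stopwords.words('english')
new_words = ["re", "name"]

nested = list(base)
nested.append(new_words)      # the whole list becomes ONE element
print(nested)                 # ['i', 'me', 'my', ['re', 'name']]
print("re" in nested)         # False: 're' is hidden inside the inner list

flat = list(base)
flat.extend(new_words)        # each word is added as its own element
print(flat)                   # ['i', 'me', 'my', 're', 'name']
print("re" in flat)           # True
```

With the nested version, a filter such as `w not in nested` silently fails to remove the new words.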

- Barry DeCicco

0

STOP_WORDS.add("Lol")  # add new stop words to the corpus as required

- Nirmani Warakaulla

- Oziel Carneiro · Accepted Answer

You can simply use the append method to add words to it:

stopwords = nltk.corpus.stopwords.words('english')
stopwords.append('newWord')

or extend to append a list of words, as suggested by Charlie in the comments.

stopwords = nltk.corpus.stopwords.words('english')
newStopWords = ['stopWord1','stopWord2']
stopwords.extend(newStopWords)
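
To tie this back to the filter in the question: the comprehension has to test against the extended `stopwords` variable rather than calling `nltk.corpus.stopwords.words('english')` again. The sketch below assumes each such call builds a fresh list that will not contain your additions. `load_english_stopwords` is a hypothetical stand-in for the NLTK call so the example runs without the corpus, and the sample `word_list` is made up:

```python
def load_english_stopwords():
    """Hypothetical stand-in for nltk.corpus.stopwords.words('english')."""
    return ["i", "me", "my", "the", "a", "and"]

stopwords = load_english_stopwords()
newStopWords = ["stopWord1", "stopWord2"]
stopwords.extend(newStopWords)

word_list = ["the", " stopWord1 ", "dataset", "stopWord2", "a", "model"]

# Filter against the extended variable, not a fresh call to the loader.
word_list2 = [w.strip() for w in word_list if w.strip() not in stopwords]
print(word_list2)  # ['dataset', 'model']
```

Hoisting the list into a variable also avoids rebuilding it for every word, which the original one-liner does inside its condition.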