Adding and removing words from the NLTK stopwords list

5
I am trying to add words to and remove words from the NLTK stopwords list:
from nltk.corpus import stopwords

stop_words = set(stopwords.words('french'))

#add words that aren't in the NLTK stopwords list
new_stopwords = ['cette', 'les', 'cet']
new_stopwords_list = set(stop_words.extend(new_stopwords))

#remove words that are in NLTK stopwords list
not_stopwords = {'n', 'pas', 'ne'} 
final_stop_words = set([word for word in new_stopwords_list if word not in not_stopwords])

print(final_stop_words)

Output:

Traceback (most recent call last):
  File "test_stop.py", line 10, in <module>
new_stopwords_list = set(stop_words.extend(new_stopwords))
AttributeError: 'set' object has no attribute 'extend'
3 Answers

7

Try this (a set has no extend method, so use union to add the new words):

from nltk.corpus import stopwords

stop_words = set(stopwords.words('french'))

#add words that aren't in the NLTK stopwords list
new_stopwords = ['cette', 'les', 'cet']
new_stopwords_list = stop_words.union(new_stopwords)

#remove words that are in NLTK stopwords list
not_stopwords = {'n', 'pas', 'ne'} 
final_stop_words = set([word for word in new_stopwords_list if word not in not_stopwords])

print(final_stop_words)
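
The same add-then-remove logic can also be written with Python's set operators. This is only a sketch: the small hard-coded `stop_words` set is an assumed stand-in for `set(stopwords.words('french'))`, so that it runs without downloading the NLTK corpus.

```python
# Assumed stand-in for set(stopwords.words('french')); the real list is much longer.
stop_words = {'le', 'la', 'ne', 'pas', 'n', 'et'}

new_stopwords = {'cette', 'les', 'cet'}   # words to add
not_stopwords = {'n', 'pas', 'ne'}        # words to remove

# "|" is set union, "-" is set difference; neither modifies stop_words.
final_stop_words = (stop_words | new_stopwords) - not_stopwords

print(sorted(final_stop_words))
# ['cet', 'cette', 'et', 'la', 'le', 'les']
```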

2

You can use update instead of extend. Replace the line new_stopwords_list = set(stop_words.extend(new_stopwords)) with the following:

stop_words.update(new_stopwords)
new_stopwords_list = set(stop_words)

By the way, giving a set a name that contains "list" can be confusing.
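
One difference between update and union is worth keeping in mind: update changes the set in place and returns None, while union leaves the original untouched and returns a new set. A small sketch, using an assumed two-word set in place of the NLTK list:

```python
# Assumed stand-in for the NLTK French stopwords.
stop_words = {'le', 'la'}

# update() mutates stop_words and returns None.
result = stop_words.update(['cette', 'les'])
print(result)                # None
print(sorted(stop_words))    # ['cette', 'la', 'le', 'les']

# union() returns a new set and leaves stop_words unchanged.
merged = stop_words.union(['cet'])
print('cet' in stop_words)   # False
print('cet' in merged)       # True
```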

1

Use list(set(...)) instead of set(...), because only lists have a method called extend:

...
stop_words = list(set(stopwords.words('french')))
...
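
If you go the list route, note that list.extend also modifies the list in place and returns None, so the original line set(stop_words.extend(new_stopwords)) would still fail, this time with a TypeError. A sketch of how the list version might look, again with an assumed stand-in for the NLTK words:

```python
# Assumed stand-in for list(set(stopwords.words('french'))).
stop_words = list({'le', 'la', 'ne', 'pas'})

new_stopwords = ['cette', 'les', 'le']   # 'le' is already present
not_stopwords = {'n', 'pas', 'ne'}

# Call extend on its own line; it returns None, so don't wrap it in set().
stop_words.extend(new_stopwords)

# Converting back to a set drops the duplicate 'le'.
final_stop_words = set(stop_words) - not_stopwords

print(sorted(final_stop_words))
# ['cette', 'la', 'le', 'les']
```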
