Clean list after removing stop words

3
This variable:
sent=[('include', 'details', 'about', 'your performance'),
('show', 'the', 'results,', 'which', 'you\'ve', 'got')]

needs to have its stop words removed.

I tried using

output = [w for w in sent if not w in stop_words]

but it did not work. What went wrong?

3 Answers

8
from nltk.corpus import stopwords

stop_words = {w.lower() for w in stopwords.words('english')}

sent = [('include', 'details', 'about', 'your', 'performance'),
        ('show', 'the', 'results,', 'which', 'you\'ve', 'got')]

If you want a single flat list of words with the stop words removed:
>>> no_stop_words = [word for sentence in sent for word in sentence if word not in stop_words]
>>> no_stop_words
['include', 'details', 'performance', 'show', 'results,', 'got']

If you want to keep the sentences separate:
>>> sent_no_stop = [[word for word in sentence if word not in stop_words] for sentence in sent]
>>> sent_no_stop
[['include', 'details', 'performance'], ['show', 'results,', 'got']]

However, most of the time you will be working with a plain list of words (no parentheses):
sent = ['include', 'details', 'about', 'your', 'performance', 'show', 'the', 'results,', 'which', 'you\'ve', 'got']

>>> no_stopwords = [word for word in sent if word not in stop_words]
>>> no_stopwords
['include', 'details', 'performance', 'show', 'results,', 'got']

3
Note that for any non-trivial size, "stop_words" should be a "set" rather than a "list". Use "stop_words = {w.lower() for w in stopwords.words('english')}" to achieve this. - MisterMiyagi
2
Note that a set comprehension "{... for ... in ...}" always creates a set, even if the iterable is empty. Only a dict comprehension "{...: ... for ... in ...}" creates a dict. - MisterMiyagi
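
Both comments can be checked quickly. This is only a minimal sketch: "stop_word_list" is a hand-picked stand-in (an assumption) so that it runs without NLTK's corpus, and the variable names are made up for illustration.

```python
# Hypothetical stand-in for stopwords.words('english'); NLTK's real list is longer.
stop_word_list = ['about', 'your', 'the', 'which']

# Set comprehension: membership tests take constant time on average,
# whereas `in` on a list scans element by element.
stop_words = {w.lower() for w in stop_word_list}

# A set comprehension over an empty iterable still yields a set ...
empty_from_comprehension = {w for w in []}
# ... while bare braces yield a dict.
empty_braces = {}

print(type(stop_words).__name__)                # set
print(type(empty_from_comprehension).__name__)  # set
print(type(empty_braces).__name__)              # dict
```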

6

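The problem is that the parentheses get in the way of the iteration: each element of "sent" is a tuple of words rather than a single word, so the membership test never matches. If you can get rid of them: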

sent=['include', 'details', 'about', 'your performance','show', 'the', 'results,', 'which', 'you\'ve', 'got']
output = [w for w in sent if w not in stop_words]

If that is not possible, you can do this:

sent=[('include', 'details', 'about', 'your performance'),('show', 'the', 'results,', 'which', 'you\'ve', 'got')]
output = [w for words in sent for w in words if w not in stop_words]
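
To see why the comprehension from the question left everything in place, here is a minimal sketch. The small "stop_words" set is a hand-picked stand-in (an assumption) so that the snippet runs without NLTK's corpus, and itertools.chain.from_iterable is just one alternative way to flatten the tuples, not something the question used.

```python
from itertools import chain

# Hypothetical stand-in for NLTK's English stop word list.
stop_words = {'about', 'your', 'the', 'which', "you've"}

sent = [('include', 'details', 'about', 'your', 'performance'),
        ('show', 'the', 'results,', 'which', 'you\'ve', 'got')]

# Iterating over sent yields whole tuples, and no tuple is ever
# a member of a set of strings, so nothing gets filtered out.
unfiltered = [w for w in sent if w not in stop_words]
print(unfiltered == sent)  # True

# Flatten the tuples first, then filter word by word.
output = [w for w in chain.from_iterable(sent) if w not in stop_words]
print(output)
# ['include', 'details', 'performance', 'show', 'results,', 'got']
```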

0

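Are you missing quotes in your code? Make sure you close all strings, and escape your apostrophes with a backslash when the string is delimited by the same type of quote. I would also separate each word, like this: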

sent=[('include', 'details', 'about', 'your', 'performance'), ('show', 'the', 'results,', 'which', 'you\'ve', 'got')]
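
Separating the words matters because 'your performance' as a single string never equals the stop word 'your'. If the data already arrives with multi-word strings, one possible way to split them is str.split, sketched here with a hand-picked "stop_words" stand-in (an assumption, not NLTK's list):

```python
# Hypothetical stand-in for NLTK's English stop word list.
stop_words = {'about', 'your', 'the', 'which', "you've"}

sent = [('include', 'details', 'about', 'your performance'),
        ('show', 'the', 'results,', 'which', 'you\'ve', 'got')]

# Split every string on whitespace so that 'your performance'
# becomes the two tokens 'your' and 'performance'.
words = [w for row in sent for phrase in row for w in phrase.split()]

output = [w for w in words if w not in stop_words]
print(output)
# ['include', 'details', 'performance', 'show', 'results,', 'got']
```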
