Stopwords NLTK/Python problem

Question

5

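I have some code that processes a data set for later use. The stopword code itself seems fine, so I think the problem lies in the rest of the code, because only some of the stopwords are being removed.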

import re
import nltk

# Quran subset
filename = 'subsetQuran.txt'

# create list of lower case words
word_list = re.split('\s+', file(filename).read().lower())
print 'Words in text:', len(word_list)

word_list2 = [w for w in word_list if not w in nltk.corpus.stopwords.words('english')]



# create dictionary of word:frequency pairs
freq_dic = {}
# punctuation and numbers to be removed
punctuation = re.compile(r'[-.?!,":;()|0-9]') 
for word in word_list2:
    # remove punctuation marks
    word = punctuation.sub("", word)
    # form dictionary
    try: 
        freq_dic[word] += 1
    except: 
        freq_dic[word] = 1


print '-'*30

print "sorted by highest frequency first:"
# create list of (val, key) tuple pairs
freq_list2 = [(val, key) for key, val in freq_dic.items()]
# sort by val or frequency
freq_list2.sort(reverse=True)
freq_list3 = list(freq_list2)
# display result
for freq, word in freq_list2:
    print word, freq
f = open("wordfreq.txt", "w")
f.write( str(freq_list3) )
f.close()

The output looks like this:

[(71, 'allah'), (65, 'ye'), (46, 'day'), (21, 'lord'), (20, 'truth'), (20, 'say'), (20, 'and')

This is only a small sample; there are other words that should have been removed as well.

Any help would be appreciated.

- Alex

1 Answer

- Rafi · Accepted Answer

4

Try stripping the words while creating word_list2.

word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')]

- Rafi

1

if not w in ... or if w not in ...? - eumiro

1

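Yes. (To clarify: suppose the text contains "yes ... and, no."; then word_list will contain yes, ..., and,, no., and even though and and no are stop words, and, and no. are not.) [This is a reply to Rafi, not to eumiro. @eumiro, either works, and I doubt there is much difference in performance or clarity.] - Gareth McCaughan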
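The point about punctuation can be checked with a small sketch. It is written for Python 3 (the question uses Python 2), and STOPWORDS is a hypothetical stand-in for nltk.corpus.stopwords.words('english') so that the snippet runs without NLTK installed.

```python
# Hypothetical mini stopword set, standing in for
# nltk.corpus.stopwords.words('english') (an assumption for this sketch).
STOPWORDS = {"and", "no", "the", "it"}

text = "yes ... and, no."
tokens = text.lower().split()  # split on whitespace only

# Filter exactly as in the question: compare each raw token to the stopwords.
kept = [w for w in tokens if w not in STOPWORDS]

# str.strip() with no arguments removes only surrounding whitespace,
# so the trailing comma and period are still attached.
kept_stripped = [w.strip() for w in tokens if w.strip() not in STOPWORDS]

print(kept)
print(kept_stripped)
```

Both lists still contain 'and,' and 'no.', because neither string is equal to an entry in the stopword set.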

Try this: word_list2 = [w.strip() for w in word_list if w.strip() not in nltk.corpus.stopwords.words('english')] - Rafi

No, this still leaves me with "and", "the", "it", and so on. - Alex

1

You should apply punctuation.sub to word_list before creating word_list2. - Jacob
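One way to read Jacob's suggestion is sketched below: strip punctuation first, then filter against the stopwords, then count. This is a minimal Python 3 sketch, not the asker's final code. The function name word_frequencies and the small STOPWORDS set are assumptions made for illustration; in real use the set could presumably be built once with frozenset(nltk.corpus.stopwords.words('english')), which also avoids re-evaluating that call for every word as the original list comprehension does.

```python
import re
from collections import Counter

# Hypothetical stand-in for the NLTK English stopword list (assumption).
STOPWORDS = frozenset({"and", "the", "it", "of", "to", "a", "in", "is"})

# Same character class as in the question: punctuation and digits to remove.
PUNCTUATION = re.compile(r'[-.?!,":;()|0-9]')


def word_frequencies(text, stopwords=STOPWORDS):
    """Count words after removing punctuation and then stopwords."""
    raw_words = text.lower().split()
    # Remove punctuation BEFORE the stopword test, so 'and,' becomes 'and'.
    cleaned = (PUNCTUATION.sub("", w) for w in raw_words)
    # Drop empty strings (tokens that were pure punctuation) and stopwords.
    kept = [w for w in cleaned if w and w not in stopwords]
    return Counter(kept)


freqs = word_frequencies("And the truth, and the day. Say: the truth is it!")
for word, count in freqs.most_common():
    print(word, count)
```

With this ordering, "and," and "it!" are reduced to "and" and "it" before the membership test, so they are filtered out like any other stopword.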
