I have a French text file and want to find the most frequent words in it, excluding stop words. Here is my code:
with open('./text_file.txt', 'r', encoding='utf8') as f:
    s = f.read()
num_chars = len(s)
num_lines = s.count('\n')
# call split with no arguments
words = s.split()
d = {}
for w in words:
    if w in d:
        d[w] += 1
    else:
        d[w] = 1
num_words = sum(d[w] for w in d)
lst = [(d[w], w) for w in d]
lst.sort()
lst.reverse()
# nltk treatment
from nltk.corpus import stopwords  # Import the stop word list
from nltk.tokenize import wordpunct_tokenize
stop_words = set(stopwords.words('french'))  # creating a set makes the searching faster
print(stop_words)
# note: lst holds (count, word) tuples, so this membership test compares
# tuples against strings and never filters anything out
print([word for word in lst if word not in stop_words])
print('\n The 50 most frequent words are \n')
i = 1
for count, word in lst[:50]:
    print('%2s. %4s %s' % (i, count, word))
    i += 1
This returns the most frequent words including the stop words. Do you have a better idea?
Build the `stop_words` set first and check against it inside the `if w in d` loop; that way you don't have to count the stop words first and then remove them afterwards. - Finn
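A minimal sketch of that suggestion: filter stop words while counting, using `collections.Counter` instead of a manual dict. The tiny `stop_words` set below is a hypothetical sample for illustration; in practice you would use `set(stopwords.words('french'))` from NLTK as in the original code.

```python
from collections import Counter

# Illustrative sample only; replace with set(stopwords.words('french'))
stop_words = {"le", "la", "les", "de", "et", "un", "une"}

def top_words(text, n=50):
    """Return the n most common words in text, skipping stop words."""
    words = (w.lower() for w in text.split())
    counts = Counter(w for w in words if w not in stop_words)
    return counts.most_common(n)

sample = "le chat et le chien et la souris"
print(top_words(sample, 3))  # [('chat', 1), ('chien', 1), ('souris', 1)]
```

Because the stop words never enter the `Counter`, there is no second pass to remove them, and `most_common(n)` replaces the manual sort/reverse/slice.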