Print the 10 most frequent words in a text, including and excluding stopwords

15

I took this question from here and added my own changes. I have the following code:

import nltk
def content_text(text):
    stopwords = nltk.corpus.stopwords.words('english')
    content = [w for w in text if w.lower() in stopwords]
    return content

How can I print the 10 most frequently occurring words in a text, 1) including stopwords and 2) excluding stopwords?


Possible duplicate of: How can I count the occurrences of a list item in Python? - ivan_pozdeev
3 Answers

23

nltk provides a FreqDist class for this.

import nltk
allWords = nltk.tokenize.word_tokenize(text)
allWordDist = nltk.FreqDist(w.lower() for w in allWords)

stopwords = nltk.corpus.stopwords.words('english')
# lowercase before the lookup: the stopword list is all lowercase
allWordExceptStopDist = nltk.FreqDist(w.lower() for w in allWords if w.lower() not in stopwords)

To extract the 10 most common:

# most_common(10) returns a list of (word, count) pairs, so keep only the words
mostCommon = [word for word, _ in allWordDist.most_common(10)]
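
For readers without nltk installed, the same idea can be sketched with only the standard library. This is a minimal sketch, not the answer's own code: the helper `top_words` and the tiny `STOPWORDS` set are hypothetical names introduced for illustration, and a real run would presumably use `nltk.corpus.stopwords.words('english')` instead.

```python
from collections import Counter

# Hypothetical, tiny stopword set for illustration only; in practice it
# would come from nltk.corpus.stopwords.words('english').
STOPWORDS = {"the", "a", "of", "and", "is"}


def top_words(tokens, n=10, stopwords=None):
    """Return the n most common lowercased tokens, optionally skipping stopwords."""
    lowered = (t.lower() for t in tokens)
    if stopwords is not None:
        # tokens are already lowercase here, so the lookup matches the list
        lowered = (t for t in lowered if t not in stopwords)
    return Counter(lowered).most_common(n)


tokens = "The cat and the dog and the cat saw THE Cat dog".split()
with_stop = top_words(tokens, n=2)                          # stopwords included
without_stop = top_words(tokens, n=2, stopwords=STOPWORDS)  # stopwords excluded
print(with_stop)
print(without_stop)
```

Lowercasing before the stopword lookup is the important detail: a token such as "The" would otherwise slip past an all-lowercase stopword list.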

I got this error: AttributeError: 'FreqDist' object has no attribute 'most_common'. - user2064809
Could you provide the full listing? - igorushi
2
You should look up stopwords using the lowercased string. Change allWordExceptStopDist = nltk.FreqDist(w.lower() for w in allWords if w not in stopwords) to allWordExceptStopDist = nltk.FreqDist(w.lower() for w in allWords if w.lower() not in stopwords) - abevieiramota

5

I am not sure about the is stopwords in your function; I imagine it needs to be in. In any case, you can use a Counter dict with most_common(10) to get the 10 most frequent words:

import nltk
from collections import Counter
from string import punctuation


def content_text(text):
    stopwords = set(nltk.corpus.stopwords.words('english'))  # O(1) lookups
    with_stp = Counter()
    without_stp = Counter()
    with open(text) as f:
        for line in f:
            # lowercase each token and strip trailing punctuation before the lookup
            words = [w.lower().rstrip(punctuation) for w in line.split()]
            # count the words in the line that are stopwords
            with_stp.update(w for w in words if w in stopwords)
            # count the words in the line that are not stopwords (skip empty tokens)
            without_stp.update(w for w in words if w and w not in stopwords)
    # return the ten most common (word, count) pairs from each counter
    return with_stp.most_common(10), without_stp.most_common(10)


wth_stop, wthout_stop = content_text(...)

If you are passing in an nltk file object, just iterate over it:

def content_text(text):
    stopwords = set(nltk.corpus.stopwords.words('english'))
    with_stp = Counter()
    without_stp = Counter()
    for word in text:
        word = word.lower()
        if word in stopwords:
            # count the words that are stopwords
            with_stp[word] += 1
        else:
            # count the words that are not stopwords
            without_stp[word] += 1
    # return the ten most common words from each counter
    return [k for k, _ in with_stp.most_common(10)], [k for k, _ in without_stp.most_common(10)]

print(content_text(nltk.corpus.inaugural.words('2009-Obama.txt')))

The nltk method includes punctuation tokens, which may not be what you want.
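
One possible way to drop punctuation-only tokens before counting is sketched below. It is an illustration under assumptions rather than part of the answer: the helper name `count_words` and the sample token list are hypothetical, and the rule used (keep a token only if it contains at least one letter or digit) is just one reasonable choice.

```python
from collections import Counter


def count_words(tokens):
    """Count lowercased tokens, skipping tokens that contain no letter or digit."""
    return Counter(
        t.lower() for t in tokens
        if any(ch.isalnum() for ch in t)  # drops ",", ".", "--", "!" and similar
    )


# Hypothetical sample resembling the word list a corpus reader might return.
tokens = ["Hello", ",", "world", ".", "Hello", "--", "don't", "!"]
counts = count_words(tokens)
print(counts.most_common(10))
```

Checking for at least one alphanumeric character keeps contractions such as "don't" intact, whereas a stricter `str.isalpha()` test would discard them.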


1
@user2064809, I tested it and it works fine for me. What error are you getting? - Padraic Cunningham
TypeError: coercing to Unicode: need string or buffer, StreamBackedCorpusView found - user2064809
What should I pass to the content_text() function? - user2064809
@HåkenLid, that was just a typo from copy/pasting. There is no need to import print_function - Padraic Cunningham
It works! Thanks. In the first code I should pass the file path on my computer: wth_stop,wthout_stop = content_text('C:\\Documents and Settings\\Application Data\\nltk_data\\corpora\\inaugural\\2009-Obama.txt') instead of nltk.corpus.inaugural.words('2009-Obama.txt'). But in the second code, print(content_text(nltk.corpus.inaugural.words('2009-Obama.txt'))) works!! - user2064809

1
You can try this:

# allWordDist is the FreqDist built in the answer above
for word, frequency in allWordDist.most_common(10):
    print('%s;%d' % (word, frequency))
