Counting the frequency of three-word phrases

4
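I have the following code to find the frequency of two-word phrases. I need to do the same for three-word phrases, but the code below does not seem to work for them.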
from collections import Counter
import re

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = re.findall(r'\w+', sentence)
two_words = [' '.join(ws) for ws in zip(words, words[1:])]
wordscount = {w:f for w, f in Counter(two_words).most_common() if f > 1}
wordscount
{'show makes': 2, 'makes me': 2, 'I love': 2}
4 Answers

3

You can use collections.Counter on an iterable of three-word groups built with a generator expression and list slicing:

from collections import Counter

three_words = (words[i:i+3] for i in range(len(words)-2))
counts = Counter(map(tuple, three_words))
wordscount = {' '.join(word): freq for word, freq in counts.items() if freq > 1}

print(wordscount)

{'show makes me': 2}

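Note that str.join is not applied until the very end, which avoids unnecessary repeated string operations. The tuple conversion is needed for Counter because dict keys must be hashable.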
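To illustrate the hashability point, here is a minimal sketch that reuses the sentence from the question. The variable names list_is_hashable, trigram_counts and repeated are made up for this illustration and are not part of the answer above.

```python
from collections import Counter
import re

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = re.findall(r'\w+', sentence)

# A list slice cannot be counted directly: lists are unhashable,
# so Counter raises TypeError when it tries to use one as a key.
try:
    Counter([words[0:3]])
    list_is_hashable = True
except TypeError:
    list_is_hashable = False

# Converting each slice to a tuple gives a hashable key.
trigram_counts = Counter(tuple(words[i:i + 3]) for i in range(len(words) - 2))

# str.join runs only for the trigrams that survive the filter.
repeated = {' '.join(t): f for t, f in trigram_counts.items() if f > 1}

print(list_is_hashable)
print(repeated)
```

Joining only after the filter means the string is built once per repeated trigram rather than once per window, which is the saving the note above refers to.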


2

I would suggest splitting the functionality into a separate function:

def nwise(iterable, n):
    """
    Iterate over n-grams of an iterable.
    Has a bit of an overhead compared to pairwise (although only during
    initialization), so the two functions are implemented independently.
    """
    iterables = [iter(iterable) for _ in range(n)]
    for index, it in enumerate(iterables):
        for _ in range(index):
            next(it)
    yield from zip(*iterables)

Then you can do:

two_words = [" ".join(bigram) for bigram in nwise(words, 2)]

and

three_words = [" ".join(trigram) for trigram in nwise(words, 3)]

and so on. You can then use collections.Counter on top of this:

three_word_counts = Counter(" ".join(trigram) for trigram in nwise(words, 3))

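This is nice. I just think the expensive str.join should be deferred until the end, after filtering by minimum count. - jpp
@jpp I don't think it will be a problem, but you could also feed nwise(words, 3) directly into the Counter and apply str.join only when needed. - L3viathan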
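The suggestion in the comments could look roughly like the sketch below. nwise is repeated from the answer (docstring shortened) so the snippet runs on its own; trigram_counts and repeated are illustrative names, and the f > 1 filter is borrowed from the question.

```python
from collections import Counter
import re

def nwise(iterable, n):
    """Iterate over n-grams of an iterable (same body as in the answer above)."""
    iterables = [iter(iterable) for _ in range(n)]
    for index, it in enumerate(iterables):
        for _ in range(index):
            next(it)
    yield from zip(*iterables)

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = re.findall(r'\w+', sentence)

# zip already yields tuples, so they can be counted without any conversion.
trigram_counts = Counter(nwise(words, 3))

# Join to a string only for the entries that pass the frequency filter.
repeated = {" ".join(trigram): count
            for trigram, count in trigram_counts.items() if count > 1}

print(repeated)
```

Note that this works as intended when words is a list (or another re-iterable sequence); passing a one-shot iterator would make all n internal iterators share the same underlying stream.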

0

Try using zip(words, words[1:], words[2:])

Example:

from collections import Counter
import re

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = re.findall(r'\w+', sentence)

three_words = [' '.join(ws) for ws in zip(words, words[1:], words[2:])]
wordscount = {w:f for w, f in Counter(three_words).most_common() if f > 1}
print( wordscount )

Output:

{'show makes me': 2}
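The same zip pattern can be generalized to any phrase length by building the shifted slices in a loop. This is only a sketch: the helper names ngrams and repeated_phrases are hypothetical and do not come from the answer.

```python
from collections import Counter
import re

def ngrams(words, n):
    """Yield tuples of n consecutive words, using zip over shifted slices."""
    # words[0:], words[1:], ..., words[n-1:]; zip stops at the shortest slice.
    return zip(*(words[i:] for i in range(n)))

def repeated_phrases(words, n):
    """Return {phrase: count} for n-word phrases that occur more than once."""
    counts = Counter(ngrams(words, n))
    return {' '.join(gram): freq for gram, freq in counts.items() if freq > 1}

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = re.findall(r'\w+', sentence)

print(repeated_phrases(words, 2))
print(repeated_phrases(words, 3))
```

With n=2 this reproduces the two-word result from the question, and with n=3 the three-word result shown above.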

0

How about this:

from collections import Counter

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"
words = sentence.split()
# len(words) - 2 windows, so the final three-word phrase is included
r = Counter([' '.join(words[i:i+3]) for i in range(len(words)-2)])

>>> r.most_common()[0] #get the most common 3-words
('show makes me', 2)
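Two details of the slicing approach are worth checking: a list of N words has N - 2 three-word windows, and sentence.split() keeps punctuation attached to words, unlike the re.findall(r'\w+', ...) tokenization used in the question. The sketch below uses made-up variable names to show both.

```python
import re

sentence = "I love TV show makes me happy, I love also comedy show makes me feel like flying"

split_words = sentence.split()              # splits on whitespace only
regex_words = re.findall(r'\w+', sentence)  # keeps only word characters

# One window per start index 0 .. len - 3, i.e. len - 2 windows in total.
windows = [' '.join(split_words[i:i + 3]) for i in range(len(split_words) - 2)]

print(len(split_words), len(windows))
print(windows[-1])

# split() leaves the comma attached, so "happy," and "happy" are different tokens.
print('happy,' in split_words, 'happy' in regex_words)
```

For this particular sentence the repeated phrase is unaffected by the comma, but on other input the two tokenizations can give different counts.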
