How to count bigrams from user input in Python 3?

3

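I'm stuck and need some guidance. I'm working hard to learn Python on my own using Grok Learning. Below are the problem, the sample output, and where I am with my code. I'd appreciate any tips that help me solve it.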

In linguistics, a bigram is a pair of adjacent words in a sentence. The sentence "The big red ball." has three bigrams: The big, big red, and red ball.

Write a program to read in multiple lines of input from the user, where each line is a space-separated sentence of words. Your program should then count up how many times each of the bigrams occur across all input sentences. The bigrams should be treated in a case insensitive manner by converting the input lines to lowercase. Once the user stops entering input, your program should print out each of the bigrams that appear more than once, along with their corresponding frequencies. For example:

Line: The big red ball
Line: The big red ball is near the big red box
Line: I am near the box
Line: 
near the: 2
red ball: 2
the big: 3
big red: 3

I haven't gotten very far with my code and am currently stuck. Here is what I have so far:

words = set()
line = input("Line: ")
while line != '':
  words.add(line)
  line = input("Line: ")

Am I on the right track? Please try not to import any modules and use only built-in features.

Thanks, Jeff


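Hi @Jeff, when working on problems like this, don't think about the actual code at first. Try describing the steps in plain English. Step 1: read the input lines. Step 2: split each line into bigrams. Step 3: count the bigrams. Coding is hard until you have an outline of what needs to be done. Your first piece of code almost completes step 1, reading the input. See inspectorG4dget's answer for step 1. - nelaaro
OK, I've made more progress now than I had a few days ago. Progress is progress :) Thanks!! - Jeff Singleton
4 Answers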

6

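Let's start with a function that takes a sentence (with punctuation) and returns a list of all its lowercase bigrams.

So, we first need to remove all non-alphanumeric characters from the sentence, convert all letters to lowercase, and then split the sentence on whitespace into a list of words: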

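import re

def bigrams(sentence):
    # Raw string, so that \W reaches the regex engine unchanged.
    text = re.sub(r'\W', ' ', sentence.lower())
    words = text.split()
    # In Python 3, zip returns an iterator, so build the list explicitly.
    return list(zip(words, words[1:]))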

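We use the standard-library re module to replace non-alphanumeric characters with spaces via a regular expression (\W matches anything other than letters, digits, and the underscore), and the built-in zip function to pair up consecutive words. (We pair the word list with the same list shifted by one element.)
Now we can test it: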
>>> bigrams("The big red ball")
[('the', 'big'), ('big', 'red'), ('red', 'ball')]
>>> bigrams("THE big, red, ball.")
[('the', 'big'), ('big', 'red'), ('red', 'ball')]
>>> bigrams(" THE  big,red,ball!!?")
[('the', 'big'), ('big', 'red'), ('red', 'ball')]
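
To make the pairing step concrete, here is a minimal sketch of what zip(words, words[1:]) does on its own, with no regular expression involved. The variable names are illustrative only.

```python
words = "the big red ball".split()

# words[1:] is the same list with its first element dropped,
# so position i of `shifted` holds the word that follows words[i].
shifted = words[1:]

# zip stops at the shorter input, so the last word never starts a pair.
pairs = list(zip(words, shifted))
print(pairs)

# A one-word sentence has no neighbour to pair with, so it yields no bigrams.
single = list(zip(["hello"], ["hello"][1:]))
print(single)
```

Because zip stops at the shorter of its inputs, sentences with fewer than two words simply contribute nothing and need no special case.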

Next, to count the bigrams found in each sentence, you can use collections.Counter.

For example, like this:

from collections import Counter

counts = Counter()
for line in ["The big red ball", "The big red ball is near the big red box", "I am near the box"]:
    counts.update(bigrams(line))

We get:
>>> counts
Counter({('the', 'big'): 3, ('big', 'red'): 3, ('red', 'ball'): 2, ('near', 'the'): 2, ('red', 'box'): 1, ('i', 'am'): 1, ('the', 'box'): 1, ('ball', 'is'): 1, ('am', 'near'): 1, ('is', 'near'): 1})

Now we just need to print the ones that appear more than once:
for bigr, cnt in counts.items():
    if cnt > 1:
        print("{0[0]} {0[1]}: {1}".format(bigr, cnt))
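
If a predictable order is wanted, Counter.most_common() returns the entries sorted by count, highest first. The sketch below uses hand-written sample counts and a hypothetical helper named repeated. Whether the exercise expects any particular order is not stated in this thread, so treat the ordering as an assumption.

```python
from collections import Counter

# Hypothetical sample data standing in for real bigram counts.
counts = Counter({
    ("the", "big"): 3,
    ("big", "red"): 3,
    ("red", "ball"): 2,
    ("red", "box"): 1,
})

def repeated(counts):
    """Return 'first second: n' lines for bigrams seen more than once,
    ordered from the highest count to the lowest."""
    lines = []
    for (first, second), cnt in counts.most_common():
        if cnt > 1:
            lines.append("{} {}: {}".format(first, second, cnt))
    return lines

result = repeated(counts)
for line in result:
    print(line)
```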

Putting it all together, with a loop reading the user's input instead of a fixed list:
import re
from collections import Counter

def bigrams(sentence):
    # Raw string, so that \W reaches the regex engine unchanged.
    text = re.sub(r'\W', ' ', sentence.lower())
    words = text.split()
    return list(zip(words, words[1:]))

counts = Counter()
while True:
    line = input("Line: ")
    if not line:
        break
    counts.update(bigrams(line))

for bigr, cnt in counts.items():
    if cnt > 1:
        print("{0[0]} {0[1]}: {1}".format(bigr, cnt))

Output:
Line: The big red ball
Line: The big red ball is near the big red box
Line: I am near the box
Line: 
near the: 2
red ball: 2
big red: 3
the big: 3

1
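+1 again for making the most of the included batteries. - 9000
The only problem is that Grok Learning doesn't like importing modules such as re. They want me to learn by using the built-in features. Thank you very much for your help; I'll do my best to learn from it. - Jeff Singleton
@JeffSingleton, re is only used to strip punctuation, so you can skip that part if you don't need it (although I wanted my answer to show that preprocessing matters too). Counter can also be (re)implemented trivially, but the beauty of Python really lies in having all these goodies available. That's why Python's motto is "batteries included". - randomir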

2
counts = {}
while True:
    line = input("Line: ").strip().lower()
    if not line:
        break
    words = line.split()
    # Pair each word with the one after it; working line by line
    # keeps a bigram from spanning two sentences.
    for t in zip(words, words[1:]):
        counts[t] = counts.get(t, 0) + 1

Thanks @inspectorG4dget. I asked for guidance and you gave it. I'm still working through the problem, but this got me over the hump. - Jeff Singleton

0
usr_input = "Here is a sentence without multiple bigrams. Without multiple bigrams, we cannot test a sentence."

def get_bigrams(word_string):
    words = [word.lower().strip(',.') for word in word_string.split(" ")]
    pairs = ["{} {}".format(w, words[i+1]) for i, w in enumerate(words) if i < len(words) - 1]
    bigrams = {}

    for bg in pairs:
        if bg not in bigrams:
            bigrams[bg] = 0
        bigrams[bg] += 1
    return bigrams

print(get_bigrams(usr_input))
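
The exercise only asks for bigrams that occur more than once, whereas the snippet above prints the whole dictionary. As a rough sketch, the result can be filtered afterwards. The name repeated is introduced here for illustration, and get_bigrams is restated (using dict.get for the tally) so the block runs on its own.

```python
def get_bigrams(word_string):
    # Same approach as above: lowercase, strip ',' and '.', pair neighbours.
    words = [word.lower().strip(',.') for word in word_string.split(" ")]
    pairs = ["{} {}".format(w, words[i + 1])
             for i, w in enumerate(words) if i < len(words) - 1]
    bigrams = {}
    for bg in pairs:
        bigrams[bg] = bigrams.get(bg, 0) + 1
    return bigrams

usr_input = ("Here is a sentence without multiple bigrams. "
             "Without multiple bigrams, we cannot test a sentence.")

# Keep only the bigrams counted more than once.
repeated = {bg: n for bg, n in get_bigrams(usr_input).items() if n > 1}

for bg, n in repeated.items():
    print("{}: {}".format(bg, n))
```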

0

Using only what is taught in the earlier modules of the Grok Learning Python course the OP mentioned, this code does the required task nicely.

counts = {} # this creates a dictionary for the bigrams and the tally for each one
n = 2
a = input('Line: ').lower().split() # the input is converted into lowercase, then split into a list
while a:
  for x in range(n, len(a)+1):
    b = tuple(a[x-2:x]) # the input gets sliced into pairs of two words (bigrams)
    counts[b] = counts.get(b,0) + 1 # adding the bigrams as keys to the dictionary, with their count value set to 1 initially, then increased by 1 thereafter
  a = input('Line: ').lower().split()  
for c in counts:
  if counts[c] > 1: # tests if the bigram occurs more than once
    print(' '.join(c) + ':', counts[c]) # prints the bigram (making sure to convert the key from a tuple into a string), with the count next to it

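Note: you may need to scroll right to see the comments on the code in full.
It's simple and doesn't require importing anything. I realize I'm late to the discussion, but I hope others taking the same course, or facing a similar problem, will find this answer useful.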
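
For trying the same slicing idea without typing input each time, one option is to wrap it in a function that takes the lines as a list. This is only a sketch: count_ngrams and its parameter n are names made up here, and the sample lines are the ones from the question.

```python
def count_ngrams(lines, n=2):
    """Count n-word tuples within each line; lines is any iterable of strings."""
    counts = {}
    for line in lines:
        words = line.lower().split()
        # Slide a window of n words across the line.
        for end in range(n, len(words) + 1):
            gram = tuple(words[end - n:end])
            counts[gram] = counts.get(gram, 0) + 1
    return counts

sample = [
    "The big red ball",
    "The big red ball is near the big red box",
    "I am near the box",
]

counts = count_ngrams(sample)
for gram in counts:
    if counts[gram] > 1:
        print(' '.join(gram) + ':', counts[gram])
```

With n left at 2 this reproduces the bigram counts from the question's example; the printed order follows the order in which each bigram was first seen.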
