Create a dictionary for each word in a file and count the frequency of the words that follow it.

Question

9

I'm trying to solve a difficult problem and I'm lost.

Here is my assignment:

INPUT: file
OUTPUT: dictionary

Return a dictionary whose keys are all the words in the file (broken by
whitespace). The value for each word is a dictionary containing each word
that can follow the key and a count for the number of times it follows it.

You should lowercase everything.
Use strip and string.punctuation to strip the punctuation from the words.

Example:
>>> #example.txt is a file containing: "The cat chased the dog."
>>> with open('../data/example.txt') as f:
...     word_counts(f)
{'the': {'dog': 1, 'cat': 1}, 'chased': {'the': 1}, 'cat': {'chased': 1}}

Here is what I have so far, which at least tries to pull out the right words:

def word_counts(f):
    i = 0
    orgwordlist = f.split()
    for word in orgwordlist:
        if i<len(orgwordlist)-1:
            print orgwordlist[i]
            print orgwordlist[i+1]

with open('../data/example.txt') as f:
    word_counts(f)

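I think I need to use the .count method somehow and eventually zip some dictionaries together, but I'm not sure how to count the second words for each first word.

I know I'm a long way from solving the problem, but I'm trying to take it one step at a time. Any help would be appreciated, even just a hint pointing in the right direction.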

- Kristie

1

f.split(): f is a file handle, not a string. - Willem Van Onsem

5 Answers

5

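We can do this in a single pass:

Use a defaultdict as the counter.
Iterate over the bigrams, that is, pairs of adjacent words, counting in place.

So... for brevity, we'll skip the normalization and cleaning steps: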

>>> from collections import defaultdict
>>> counter = defaultdict(lambda: defaultdict(int))
>>> s = 'the dog chased the cat'
>>> tokens = s.split()
>>> from itertools import islice
>>> for a, b in zip(tokens, islice(tokens, 1, None)):
...     counter[a][b] += 1
...
>>> counter
defaultdict(<function <lambda> at 0x102078950>, {'the': defaultdict(<class 'int'>, {'cat': 1, 'dog': 1}), 'dog': defaultdict(<class 'int'>, {'chased': 1}), 'chased': defaultdict(<class 'int'>, {'the': 1})})

And for more readable output:

>>> {k:dict(v) for k,v in counter.items()}
{'the': {'cat': 1, 'dog': 1}, 'dog': {'chased': 1}, 'chased': {'the': 1}}
>>>
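The normalization skipped above could look roughly like the following. This is only a sketch: normalize and count_followers are hypothetical helper names, and it assumes the assignment's hint (lowercase, then strip string.punctuation from each whitespace-separated word) is all the cleaning that is needed.

```python
import string
from collections import defaultdict

def normalize(text):
    """Lowercase, split on whitespace, and strip leading/trailing punctuation."""
    words = (w.strip(string.punctuation) for w in text.lower().split())
    # Drop tokens that consisted only of punctuation, such as a stray "--".
    return [w for w in words if w]

def count_followers(tokens):
    """Count, for each token, how often each other token directly follows it."""
    counter = defaultdict(lambda: defaultdict(int))
    for a, b in zip(tokens, tokens[1:]):
        counter[a][b] += 1
    # Convert the nested defaultdicts into plain dicts for readable output.
    return {k: dict(v) for k, v in counter.items()}

result = count_followers(normalize("The cat chased the dog."))
print(result)
```

Note that strip only removes punctuation at the ends of a word, so an apostrophe inside "don't" is left alone.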

- juanpa.arrivillaga

1

"the cat chased the dog" is the correct sentence :D - MooingRawr

1

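First of all, that is one brave cat, chasing a dog! Secondly, this is a bit tricky because we don't interact with this type of parsing every day. Here is the code: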

k = "The cat chased the dog."
sp = k.split()
res = {}      # maps each word to a dict of {following word: count}
prev = ''     # previous word; empty until the first word has been seen
for w in sp:
    word = w.lower().replace('.', '')  # lowercase and drop periods
    if prev in res:
        if word in res[prev]:
            res[prev][word] += 1
        else:
            res[prev][word] = 1
    elif prev != '':
        res[prev] = {word: 1}
    prev = word
print(res)

- cookiedough

1

"because we don't interact with this type of parsing every day." Don't we? This is basic processing in natural language processing. It is called a 2-gram. - Willem Van Onsem

1

Thanks! Now I know! - cookiedough

@WillemVanOnsem, also known as a bigram - juanpa.arrivillaga

1

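You can:

Create a list of the words with the punctuation stripped;
Create word pairs with zip(list_, list_[1:]) or any other method that iterates by pairs;
Create a dict that maps the first word of each pair to a list of the words that follow it;
Count the words in each list.

Like so: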

from collections import Counter
s="The cat chased the dog."
li=[w.lower().strip('.,') for w in s.split()] # list of the words
di={}                                         
for a,b in zip(li,li[1:]):                    # words by pairs
    di.setdefault(a,[]).append(b)             # list of the words following first

di={k:dict(Counter(v)) for k,v in di.items()} # count the words
>>> di
{'the': {'dog': 1, 'cat': 1}, 'chased': {'the': 1}, 'cat': {'chased': 1}}

If you have a file, just read the file into a string and proceed in the same way.
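For the file case, a minimal sketch might wrap the steps above in the word_counts(f) function the assignment asks for. Here io.StringIO is only a stand-in for the real file so the example runs on its own, and string.punctuation replaces the shorter '.,' as an assumption about what the assignment wants.

```python
import io
import string
from collections import Counter

def word_counts(f):
    # f is an open file object, so read its contents into one string first.
    text = f.read()
    li = [w.lower().strip(string.punctuation) for w in text.split()]
    di = {}
    for a, b in zip(li, li[1:]):
        di.setdefault(a, []).append(b)   # list of the words following a
    return {k: dict(Counter(v)) for k, v in di.items()}

# io.StringIO stands in for open('../data/example.txt') in this sketch.
fake_file = io.StringIO("The cat chased the dog.")
file_result = word_counts(fake_file)
print(file_result)
```

With a real file the call would be the one from the question: open the file in a with block and pass the handle to word_counts.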

Alternatively, you can:

Do the same first two steps;
Use a defaultdict with Counter as the factory.

Like so:

from collections import Counter, defaultdict
li=[w.lower().strip('.,') for w in s.split()]
dd=defaultdict(Counter)
for a,b in zip(li, li[1:]):
    dd[a][b]+=1

>>> dict(dd)
{'the': Counter({'dog': 1, 'cat': 1}), 'chased': Counter({'the': 1}), 'cat': Counter({'chased': 1})}

Or,

>>> {k:dict(v) for k,v in dd.items()}   
{'the': {'dog': 1, 'cat': 1}, 'chased': {'the': 1}, 'cat': {'chased': 1}}

- dawg

0

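I think this is a single-pass solution that does not need to import defaultdict. It also strips punctuation. I have tried to optimize it for large files or for opening files repeatedly.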

from itertools import islice

class defaultdictint(dict):
    def __missing__(self,k):
        r = self[k] = 0
        return r

class defaultdictdict(dict):
    def __missing__(self,k):
        r = self[k] = defaultdictint()
        return r

keep = set('1234567890abcdefghijklmnopqrstuvwxyz ABCDEFGHIJKLMNOPQRSTUVWXYZ')

def count_words(file):
    d = defaultdictdict()
    with open(file,"r") as f:
        for line in f:
            line = ''.join(filter(keep.__contains__,line)).strip().lower().split()
            for one,two in zip(line,islice(line,1,None)):
                d[one][two] += 1
    return d

print (count_words("example.txt"))

This will output:

{'chased': {'the': 1}, 'cat': {'chased': 1}, 'the': {'dog': 1, 'cat': 1}}
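One thing to be aware of: because the loop above pairs words within each line, a word at the end of one line and the first word of the next line are not counted as a pair. If the file spans several lines and those pairs should count, a possible variation is to carry the previous word across lines. This is a sketch only; count_words_stream is a hypothetical name, and it takes any iterable of lines (an open file works) rather than a filename.

```python
import io
import string

def count_words_stream(lines):
    """Count followers line by line, including pairs that span a line break."""
    counts = {}
    prev = None  # last word seen so far, carried over from line to line
    for line in lines:
        for raw in line.lower().split():
            word = raw.strip(string.punctuation)
            if not word:
                continue  # token was only punctuation
            if prev is not None:
                followers = counts.setdefault(prev, {})
                followers[word] = followers.get(word, 0) + 1
            prev = word
    return counts

# io.StringIO stands in for an open multi-line file.
stream_result = count_words_stream(io.StringIO("The cat chased\nthe dog.\n"))
print(stream_result)
```

Only one line is held in memory at a time, so this keeps the large-file behaviour of the original while also counting the pair ("chased", "the") that straddles the line break.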

- ragardner

1

You are not closing the file you opened. - Matias Cicero

@MatiasCicero Ah, I was wondering whether that was a problem. It's fixed now, I think. - ragardner

- Willem Van Onsem · Accepted Answer

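We can solve this with a "two-pass" approach:

In the first pass, we build a `Counter` with `zip(..)` that counts the tuples of two consecutive words;
then we convert that `Counter` into a dictionary of dictionaries.

This results in the following code: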

from collections import Counter, defaultdict

def word_counts(f):
    st = f.read().lower().split()
    ctr = Counter(zip(st,st[1:]))
    dc = defaultdict(dict)
    for (k1,k2),v in ctr.items():
        dc[k1][k2] = v
    return dict(dc)
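To make the two passes easier to follow, here is a small sketch that runs them separately on a made-up sentence and prints the intermediate Counter. The variable names pair_counts and nested are chosen for illustration only.

```python
from collections import Counter, defaultdict

st = "the cat chased the dog the cat".split()

# First pass: count each (word, next word) tuple.
pair_counts = Counter(zip(st, st[1:]))
print(pair_counts[('the', 'cat')])

# Second pass: regroup the flat tuple counts into a dict of dicts.
nested = defaultdict(dict)
for (first, second), n in pair_counts.items():
    nested[first][second] = n
nested = dict(nested)
print(nested)
```

Note that word_counts as written lowercases the text but does not strip punctuation, so a token such as 'dog.' keeps its period unless each word is also passed through strip(string.punctuation), as the assignment suggests.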