How do I find sequences of words in Python?

Question

4

I have a large text file, example.txt, like this one:
http://www.fullbooks.com/The-Jacket-Star-Rover-1.html
Using this awk command:

cat example.txt | awk '{ print substr($0, index($0,$3)) }' | tr -sc "[A-Z][a-z][0-9]'" '[\012*]' | awk -- 'first!=""&&second!="" { print first,second,$0; } { first=second; second=$0; }' | sort | uniq -c | sort -nr | head -n20

the output is a ranking of the 20 most frequent sequences of three consecutive words:

 13 in the jacket
 11 I was a
 10 of the Yard
 10 me in the
  8 Captain of the
  7 times and places
  7 the Captain of
  7 in the prison
  7 in the dungeons
  7 in San Quentin
  7 I had been
  6 other times and
  6 hours in the
  6 are going to
  5 twenty four hours
  5 to take me
  5 the rest of
  5 the forty lifers
  5 the Board of
  5 that I had

Starting from:

with open('example.txt') as raw:
    # Replace newlines with a space so words on adjacent lines are not glued together
    text = raw.read().replace('\n', ' ')
words = text.split()
...............

How can I get the same result with Python 3?

- colbalt011

Words or phrases? - bluszcz

Could you explain this more clearly, so that people who don't know awk and its syntax can follow? - Lupanoide

2 Answers

0

You can try this simple implementation:

import re

frequency = {}
with open('example.txt') as raw:
    # Split on runs of non-word characters, lowercase, and drop empty strings
    words = [word.lower() for word in re.split(r"\W+", raw.read()) if word]

# Count every run of three consecutive words
for index in range(len(words) - 2):
    triplet = (words[index], words[index+1], words[index+2])
    frequency[triplet] = frequency.get(triplet, 0) + 1

# Print each triplet with its count (in insertion order, not ranked)
for triplet, rank in frequency.items():
    print(triplet, rank)
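
The final loop above prints the triplets in the order they were first seen, not by rank. A minimal sketch of ranking them and printing the top entries in a layout similar to the `uniq -c` output from the question; the helper name `top_triplets` and the small sample dict are illustrative assumptions, not part of the answer:

```python
def top_triplets(frequency, n=20):
    """Return the n most frequent (triplet, count) pairs, highest count first."""
    # Sort by count descending; ties are broken alphabetically by triplet.
    ranked = sorted(frequency.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:n]

# Hypothetical counts standing in for the dict built by the loop above.
frequency = {
    ("in", "the", "jacket"): 3,
    ("i", "was", "a"): 2,
    ("of", "the", "yard"): 2,
    ("me", "in", "the"): 1,
}

for triplet, count in top_triplets(frequency, n=3):
    # Right-align the count, then print the three words separated by spaces.
    print(f"{count:4d} {' '.join(triplet)}")
```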

- jimidime

- Jean-François Fabre · Accepted Answer

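This is a nice variation on counting word frequencies, but it is not very different. I would:

read the file and split it as you did
build the triplets and feed them to collections.Counter (as tuples, so they are hashable)
filter and sort to show the triplets that occur at least 5 times

Like this: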

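import collections

with open('example.txt') as raw:
    words = raw.read().split()

# There are len(words) - 2 start positions, so the last triplet is included
c = collections.Counter(tuple(words[i:i+3]) for i in range(len(words) - 2))
# most_common() yields (triplet, count) pairs sorted by descending count
for triplet, count in c.most_common():
    if count >= 5:
        print((triplet, count))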
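
To mirror the `head -n20` step of the awk pipeline instead of using a count threshold, `Counter.most_common(20)` can be used. A sketch under that assumption; the function name `top_ngrams` and the inline sample text are hypothetical, chosen only so the snippet runs without the book file:

```python
import collections

def top_ngrams(words, n=3, limit=20):
    """Count every run of n consecutive words and return the `limit` most common."""
    # Zipping n staggered views of the list yields each sliding window as a tuple.
    windows = zip(*(words[i:] for i in range(n)))
    return collections.Counter(windows).most_common(limit)

sample = "in the jacket and in the jacket again in the prison".split()
for ngram, count in top_ngrams(sample, n=3, limit=3):
    print(count, " ".join(ngram))
```

Because `zip` stops at the shortest input, a list with fewer than `n` words simply produces no windows.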

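Note that splitting with str.split() alone does not work well when there is punctuation (for example, "Hello, World" is split into "Hello," and "World"). It is better to use a regular expression and split on non-word characters: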

words = [x for x in re.split(r"\W+", raw.read()) if x]  # needs: import re
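
Splitting on `\W` also breaks contractions apart, which is why triplets such as ('I', 'don', 't') appear in the results below. The `tr` step in the question's pipeline appears to keep apostrophes inside words; a sketch of a closer match using `re.findall` (the name `tokenize` is an illustrative assumption, and the character class is ASCII-only):

```python
import re

def tokenize(text):
    """Return runs of ASCII letters, digits and apostrophes."""
    return re.findall(r"[A-Za-z0-9']+", text)

print(tokenize("Hello, World! I don't know."))
```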

This gives me more matches than plain str.split:

(('in', 'the', 'jacket'), 19)
(('of', 'the', 'Yard'), 13)
(('Captain', 'of', 'the'), 12)
(('I', 'was', 'a'), 12)
(('me', 'in', 'the'), 11)
(('in', 'the', 'prison'), 11)
(('in', 'the', 'dungeons'), 10)
(('hours', 'in', 'the'), 9)
(('in', 'San', 'Quentin'), 9)
(('I', 'don', 't'), 8)
(('He', 'was', 'a'), 8)
(('are', 'going', 'to'), 8)
(('I', 'had', 'been'), 7)
(('I', 'have', 'been'), 7)
(('in', 'order', 'to'), 7)
(('times', 'and', 'places'), 7)
(('five', 'pounds', 'of'), 7)
(('and', 'I', 'have'), 7)
(('the', 'Captain', 'of'), 7)
(('Darrell', 'Standing', 's'), 6)
(('I', 'did', 'not'), 6)
(('five', 'years', 'of'), 6)
(('Warden', 'Atherton', 'and'), 6)
(('Board', 'of', 'Directors'), 6)
(('thirty', 'five', 'pounds'), 6)
(('that', 'I', 'had'), 6)
(('pounds', 'of', 'dynamite'), 6)
(('other', 'times', 'and'), 6)
(('of', 'San', 'Quentin'), 5)
(('the', 'forty', 'lifers'), 5)
(('and', 'Captain', 'Jamie'), 5)
(('I', 'Darrell', 'Standing'), 5)
(('in', 'the', 'dungeon'), 5)
(('going', 'to', 'take'), 5)
...

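To merge words that differ only because they start a sentence (for example "in the woods" and "In the woods"), we can optionally convert the words to lowercase, which changes the counts.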
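
A small sketch of that lowercase variant; the sample sentence is made up for illustration:

```python
import collections
import re

text = "In the woods we walked. Later, in the woods, we rested."
# Lowercase first so "In the woods" and "in the woods" count as the same triplet.
words = [w for w in re.split(r"\W+", text.lower()) if w]
counts = collections.Counter(tuple(words[i:i+3]) for i in range(len(words) - 2))
print(counts[("in", "the", "woods")])  # 2
```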