Python - Finding repeated words in a text file

Question

3

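I was wondering if you could help me with a Python programming problem. I am trying to write a program that reads a text file and outputs "word 1 True" if the word has already appeared earlier in the file, or "word 1 False" if this is the first time the word appears.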

Here is my code:

fh = open(fname)
lst = list ()
for line in fh:
    words = line.split()
    for word in words:
        if word in words:
            print("word 1 True", word)
        else:
            print("word 1 False", word)

However, it only ever prints "word 1 True".

Please advise.

Thanks!

- Sketch0482

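You need an additional set to look up whether a word has already been seen, adding the word to the set when it has not. - Michael Butscher

Every word in words is going to be in words, so the test is just an expensive way of writing if True:. If you are looking for duplicates, you need to count. - ShadowRanger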
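
The two comments above can be sketched roughly as follows. This is only a minimal illustration, not code posted by either commenter: the function name flag_repeats and the sample lines are made up for the example, and it assumes that "word" means a whitespace-separated token compared case-sensitively.

```python
def flag_repeats(lines):
    """Yield (word, seen_before) for every word, in reading order."""
    seen = set()  # every word encountered so far
    for line in lines:
        for word in line.split():
            seen_before = word in seen  # test against earlier words only
            seen.add(word)
            yield word, seen_before


# Hypothetical sample input; an open file object would work the same way,
# because iterating over a file yields its lines.
sample = ["the cat saw the dog", "the dog ran"]
for word, seen_before in flag_repeats(sample):
    print(word, seen_before)
```

The key difference from the code in the question is that the membership test runs against a collection that persists across lines and is filled as the loop goes, rather than against the list that the word was just taken from.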

4 Answers

2

You may also want to keep track of the earlier positions, something like this:

with open(fname) as fh:
    vocab = {}  # maps each word to a list of (line, position) pairs
    for i, line in enumerate(fh):
        words = line.split()
        for j, word in enumerate(words):
            if word in vocab:
                locations = vocab[word]
                print(word, "occurs at", locations)
                locations.append((i, j))
            else:
                vocab[word] = [(i, j)]
                # print("First occurrence of", word)

- khachik

That's very Pythonic ;) - Juggernaut

1

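This snippet does not use a file, but it is easy to test and learn from. The main difference is that you would have to load the file and read it line by line, as in your example.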

example_file = """
This is a text file example

Let's see how many times example is typed.

"""
result = {}
words = example_file.split()
for word in words:
    # get() returns 0 for a word not yet in the dictionary, so its count starts at 1
    result[word] = result.get(word, 0) + 1
for word, occurrence in result.items():
    print("word:%s; occurrence:%s" % (word, occurrence))

Update:

As @khachik suggested, a better solution is to use Counter.

>>> # Find the ten most common words in Hamlet
>>> import re
>>> from collections import Counter
>>> words = re.findall(r'\w+', open('hamlet.txt').read().lower())
>>> Counter(words).most_common(10)
[('the', 1143), ('and', 966), ('to', 762), ('of', 669), ('i', 631),
 ('you', 554),  ('a', 546), ('my', 514), ('hamlet', 471), ('in', 451)]

- Karim N Gorjux

If that's what you want to do, you are better off using collections.Counter. - khachik

Thanks @khachik, I didn't know about Counter. Thanks. - Karim N Gorjux
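
As a follow-up to the Counter suggestion, here is a small sketch of how it might be used to report only the words that occur more than once. The helper name repeated_words and the sample sentence are assumptions made for illustration. Like the Hamlet example, it lowercases the text and treats any run of \w characters as a word, which may or may not be the tokenization you want.

```python
import re
from collections import Counter


def repeated_words(text):
    """Return {word: count} for the words that occur more than once."""
    words = re.findall(r"\w+", text.lower())  # same tokenization as the Hamlet example
    counts = Counter(words)                   # one pass over the word list
    return {word: n for word, n in counts.items() if n > 1}


# Hypothetical sample text, used instead of a file to keep the sketch self-contained.
sample = "The cat saw the dog. The dog ran."
print(repeated_words(sample))
```

Note that this answers "which words are duplicated anywhere in the text", not "has this particular occurrence been seen before", so it does not distinguish the first occurrence of a word from later ones.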

1

Following your approach, you could do this:

with open('tyger.txt', 'r') as f:
    lines = f.read().split()
    for word in lines:
        if lines.count(word) > 1:
            print(f"{word}: True")
        else:
            print(f"{word}: False")

Output

(xenial)vash@localhost:~/python/stack_overflow$ python3.7 read_true.py
When: False
the: True
stars: False
threw: False
down: False
their: True
spears: False
...

You can also count each word:

with open('tyger.txt', 'r') as f:
    count = {}
    lines = f.read()
    lines = lines.split()
    for i in lines:
        count[i] = lines.count(i)
    print(count)

Output

{'When': 1, 'the': 2, 'stars': 1, 'threw': 1, 'down': 1, 'their': 2,
'spears': 1, 'And': 1, "water'd": 1, 'heaven': 1, 'with': 1, 'tears:':
1, 'Did': 2, 'he': 2, 'smile': 1, 'his': 1, 'work': 1, 'to': 1,
'see?': 1, 'who': 1, 'made': 1, 'Lamb': 1, 'make': 1, 'thee?': 1}

You can then use the dictionary like this:

for k in count:
    if count[k] > 1:
        print(f"{k}: True")
    else:
        print(f"{k}: False")

Output

When: False
the: True
stars: False
threw: False
down: False
their: True
spears: False

- vash_the_stampede

- Kingsley · Accepted Answer

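A simple (and fast) way to implement this is with a Python dictionary. Dictionaries can be thought of as similar to arrays, except that the index keys are strings rather than numbers.

This leads to a code snippet such as: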

found_words = {}    # empty dictionary
with open("words1.txt", "rt") as f:
    words1 = f.read().split()  # split on any whitespace; TODO - handle punctuation
for word in words1:
    if word in found_words:
        print(word + " already in file")
    else:
        found_words[word] = True    # could be set to anything

Now, as you process your words, simply checking whether a word is already in the dictionary tells you whether it has been seen before.
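
The snippet above leaves punctuation handling as a TODO. One possible way to approach it is sketched below. It assumes it is acceptable to lowercase each token and strip ASCII punctuation from its ends only. The helper names normalize and report_repeats are hypothetical, and a set is used in place of the dictionary since only membership matters.

```python
import string


def normalize(token):
    """Lowercase a token and strip leading/trailing ASCII punctuation."""
    return token.strip(string.punctuation).lower()


def report_repeats(text):
    """Return the normalized words that had already been seen, in order."""
    found_words = set()
    repeats = []
    for token in text.split():
        word = normalize(token)
        if not word:
            continue  # the token consisted of punctuation only
        if word in found_words:
            repeats.append(word)
        else:
            found_words.add(word)
    return repeats


# Hypothetical sample text instead of words1.txt, to keep the sketch self-contained.
print(report_repeats("Did he smile? Did he, who made the Lamb, make thee?"))
```

Because only the ends of each token are stripped, a word with an internal apostrophe such as water'd is left intact, while "thee?" and "thee" are treated as the same word.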