Python: 寻找与另一个字符串最接近的字符串（��自列表）

Question

Python: 寻找与另一个字符串最接近的字符串（��自列表）

106

假设我有一个字符串 "Hello" 和一个列表

words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo','question', 'Hallo', 'format']

如何在列表words中找到与"Hello"最接近的n个单词？

在这种情况下，我们会有['hello'，'hallo'，'Hallo'，'hi'，'format'...]

因此，策略是将列表words按从最接近单词到最远的顺序排序。

我考虑了类似以下的代码：

word = 'Hello'
for i, item in enumerate(words):
    if lower(item) > lower(word):
      ...

但是在大型列表中非常缓慢。

更新： difflib 可以使用，但在大型列表中也非常缓慢。(单词列表内含 630000+ 单词（已排序，每行一个））。因此，对于每个搜索最接近的单词，检查列表需要 5 到 7 秒钟！

- Laura

9

或许你正在寻找像编辑距离或莱文斯坦距离这样的东西？ - tripleee

4

没问题，这很大程度上取决于你对“最接近”一词的定义。 - Gareth Latty

1

这630,000个单词已经排序了吗？它们是以每行一个单词的方式存储在一个文件中吗？ - Peter Wood

你打算如何定义“最接近”？在你的示例代码中，你使用了字典顺序比较，但是它将“hermitage”排在“jello”之前作为“hello”的更好匹配。 - Nick Johnson

你找到了一个高效的解决方案来处理600万个字典条目吗？我也卡在这里了。 - Nick

5个回答

34

有一篇很棒的文章由Peter Norvig提供了完整的源代码（21行）用于拼写纠正。

http://norvig.com/spell-correct.html

其思路是构建单词的所有可能编辑。

hello - helo   - deletes    
hello - helol  - transpose    
hello - hallo  - replaces    
hello - heallo - inserts    


def edits1(word):
   splits     = [(word[:i], word[i:]) for i in range(len(word) + 1)]
   deletes    = [a + b[1:] for a, b in splits if b]
   transposes = [a + b[1] + b[0] + b[2:] for a, b in splits if len(b)>1]
   replaces   = [a + c + b[1:] for a, b in splits for c in alphabet if b]
   inserts    = [a + c + b     for a, b in splits for c in alphabet]
   return set(deletes + transposes + replaces + inserts)

现在，在你的列表中查找每个编辑。

Peter的文章很棒，值得一读。

- Amjith

谢谢，这是一个很好的发现。我正在编写一个集成词典，这可能会有用。 - laurasia

1

创建一个排序后的单词列表，并使用bisect模块来确定根据排序顺序单词应该放置在排序列表中的位置。基于该位置，您可以提供k个最近邻居以上和以下，以找到2k个最接近的单词。

- user1308520

1

你确定这会起作用吗？我觉得 OP 不想要字典序排序... - Oleh Prypin

考虑问题中的代码片段，这样一个简单的解决方案可能就是所需要的。 - user1308520

1

如果考虑可能的拼写错误，特别是在单词开头犯错误，这种方法将无法正常工作。但对于正确的单词来说，这是一个很好的解决方案。 - laurasia

如果我没记错的话，这应该不起作用。例如：zoom和boom有相似的拼写，但它们将位于您排序列表的两端。你怎么匹配它们？ - hafiz031

1

我在查看这个答案，以获取最接近列表或可能替代项的匹配

difflib.get_close_matches

我发现 Amjith基于Peter Norwig的帖子的答案，认为这可能是一个很好的替代品。在意识到它可能不太适合我的用例之前，我已经把它做成了一个类。因此，这是一个可以用于不同单词集的拼写版本。也许最好的用法可以用于人口名称匹配。

import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

# WORDS = Counter(words(open('big.txt').read()))


class WordMatcher:

    def __init__(self, big_text):
        self.WORDS=Counter(words(big_text))
        self.N = sum(self.WORDS.values())

    def P(self, word):
        "Probability of `word`."
        return self.WORDS[word] / self.N

    def correction(self, word):
        "Most probable spelling correction for word."
        return max(self.candidates(word), key=self.P)

    def candidates(self, word):
        "Generate possible spelling corrections for word."
        return (self.known([word]) or self.known(self.edits1(word)) or self.known(self.edits2(word)) or [word])

    def known(self, words):
        "The subset of `words` that appear in the dictionary of WORDS."
        return set(w for w in words if w in self.WORDS)

    def edits1(self, word):
        "All edits that are one edit away from `word`."
        letters    = 'abcdefghijklmnopqrstuvwxyz'
        splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
        deletes    = [L + R[1:]               for L, R in splits if R]
        transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
        replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
        inserts    = [L + c + R               for L, R in splits for c in letters]
        return set(deletes + transposes + replaces + inserts)

    def edits2(self, word):
        "All edits that are two edits away from `word`."
        return (e2 for e1 in self.edits1(word) for e2 in self.edits1(e1))

- Dark Light

-4

也许堆可以帮助你。

你有一个名为 Heap 的堆，直到它的大小小于 n，你使用函数 close [显示这个字符串是否比另一个字符串更接近] 将单词插入到 Heap 中。

当 n 很小时，这种方法可以帮助你 :)

Heap = []
for word in words:
    if len(Heap)<n:
       Heap.insert(word)
    else
       if close(word,Heap[0]): # it means Heap[0] is the nth farthest word until now
             Heap.pop():
             Heap.insert(word)

- Babak

需要实现“close”。 - fuzzyTew

4

这只是一个用于保存单词的数据结构，它并没有解释有关计算相似性的内容。 - Pablo

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Oleh Prypin · Accepted Answer

167

使用difflib.get_close_matches。

>>> words = ['hello', 'Hallo', 'hi', 'house', 'key', 'screen', 'hallo', 'question', 'format']
>>> difflib.get_close_matches('Hello', words)
['hello', 'Hallo', 'hallo']

请查看文档，因为该函数默认返回3个或更少的最接近匹配项。

- Oleh Prypin

15

简单说明一下：difflib.get_close_matches("Hallo", words, len(words), 0) 可以列出所有匹配项 :) - Niklas B.

6

可以使用Levenshtein差异进行比较。这里有一个很好的Python实现：http://pypi.python.org/pypi/python-Levenshtein/ - Maksym Polshcha

2

difflib可以工作，但速度非常慢（单词列表中包含630000+个单词）。因此，每次搜索最接近的单词需要花费5到7秒钟来检查列表！ - Laura

11

@Laura，"慢"和"花费时间长"是有区别的。7秒可能是很长的时间，但不一定是慢的表现。 - Peter Wood

4

有人能指出difflib背后的算法吗？我很想了解它。 - MAC

显示剩余3条评论