Find the similarity metric between two strings

Question

474

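How do I get a measure of string similarity in Python?

I want a decimal value such as 0.9 (meaning 90%), preferably using standard Python and its standard library.

For example: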

similar("Apple","Appel") #would have a high prob.

similar("Apple","Mango") #would have a lower prob.

- tenstar

12

I don't think "probability" is quite the right term here. In any case, see https://dev59.com/GnRB5IYBdhLWcg3wQFLu. - NPE

5

The word you are looking for is "ratio", not "probability". - Inbar Rose

3

Take a look at Hamming distance. - Diana

5

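The phrase is "similarity metric", but there are several similarity metrics (Jaccard, Cosine, Hamming, Levenshtein, etc.), so you need to specify which one. Specifically, you want a similarity metric between strings; @hbprotoss listed several. - smci

I like the "bigrams" from https://dev59.com/03RB5IYBdhLWcg3wWF8H. - MarkHu

16 answers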

- DRV · Answer 1

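TextDistance:

TextDistance is a Python library for comparing the distance between two or more sequences using many algorithms. It offers:

30+ algorithms
Pure Python implementation
Simple usage
Comparison of more than two sequences
Some algorithms have more than one implementation in one class.
Optional numpy usage for maximum speed.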

Example 1:

import textdistance
textdistance.hamming('test', 'text')

Output:

1

Example 2:

import textdistance

textdistance.hamming.normalized_similarity('test', 'text')

Output:

0.75
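As a rough sketch of what the 0.75 above corresponds to: normalized Hamming similarity is one minus the share of mismatched positions, here 1 - 1/4. This is not TextDistance's own code. The helper names are made up for illustration, and counting the extra characters of a longer string as mismatches is an assumption of this sketch.

```python
from itertools import zip_longest

def hamming_distance(a, b):
    # Count positions whose characters differ; positions past the end of
    # the shorter string count as mismatches (assumption of this sketch).
    return sum(x != y for x, y in zip_longest(a, b))

def normalized_hamming_similarity(a, b):
    longest = max(len(a), len(b))
    if longest == 0:
        return 1.0  # two empty strings are treated as identical
    return 1 - hamming_distance(a, b) / longest

print(hamming_distance('test', 'text'))               # 1
print(normalized_hamming_similarity('test', 'text'))  # 0.75
```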

Thanks and regards!

- George Pipis · Answer 2

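As mentioned above, there are many metrics for defining similarity and distance between strings. I'll add my two cents by showing one example of Jaccard similarity on Q-Grams and one of edit distance.

The libraries: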

from nltk.metrics.distance import jaccard_distance
from nltk.util import ngrams
from nltk.metrics.distance import edit_distance

Jaccard similarity

1-jaccard_distance(set(ngrams('Apple', 2)), set(ngrams('Appel', 2)))

and we get:

0.33333333333333337

And for Apple and Mango:

1-jaccard_distance(set(ngrams('Apple', 2)), set(ngrams('Mango', 2)))

and we get:

0.0
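For readers without NLTK installed, the same numbers can be reproduced with a small standard-library sketch. The function names here are hypothetical, and the q-grams are plain substrings rather than the tuples that nltk.util.ngrams yields, which does not change the set sizes.

```python
def qgrams(s, q=2):
    # Set of all contiguous substrings of length q.
    return {s[i:i + q] for i in range(len(s) - q + 1)}

def jaccard_similarity(a, b, q=2):
    ga, gb = qgrams(a, q), qgrams(b, q)
    union = ga | gb
    if not union:
        return 1.0  # neither string has any q-gram; treat as identical
    # Shared q-grams divided by all distinct q-grams.
    return len(ga & gb) / len(union)

# 'Apple' -> {Ap, pp, pl, le}, 'Appel' -> {Ap, pp, pe, el}: 2 shared out of 6.
print(jaccard_similarity('Apple', 'Appel'))
print(jaccard_similarity('Apple', 'Mango'))
```

The first call gives 2/6 (about 0.333) and the second 0.0, since the two words share no bigram.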

Edit distance

edit_distance('Apple', 'Appel')

and we get:

2

Finally,

edit_distance('Apple', 'Mango')

and we get:

5
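The two edit distances can be checked by hand with a short dynamic-programming sketch of plain Levenshtein distance (insert, delete, substitute, each at cost 1, no transpositions). The function name is made up for illustration and this is not NLTK's implementation.

```python
def levenshtein(a, b):
    # prev[j] holds the distance between the processed prefix of a and b[:j].
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        curr = [i]  # distance from a[:i] to the empty string
        for j, cb in enumerate(b, 1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]

print(levenshtein('Apple', 'Appel'))  # 2
print(levenshtein('Apple', 'Mango'))  # 5
```

'Apple' and 'Mango' share no character (the comparison is case-sensitive), so all five positions have to be substituted.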

Cosine similarity on Q-Grams (q=2)

Another option is the textdistance library. Here is an example of cosine similarity:

import textdistance
1-textdistance.Cosine(qval=2).distance('Apple', 'Appel')

and we get:

0.5
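To see where the 0.5 comes from, here is a minimal sketch that treats each string as a vector of bigram counts and takes the ordinary cosine of the two vectors. This is an illustration under that assumption, with a hypothetical function name; textdistance may compute its value differently internally, although the result agrees for this pair.

```python
from collections import Counter
from math import sqrt

def cosine_qgram_similarity(a, b, q=2):
    # Count every q-gram in each string.
    ca = Counter(a[i:i + q] for i in range(len(a) - q + 1))
    cb = Counter(b[i:i + q] for i in range(len(b) - q + 1))
    # Dot product over shared q-grams, divided by the vector lengths.
    dot = sum(ca[g] * cb[g] for g in ca)
    norm = sqrt(sum(v * v for v in ca.values())) * sqrt(sum(v * v for v in cb.values()))
    return dot / norm if norm else 0.0

# Shared bigrams: 'Ap' and 'pp'; each string has 4 bigrams, so 2 / (2 * 2).
print(cosine_qgram_similarity('Apple', 'Appel'))  # 0.5
```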

- Alex Punnen · Answer 3

Adding the spaCy NLP library to the mix:

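import spacy
import jellyfish
from difflib import SequenceMatcher

# @profile is injected by kernprof (line_profiler) at run time, so it is not imported here.
@profile
def main():
    str1= "Mar 31 09:08:41  The world is beautiful"
    str2= "Mar 31 19:08:42  Beautiful is the world"
    print("NLP Similarity=",nlp(str1).similarity(nlp(str2)))
    print("Diff lib similarity",SequenceMatcher(None, str1, str2).ratio())
    # Note (assumption): recent jellyfish releases name this function jaro_similarity.
    print("Jellyfish lib similarity",jellyfish.jaro_distance(str1, str2))

if __name__ == '__main__':

    #python3 -m spacy download en_core_web_sm
    #nlp = spacy.load("en_core_web_sm")
    nlp = spacy.load("en_core_web_md")
    main()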

Run with Robert Kern's line_profiler:

kernprof -l -v ./python/loganalysis/testspacy.py

NLP Similarity= 0.9999999821467294
Diff lib similarity 0.5897435897435898
Jellyfish lib similarity 0.8561253561253562

However, the timings are revealing:

Function: main at line 32

Line #      Hits         Time  Per Hit   % Time  Line Contents
==============================================================
    32                                           @profile
    33                                           def main():
    34         1          1.0      1.0      0.0      str1= "Mar 31 09:08:41  The world is beautiful"
    35         1          0.0      0.0      0.0      str2= "Mar 31 19:08:42  Beautiful is the world"
    36         1      43248.0  43248.0     99.1      print("NLP Similarity=",nlp(str1).similarity(nlp(str2)))
    37         1        375.0    375.0      0.9      print("Diff lib similarity",SequenceMatcher(None, str1, str2).ratio()) 
    38         1         30.0     30.0      0.1      print("Jellyfish lib similarity",jellyfish.jaro_distance(str1, str2))
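If line_profiler is not at hand, a comparable measurement can be sketched with the standard library's timeit. This only times the difflib call on the two strings from above; the run count is an arbitrary choice and the numbers will differ from machine to machine.

```python
import timeit
from difflib import SequenceMatcher

str1 = "Mar 31 09:08:41  The world is beautiful"
str2 = "Mar 31 19:08:42  Beautiful is the world"

def difflib_ratio():
    # Same call as in main() above.
    return SequenceMatcher(None, str1, str2).ratio()

runs = 200  # arbitrary; raise it for a steadier average
seconds = timeit.timeit(difflib_ratio, number=runs)
print(f"difflib ratio: {seconds / runs * 1e6:.1f} microseconds per call")
```

The spaCy and jellyfish calls could be wrapped in the same way to compare all three.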

- HCLivess · Answer 4

I wrote one for my own purposes that is 2x faster than difflib SequenceMatcher's quick_ratio() while giving similar results. Here a and b are strings:

    score = 0
    for letter in a:
        # add how many times this character of a occurs in b
        score += b.count(letter)
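To make the snippet easier to try, here it is wrapped in a function (the name count_score is made up for illustration) next to quick_ratio() for reference. Note that the loop returns a raw count rather than a value between 0 and 1, so the two numbers are not on the same scale.

```python
from difflib import SequenceMatcher

def count_score(a, b):
    # For every character of a, add how often it occurs in b.
    score = 0
    for letter in a:
        score += b.count(letter)
    return score

print(count_score("Apple", "Appel"))  # 7  (A:1 + p:2 + p:2 + l:1 + e:1)
print(count_score("Apple", "Mango"))  # 0  (case-sensitive, no shared character)
print(SequenceMatcher(None, "Apple", "Appel").quick_ratio())  # 1.0
```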

- David Emmanuel · Answer 5

Here's what I came up with:

import string

def match(a,b):
    a,b = a.lower(), b.lower()
    error = 0
    for i in string.ascii_lowercase:
        error += abs(a.count(i) - b.count(i))
    total = len(a) + len(b)
    return (total-error)/total

if __name__ == "__main__":
    print(match("pple inc", "Apple Inc."))
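In the example above, the lowercased strings differ only in the count of 'a', so error is 1, total is 8 + 10 = 18, and the result is 17/18, roughly 0.944. Spaces and the trailing '.' add to total but never to error, because only ASCII letters are compared. A possible variant, sketched here under the assumption that every character should count (the name match_all_chars is hypothetical), uses collections.Counter:

```python
from collections import Counter

def match_all_chars(a, b):
    a, b = a.lower(), b.lower()
    ca, cb = Counter(a), Counter(b)
    # Sum the count differences over every character seen in either string.
    error = sum(abs(ca[ch] - cb[ch]) for ch in ca.keys() | cb.keys())
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # two empty strings are treated as identical
    return (total - error) / total

# 'a' and '.' each differ by one, so error is 2 and the result is 16/18.
print(match_all_chars("pple inc", "Apple Inc."))
```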

- Weilory · Answer 6

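Python 3.6+: no library imports needed, and it works well in most cases

On Stack Overflow, when you try to add a tag or post a question, it brings up all the related content. That is very convenient, and it is exactly the kind of algorithm I was looking for. So I wrote a queryset similarity filter.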

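def compare(qs, ip):
    al = 2  # score for an exact positional match; also the neighbour search range
    v = 0
    for ii, letter in enumerate(ip):
        if ii >= len(qs):
            # ip is longer than qs: stop instead of raising IndexError
            break
        if letter == qs[ii]:
            v += al
        else:
            ac = 0
            for jj in range(al):
                if ii - jj < 0 or ii + jj > len(qs) - 1:
                    break
                elif letter == qs[ii - jj] or letter == qs[ii + jj]:
                    ac += jj
                    break
            v += ac
    return v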


def getSimilarQuerySet(queryset, inp, length):
    # Score every candidate, then return the `length` highest-scoring ones.
    scores = {it: compare(it, inp) for it in queryset}
    ranked = reversed(sorted(scores.items(), key=lambda item: item[1]))
    return [k for k, _ in ranked][:length]


if __name__ == "__main__":
    print(compare('apple', 'mongo'))
    # 0
    print(compare('apple', 'apple'))
    # 10
    print(compare('apple', 'appel'))
    # 7
    print(compare('dude', 'ud'))
    # 1
    print(compare('dude', 'du'))
    # 4
    print(compare('dude', 'dud'))
    # 6

    print(compare('apple', 'mongo'))
    # 0
    print(compare('apple', 'appel'))
    # 7

    print(getSimilarQuerySet(
        [
            "java",
            "jquery",
            "javascript",
            "jude",
            "aja",
        ], 
        "ja",
        2,
    ))
    # ['javascript', 'java']

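Explanation

compare takes two strings and returns a non-negative integer.
You can edit the al variable in compare; it sets how wide a range is searched. It works like this: iterate over the input string, and if the same character is found at the same index in the other string, add the maximum value (al) to the accumulator. Otherwise, search the neighbouring indexes within the range allowed by al, and on a match add a partial score based on the letter's distance (in the code as written this is the offset jj, which is always smaller than al).
length is the number of items you want in the result, that is, how many of the items most similar to the input string are returned.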
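For comparison only, the standard library already ships a ranking helper for this kind of lookup, difflib.get_close_matches. It is a different technique from the compare function above (it ranks by SequenceMatcher ratio and drops anything under a cutoff), so the results will not necessarily agree. A minimal sketch using the same candidate list:

```python
from difflib import get_close_matches

candidates = ["java", "jquery", "javascript", "jude", "aja"]

# n limits the number of results; cutoff (0.6 is the library default)
# discards candidates whose ratio against the input is too low.
closest = get_close_matches("ja", candidates, n=2, cutoff=0.6)
print(closest)
```

Lowering cutoff lets weaker matches through, which may be preferable for short inputs such as tag prefixes.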