计算两个文本块之间百分比差异的算法

Question

计算两个文本块之间百分比差异的算法

pythonalgorithmstring

3

我一直在研究如何找到一种高效的解决方案。我研究了一些差异引擎（谷歌的diff-match-patch、Python的diff）以及一些最长公共链算法。

希望能得到你们的建议，有没有特别推荐的算法或库可以解决这个问题？

- colorfulgrayscale

2

看起来你已经回答了自己的问题。如果你正在研究“最长公共子序列”和“diff”，那么你知道这是一个经典问题，也是NP难问题。在这里没有人能比几代计算机科学家做得更好。http://en.wikipedia.org/wiki/Longest_common_subsequence_problem - msw

DNA匹配怎么样？http://www.springerlink.com/content/k2uq2r0037105466/ - Hamish Grubijan

1

你想要什么，“百分比差异”（如何定义？）还是“最长公共子串”（子串可能等于“链”？？？），还是两者都要？请给出一个例子，例如s1 =“thequickbrownfox”，s2 =“thequickvrownfox”，你期望得到什么输出？你将如何使用结果？你做了哪些研究，目前得出了什么结论？这是作业吗？ - John Machin

@s1 = "thequickbrownfox" @s2 = "thequickbrownfoxes"输出，相似度为98％（差异= 2％）不，这不是为了作业，我正在努力将其集成到维基中。 - colorfulgrayscale

1

@colorful：通过向16个字符的字符串添加2个字符，如何得到差异= 2％？我再问一遍：你如何定义百分比差异？如果你需要帮助而不是胡闹，请回答其他问题。 - John Machin

4个回答

2

除了difflib和其他常见的子序列库外，如果是自然语言文本，您可以考虑使用词干提取技术，将单词规范化为其根形式。您可以在自然语言工具包（http://www.nltk.org/）库中找到几个实现。您还可以通过使用N-Grams（http://en.wikipedia.org/wiki/N-gram）更语义化地比较自然语言文本的块。

- gilesc

1

最长公共链？也许这个链接可以帮到你：http://en.wikipedia.org/wiki/Longest_common_subsequence_problem

- Aryabhatta

0

另一个有趣的领域可能是Levenshtein距离，在这里有描述。

- Grembo

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Alex Martelli · Accepted Answer

我不知道“最长公共[[链？子串？]]”与“百分比差异”有什么关系，尤其是在看到评论后，您预计两个字符串之间的非常小的％差异，这两个字符串仅在中间相差一个字符（因此它们的最长公共子串约为字符串长度的一半）。

忽略“最长公共”的奇怪之处，并将“百分比差异”定义为字符串之间的编辑距离除以最大长度（当然要乘以100;-），那么问题来了：

def levenshtein_distance(first, second):
    """Find the Levenshtein distance between two strings."""
    if len(first) > len(second):
        first, second = second, first
    if len(second) == 0:
        return len(first)
    first_length = len(first) + 1
    second_length = len(second) + 1
    distance_matrix = [[0] * second_length for x in range(first_length)]
    for i in range(first_length):
       distance_matrix[i][0] = i
    for j in range(second_length):
       distance_matrix[0][j]=j
    for i in xrange(1, first_length):
        for j in range(1, second_length):
            deletion = distance_matrix[i-1][j] + 1
            insertion = distance_matrix[i][j-1] + 1
            substitution = distance_matrix[i-1][j-1]
            if first[i-1] != second[j-1]:
                substitution += 1
            distance_matrix[i][j] = min(insertion, deletion, substitution)
    return distance_matrix[first_length-1][second_length-1]

def percent_diff(first, second):
    return 100*levenshtein_distance(a, b) / float(max(len(a), len(b)))

a = "the quick brown fox"
b = "the quick vrown fox"
print '%.2f' % percent_diff(a, b)

Levenshtein函数来自 Stavros的博客。在这种情况下，结果为5.26％（差异百分比）。