如何衡量两个字符串之间的相似度？

Question

如何衡量两个字符串之间的相似度？

55

给定两个字符串 text1 和 text2：

public SOMEUSABLERETURNTYPE Compare(string text1, string text2)
{
     // DO SOMETHING HERE TO COMPARE
}

示例：

第一个字符串：StackOverflow

第二个字符串：StaqOverflow

返回：相似度为91％

返回结果可以是％或类似的东西。
第一个字符串：The simple text test

第二个字符串：The complex text test

返回：这些值可以被认为是相等的

有什么想法吗？最好的方法是什么？

- Zanoni

9

为什么您认为示例2中的两个字符串应该相等？ - Sinan Ünür

我有什么遗漏吗？除了第一个例子可能暗示了音韵相似性之外，原帖作者是否表达了他关心的是音韵而不是字符？第二个例子显然没有。 - Kevin

我猜“相似度”和“音形码”是最接近的，但它们是不同的东西。“相似度”验证需要使用“音形码”算法和“相似度”算法才能正确验证文本。 - Zanoni

@kcrumley：第二个例子只是假设。对于每个找到的单词，具有一定相似性的字符串可以被视为相似。 - Zanoni

12个回答

27

Levenshtein distance（编辑距离）可能是你在寻找的内容。

- LiraNuna

这是一个很好的编写示例，一定会喜欢DotNetPerls。http://www.dotnetperls.com/levenshtein - jamesbar2

15

以下是我为正在开发的项目编写的一些代码。我需要知道字符串的相似性比率以及基于单词的字符串相似性比率。最后一个问题，我想知道最小字符串的单词相似度比率(因此，如果所有单词都存在且匹配较大的字符串，则结果将为100%)，以及较大字符串的单词相似度比率(我称其为RealWordsRatio)。我使用Levenshtein算法来找到距离。代码未经过优化，但目前它能正常工作。希望你觉得它有用。

public static int Compute(string s, string t)
    {
        int n = s.Length;
        int m = t.Length;
        int[,] d = new int[n + 1, m + 1];

        // Step 1
        if (n == 0)
        {
            return m;
        }

        if (m == 0)
        {
            return n;
        }

        // Step 2
        for (int i = 0; i <= n; d[i, 0] = i++)
        {
        }

        for (int j = 0; j <= m; d[0, j] = j++)
        {
        }

        // Step 3
        for (int i = 1; i <= n; i++)
        {
            //Step 4
            for (int j = 1; j <= m; j++)
            {
                // Step 5
                int cost = (t[j - 1] == s[i - 1]) ? 0 : 1;

                // Step 6
                d[i, j] = Math.Min(
                    Math.Min(d[i - 1, j] + 1, d[i, j - 1] + 1),
                    d[i - 1, j - 1] + cost);
            }
        }
        // Step 7
        return d[n, m];
    }

double GetSimilarityRatio(String FullString1, String FullString2, out double WordsRatio, out double RealWordsRatio)
    {
        double theResult = 0;
        String[] Splitted1 = FullString1.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
        String[] Splitted2 = FullString2.Split(new char[]{' '}, StringSplitOptions.RemoveEmptyEntries);
        if (Splitted1.Length < Splitted2.Length)
        {
            String[] Temp = Splitted2;
            Splitted2 = Splitted1;
            Splitted1 = Temp;
        }
        int[,] theScores = new int[Splitted1.Length, Splitted2.Length];//Keep the best scores for each word.0 is the best, 1000 is the starting.
        int[] BestWord = new int[Splitted1.Length];//Index to the best word of Splitted2 for the Splitted1.

        for (int loop = 0; loop < Splitted1.Length; loop++) 
        {
            for (int loop1 = 0; loop1 < Splitted2.Length; loop1++) theScores[loop, loop1] = 1000;
            BestWord[loop] = -1;
        }
        int WordsMatched = 0;
        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            String String1 = Splitted1[loop];
            for (int loop1 = 0; loop1 < Splitted2.Length; loop1++)
            {
                String String2 = Splitted2[loop1];
                int LevenshteinDistance = Compute(String1, String2);
                theScores[loop, loop1] = LevenshteinDistance;
                if (BestWord[loop] == -1 || theScores[loop, BestWord[loop]] > LevenshteinDistance) BestWord[loop] = loop1;
            }
        }

        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            if (theScores[loop, BestWord[loop]] == 1000) continue;
            for (int loop1 = loop + 1; loop1 < Splitted1.Length; loop1++)
            {
                if (theScores[loop1, BestWord[loop1]] == 1000) continue;//the worst score available, so there are no more words left
                if (BestWord[loop] == BestWord[loop1])//2 words have the same best word
                {
                    //The first in order has the advantage of keeping the word in equality
                    if (theScores[loop, BestWord[loop]] <= theScores[loop1, BestWord[loop1]])
                    {
                        theScores[loop1, BestWord[loop1]] = 1000;
                        int CurrentBest = -1;
                        int CurrentScore = 1000;
                        for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
                        {
                            //Find next bestword
                            if (CurrentBest == -1 || CurrentScore > theScores[loop1, loop2])
                            {
                                CurrentBest = loop2;
                                CurrentScore = theScores[loop1, loop2];
                            }
                        }
                        BestWord[loop1] = CurrentBest;
                    }
                    else//the latter has a better score
                    {
                        theScores[loop, BestWord[loop]] = 1000;
                        int CurrentBest = -1;
                        int CurrentScore = 1000;
                        for (int loop2 = 0; loop2 < Splitted2.Length; loop2++)
                        {
                            //Find next bestword
                            if (CurrentBest == -1 || CurrentScore > theScores[loop, loop2])
                            {
                                CurrentBest = loop2;
                                CurrentScore = theScores[loop, loop2];
                            }
                        }
                        BestWord[loop] = CurrentBest;
                    }

                    loop = -1;
                    break;//recalculate all
                }
            }
        }
        for (int loop = 0; loop < Splitted1.Length; loop++)
        {
            if (theScores[loop, BestWord[loop]] == 1000) theResult += Splitted1[loop].Length;//All words without a score for best word are max failures
            else
            {
                theResult += theScores[loop, BestWord[loop]];
                if (theScores[loop, BestWord[loop]] == 0) WordsMatched++;
            }
        }
        int theLength = (FullString1.Replace(" ", "").Length > FullString2.Replace(" ", "").Length) ? FullString1.Replace(" ", "").Length : FullString2.Replace(" ", "").Length;
        if(theResult > theLength) theResult = theLength;
        theResult = (1 - (theResult / theLength)) * 100;
        WordsRatio = ((double)WordsMatched / (double)Splitted2.Length) * 100;
        RealWordsRatio = ((double)WordsMatched / (double)Splitted1.Length) * 100;
        return theResult;
    }

- ThunderGr

5

我之前写了一个用C#实现的双重音标算法，你会发现它比Soundex等算法要好得多。

Levenshtein距离也被提出过，对于许多用途来说，它是一个很棒的算法，但是它并不真正执行语音匹配；有时候它只是看起来像这样，因为语音上相似的单词通常也拼写相似。我进行了各种模糊匹配算法的分析，你可能也会觉得它有用。

- anelson

4

为了解决“发音相似”的问题，您可能需要考虑使用像双重隐喻或soundex这样的语音算法进行编码。我不知道在音素编码字符串上计算Levenshtein距离是否有益，但也许是一个可能性。或者，您可以采用这样的启发式方法：将字符串中的每个单词转换为其编码形式，并删除同时出现在两个字符串中的任何单词，并在计算Levenshtein距离之前用单个表示替换它们。

- bdk

Soundex算法被医学界用于警告类似的姓氏。它包含在许多标准库中。 - slypete

1

Metaphone比Soundex要优秀得多。 - Dour High Arch

3

你可以查找字符串 "distances"，例如莱文斯坦距离（Levenshtein distance）。

- schnaader

3

Perl模块Text::Phonetic包含多种算法实现。

- Sinan Ünür

2

Jeff Atwood写过一个寻找类似解决方案的文章，用于确定维基帖子的作者，这可能有助于您缩小搜索范围。

- John Sheehan

2

如果您要比较SQL数据库中的值，可以使用SOUNDEX函数。如果您查询SOUNDEX和C#以及VB的谷歌搜索，一些人已经为此编写了类似的函数。

- Rob

1

Metaphone 3是Metaphone算法的第三代。与双重Metaphone的89％相比，在最常见的英语单词、名字和北美熟悉的非英语单词数据库测试中，它将语音编码的准确性提高到98％。这为美式发音提供了极其可靠的语音编码。Metaphone 3由Lawrence Philips设计和开发，他也设计和开发了原始的Metaphone和Double Metaphone算法。

- bjan

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- Jon Skeet · Accepted Answer

有多种不同的方法可以做到这一点。请查看 Wikipedia "String similarity measures" 页面，以获取指向其他算法页面的链接。

我认为这些算法中没有任何一个考虑了音频，所以 "staq overflow" 和 "stack overflow" 与 "staw overflow" 相似程度相同，尽管前者在发音上更相似。

我刚刚发现另一个页面提供了更多选择......特别是 Soundex 算法 (维基百科) 可能更接近您想要的结果。