Java中的相似字符串比较

145

我想比较几个字符串,找出它们中最相似的那些。我想知道是否有任何库、方法或最佳实践可以返回哪些字符串更相似。例如:

  • "The quick fox jumped" -> "The fox jumped"
  • "The quick fox jumped" -> "The fox"

这种比较会返回第一个比第二个更相似。

我想我需要一些类似于:

double similarityIndex(String s1, String s2)

有这样的东西吗?

编辑:为什么我要这样做?我正在编写一个脚本,用于比较MS Project文件的输出和处理任务的某个旧系统的输出。由于旧系统具有非常有限的字段宽度,当添加值时,描述会被缩写。我想找到一种半自动化的方法,找出MS Project中与系统上的条目相似的条目,以便获取生成的密钥。它有一些缺点,因为仍需要手动检查,但可以节省很多工作。

12个回答

191

在许多库中常用的计算两个字符串相似度的方法是以0%-100%的方式来衡量,即测量将较长的字符串更改为较短的字符串所需的变化量(以%表示):

/**
 * Calculates the similarity (a number within 0 and 1) between two strings.
 */
public static double similarity(String s1, String s2) {
  String longer = s1, shorter = s2;
  if (s1.length() < s2.length()) { // longer should always have greater length
    longer = s2; shorter = s1;
  }
  int longerLength = longer.length();
  if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
  return (longerLength - editDistance(longer, shorter)) / (double) longerLength;
}
// you can use StringUtils.getLevenshteinDistance() as the editDistance() function
// full copy-paste working code is below


计算editDistance():

上面的editDistance()函数旨在计算两个字符串之间的编辑距离。有几种实现可以完成此步骤,每个实现可能更适合特定的场景。最常见的是Levenshtein距离算法,我们将在下面的示例中使用它(对于非常大的字符串,其他算法可能会表现得更好)。

以下是计算编辑距离的两个选项:


工作示例:

在此处查看在线演示。

public class StringSimilarity {

  /**
   * Calculates the similarity (a number within 0 and 1) between two strings.
   */
  public static double similarity(String s1, String s2) {
    String longer = s1, shorter = s2;
    if (s1.length() < s2.length()) { // longer should always have greater length
      longer = s2; shorter = s1;
    }
    int longerLength = longer.length();
    if (longerLength == 0) { return 1.0; /* both strings are zero length */ }
    /* // If you have Apache Commons Text, you can use it to calculate the edit distance:
    LevenshteinDistance levenshteinDistance = new LevenshteinDistance();
    return (longerLength - levenshteinDistance.apply(longer, shorter)) / (double) longerLength; */
    return (longerLength - editDistance(longer, shorter)) / (double) longerLength;

  }

  // Example implementation of the Levenshtein Edit Distance
  // See http://rosettacode.org/wiki/Levenshtein_distance#Java
  public static int editDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
      int lastValue = i;
      for (int j = 0; j <= s2.length(); j++) {
        if (i == 0)
          costs[j] = j;
        else {
          if (j > 0) {
            int newValue = costs[j - 1];
            if (s1.charAt(i - 1) != s2.charAt(j - 1))
              newValue = Math.min(Math.min(newValue, lastValue),
                  costs[j]) + 1;
            costs[j - 1] = lastValue;
            lastValue = newValue;
          }
        }
      }
      if (i > 0)
        costs[s2.length()] = lastValue;
    }
    return costs[s2.length()];
  }

  public static void printSimilarity(String s, String t) {
    System.out.println(String.format(
      "%.3f is the similarity between \"%s\" and \"%s\"", similarity(s, t), s, t));
  }

  public static void main(String[] args) {
    printSimilarity("", "");
    printSimilarity("1234567890", "1");
    printSimilarity("1234567890", "123");
    printSimilarity("1234567890", "1234567");
    printSimilarity("1234567890", "1234567890");
    printSimilarity("1234567890", "1234567980");
    printSimilarity("47/2010", "472010");
    printSimilarity("47/2010", "472011");
    printSimilarity("47/2010", "AB.CDEF");
    printSimilarity("47/2010", "4B.CDEFG");
    printSimilarity("47/2010", "AB.CDEFG");
    printSimilarity("The quick fox jumped", "The fox jumped");
    printSimilarity("The quick fox jumped", "The fox");
    printSimilarity("kitten", "sitting");
  }

}

输出:

1.000 is the similarity between "" and ""
0.100 is the similarity between "1234567890" and "1"
0.300 is the similarity between "1234567890" and "123"
0.700 is the similarity between "1234567890" and "1234567"
1.000 is the similarity between "1234567890" and "1234567890"
0.800 is the similarity between "1234567890" and "1234567980"
0.857 is the similarity between "47/2010" and "472010"
0.714 is the similarity between "47/2010" and "472011"
0.000 is the similarity between "47/2010" and "AB.CDEF"
0.125 is the similarity between "47/2010" and "4B.CDEFG"
0.000 is the similarity between "47/2010" and "AB.CDEFG"
0.700 is the similarity between "The quick fox jumped" and "The fox jumped"
0.350 is the similarity between "The quick fox jumped" and "The fox"
0.571 is the similarity between "kitten" and "sitting"

15
"Levenshtein距离方法可以在org.apache.commons.lang3.StringUtils中找到。" - Cleankod
1
@Cleankod 现在它是 commons-text 的一部分:https://commons.apache.org/proper/commons-text/javadocs/api-release/org/apache/commons/text/similarity/LevenshteinDistance.html - Luiz

91

20
Simmetrics网站似乎不再活跃。不过,我在sourceforge上找到了代码:http://sourceforge.net/projects/simmetrics/。感谢指引。 - Michael Merchant
7
“你可以查看这个链接”的链接已经损坏。 - Kiril
2
这就是为什么Michael Merchant在上面发布了正确的链接。 - emilyk
2
Simmetrics在sourceforge上的jar包有点过时了,https://github.com/mpkorstanje/simmetrics是更新的GitHub页面,其中包含Maven构件。 - tom91136
补充@MichaelMerchant的评论,该项目也可在github上找到。虽然那里也不是很活跃,但比sourceforge更新一些。 - Ghurdyl

16

我将 Levenshtein距离算法 翻译成了 JavaScript:

String.prototype.LevenshteinDistance = function (s2) {
    var array = new Array(this.length + 1);
    for (var i = 0; i < this.length + 1; i++)
        array[i] = new Array(s2.length + 1);

    for (var i = 0; i < this.length + 1; i++)
        array[i][0] = i;
    for (var j = 0; j < s2.length + 1; j++)
        array[0][j] = j;

    for (var i = 1; i < this.length + 1; i++) {
        for (var j = 1; j < s2.length + 1; j++) {
            if (this[i - 1] == s2[j - 1]) array[i][j] = array[i - 1][j - 1];
            else {
                array[i][j] = Math.min(array[i][j - 1] + 1, array[i - 1][j] + 1);
                array[i][j] = Math.min(array[i][j], array[i - 1][j - 1] + 1);
            }
        }
    }
    return array[this.length][s2.length];
};

14

确实有很多字符串相似度度量方法:

  • Levenshtein编辑距离;
  • Damerau-Levenshtein距离;
  • Jaro-Winkler相似度;
  • 最长公共子序列编辑距离;
  • Q-Gram (Ukkonen);
  • n-Gram距离 (Kondrak);
  • Jaccard指数;
  • Sorensen-Dice系数;
  • 余弦相似度;
  • ...

您可以在这里找到它们的解释和Java实现: https://github.com/tdebatty/java-string-similarity


14

3
截至2017年10月,链接方法已被弃用。请改用commons文本库中的LevenshteinDistanceFuzzyScore类。 - vatbub

11

2
Levenshtein对于少量字符串非常好用,但在大量字符串的比较时无法扩展。 - spender
我在Java中使用了Levenshtein,并取得了一些成功。我还没有对大型列表进行比较,所以可能会有性能问题。 另外,这个算法有点简单,可以通过对较短的单词(如3或4个字符)设置更高的阈值来进行调整,因为这些单词往往被认为比应该更相似(例如从cat到dog只需3次编辑)。 请注意,下面提到的Edit Distances基本上是相同的概念 - Levenshtein是编辑距离的一种具体实现方式。 - Rhubarb
这是一篇文章,展示了如何将Levenshtein算法与高效的SQL查询结合起来使用:http://literatejava.com/sql/fuzzy-string-search-sql/ - Thomas W

5

感谢第一个回答者,我认为computeEditDistance(s1,s2)有两种计算方式。由于其时间消耗较高,所以决定改进代码的性能。因此:

public class LevenshteinDistance {

public static int computeEditDistance(String s1, String s2) {
    s1 = s1.toLowerCase();
    s2 = s2.toLowerCase();

    int[] costs = new int[s2.length() + 1];
    for (int i = 0; i <= s1.length(); i++) {
        int lastValue = i;
        for (int j = 0; j <= s2.length(); j++) {
            if (i == 0) {
                costs[j] = j;
            } else {
                if (j > 0) {
                    int newValue = costs[j - 1];
                    if (s1.charAt(i - 1) != s2.charAt(j - 1)) {
                        newValue = Math.min(Math.min(newValue, lastValue),
                                costs[j]) + 1;
                    }
                    costs[j - 1] = lastValue;
                    lastValue = newValue;
                }
            }
        }
        if (i > 0) {
            costs[s2.length()] = lastValue;
        }
    }
    return costs[s2.length()];
}

public static void printDistance(String s1, String s2) {
    double similarityOfStrings = 0.0;
    int editDistance = 0;
    if (s1.length() < s2.length()) { // s1 should always be bigger
        String swap = s1;
        s1 = s2;
        s2 = swap;
    }
    int bigLen = s1.length();
    editDistance = computeEditDistance(s1, s2);
    if (bigLen == 0) {
        similarityOfStrings = 1.0; /* both strings are zero length */
    } else {
        similarityOfStrings = (bigLen - editDistance) / (double) bigLen;
    }
    //////////////////////////
    //System.out.println(s1 + "-->" + s2 + ": " +
      //      editDistance + " (" + similarityOfStrings + ")");
    System.out.println(editDistance + " (" + similarityOfStrings + ")");
}

public static void main(String[] args) {
    printDistance("", "");
    printDistance("1234567890", "1");
    printDistance("1234567890", "12");
    printDistance("1234567890", "123");
    printDistance("1234567890", "1234");
    printDistance("1234567890", "12345");
    printDistance("1234567890", "123456");
    printDistance("1234567890", "1234567");
    printDistance("1234567890", "12345678");
    printDistance("1234567890", "123456789");
    printDistance("1234567890", "1234567890");
    printDistance("1234567890", "1234567980");

    printDistance("47/2010", "472010");
    printDistance("47/2010", "472011");

    printDistance("47/2010", "AB.CDEF");
    printDistance("47/2010", "4B.CDEFG");
    printDistance("47/2010", "AB.CDEFG");

    printDistance("The quick fox jumped", "The fox jumped");
    printDistance("The quick fox jumped", "The fox");
    printDistance("The quick fox jumped",
            "The quick fox jumped off the balcany");
    printDistance("kitten", "sitting");
    printDistance("rosettacode", "raisethysword");
    printDistance(new StringBuilder("rosettacode").reverse().toString(),
            new StringBuilder("raisethysword").reverse().toString());
    for (int i = 1; i < args.length; i += 2) {
        printDistance(args[i - 1], args[i]);
    }


 }
}

3

3

通常使用编辑距离测量来完成此操作。搜索“编辑距离java”会返回许多库,例如这个


3

如果你的字符串变成文档,那么它听起来像是抄袭检测器。也许使用这个术语进行搜索会得到一些好的结果。

《编程集体智慧》有一章关于确定两个文档是否相似。代码是用Python编写的,但很干净且易于移植。


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接