计算 Levenshtein 距离的最有效方法

32
我刚刚实现了一个最佳匹配文件搜索算法,用于在字典中找到与字符串最相似的匹配项。在分析我的代码后,我发现绝大部分时间都是用于计算查询和可能结果之间的距离。目前我正在实现使用二维数组计算Levenshtein距离的算法,这使得实现成为O(n^2)操作。我希望有人能够建议更快的执行同样功能的方法。
以下是我的实现代码:
public int calculate(String root, String query)
{
  int arr[][] = new int[root.length() + 2][query.length() + 2];

  for (int i = 2; i < root.length() + 2; i++)
  {
    arr[i][0] = (int) root.charAt(i - 2);
    arr[i][1] = (i - 1);
  }

  for (int i = 2; i < query.length() + 2; i++)
  {
    arr[0][i] = (int) query.charAt(i - 2);
    arr[1][i] = (i - 1);
  }

  for (int i = 2; i < root.length() + 2; i++)
  {
    for (int j = 2; j < query.length() + 2; j++)
    {
      int diff = 0;
      if (arr[0][j] != arr[i][0])
      {
        diff = 1;
      }
      arr[i][j] = min((arr[i - 1][j] + 1), (arr[i][j - 1] + 1), (arr[i - 1][j - 1] + diff));
    }
  }
  return arr[root.length() + 1][query.length() + 1];
}

public int min(int n1, int n2, int n3)
{
  return (int) Math.min(n1, Math.min(n2, n3));
}

是的,其他部分已经足够高效了。我对我的代码进行了分析,发现瓶颈在于计算Levenshtein距离,因此我正在尝试优化这一部分。我正在实现维基百科文章中提到的改进方法,并将其与VP树的实现进行比较,以确定哪种方法更加高效。 - efficiencyIsBliss
4
使用一个二维数组来计算两个序列之间的Levenshtein距离,虽然可以实现,但时间复杂度为O(n^2)。不论你使用多少内存,没有限制的情况下,计算Levenshtein距离已经是一个O(n^2)的操作。而使用一个二维数组只会减慢速度并浪费内存,只需要O(n)的内存即可。 - John Machin
@John Machin 我知道这很古老,但你能提供一个示例或一些链接,展示如何实现那个O(n)空间解决方案吗? - celavek
1
@celavek 这个实现声称具有O(n)的空间复杂度。 - Andrea Reina
6个回答

28

Levenshtein距离的维基百科条目提供了有用的优化计算建议--在你的情况下,最适用的一个建议是如果你可以将兴趣的最大距离设定一个上限k(超过这个距离的所有内容都可以视为无穷大!),那么你可以将计算复杂度从O(n平方)降低到O(n乘以k)(基本上是通过一旦最小可能距离变成> k就放弃来实现的)。

由于你正在寻找最接近的匹配,因此可以逐步减小k到迄今为止找到的最佳匹配距离--这不会影响最坏情况的行为(因为匹配可能按距离递减排序,这意味着你永远不会更早地退出),但平均情况应该会得到改善。

我相信,如果你需要获得相当大的性能提升,可能必须接受一些强烈妥协,即计算更近似的距离(因此获得“相当不错的匹配”而不一定是最优匹配)。


1
我刚刚也发现了同样的问题,正要在这里发布。谢谢! - efficiencyIsBliss
你可以使用任何“近似距离”度量标准来创建一小组匹配项,然后使用真正的Levenshtein算法对它们进行排名。 - kibibu
@Alex:如果OP想要找到所有“可能”的匹配项,那么也可以进行优化,其中“可能”被定义为例如distance(query,candidate)/max(len(query),len(candidate)) < some_threshold_fraction - John Machin
@Dharmesh,我看到你在新问题上得到了很好的答案,希望你现在对于要计算什么以及如何计算的疑虑已经消除,不需要我再提供进一步的输入了。 - Alex Martelli
多亏了你的想法,我刚刚为apache-commons-lang做出了贡献,他们居然还没有这样做?https://github.com/apache/commons-lang/pull/118 - jontejj
显示剩余7条评论

8

10
我知道这是一个旧的帖子,但你让我有一瞬间感到困惑。你说的n不是同一个!在原帖中,n是单词长度,我们关心计算两个单词之间距离所需的时间。而对于你来说,n是字典中的单词数,而n log n是调用距离函数的次数。如果结合起来,公式为 W k D log D,其中D是词典大小,W是单词长度,k是距离阈值。 - Antoine
截至2023年1月1日,“加速Levenshtein”链接已经失效,我没有找到一个好的解决方法。链接的Python代码展示了使用VP-Tree解决拼写检查问题的示例。它使用Levenshtein距离作为将函数应用于VP-Tree节点的一种方式。Levenshtein距离仍然是O(m*n),其中m和n是要比较的单词的大小。 - hlongmore

4

我修改了在这个帖子中找到的Levenshtein距离VBA函数,使用一个一维数组。 它执行得更快。

'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)

Public Function LevenshteinDistance2(ByRef s1 As String, ByRef s2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long, LD As Long 'Length of input strings and distance matrix
Dim i As Long, j As Long, ss2 As Long, ssL As Long, cost As Long 'loop counters, loop step, loop start, and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long 'cost of next Insertion, Deletion and Substitution
Dim L1p1 As Long, L1p2 As Long 'Length of S1 + 1, Length of S1 + 2

L1 = Len(s1): L2 = Len(s2)
L1p1 = L1 + 1
L1p2 = L1 + 2
LD = (((L1 + 1) * (L2 + 1))) - 1
ReDim D(0 To LD)
ss2 = L1 + 1

For i = 0 To L1 Step 1: D(i) = i: Next i                'setup array positions 0,1,2,3,4,...
For j = 0 To LD Step ss2: D(j) = j / ss2: Next j        'setup array positions 0,1,2,3,4,...

For j = 1 To L2
    ssL = (L1 + 1) * j
    For i = (ssL + 1) To (ssL + L1)
        If Mid$(s1, i Mod ssL, 1) <> Mid$(s2, j, 1) Then cost = 1 Else cost = 0
        cI = D(i - 1) + 1
        cD = D(i - L1p1) + 1
        cS = D(i - L1p2) + cost

        If cI <= cD Then 'Insertion or Substitution
            If cI <= cS Then D(i) = cI Else D(i) = cS
        Else 'Deletion or Substitution
            If cD <= cS Then D(i) = cD Else D(i) = cS
        End If
    Next i
Next j

LevenshteinDistance2 = D(LD)
End Function

我已经测试了这个函数,使用长度为11,304的字符串's1'和长度为5,665的字符串's2'(进行了超过6400万次字符比较)。使用上述单维度版本的函数,在我的机器上执行时间约为24秒。在上面链接中提到的原始双维度函数对于相同的字符串需要约37秒。我进一步优化了单维度函数,如下所示,对于相同的字符串只需要约10秒。

'Calculate the Levenshtein Distance between two strings (the number of insertions,
'deletions, and substitutions needed to transform the first string into the second)
Public Function LevenshteinDistance(ByRef s1 As String, ByRef s2 As String) As Long
Dim L1 As Long, L2 As Long, D() As Long, LD As Long         'Length of input strings and distance matrix
Dim i As Long, j As Long, ss2 As Long                       'loop counters, loop step
Dim ssL As Long, cost As Long                               'loop start, and cost of substitution for current letter
Dim cI As Long, cD As Long, cS As Long                      'cost of next Insertion, Deletion and Substitution
Dim L1p1 As Long, L1p2 As Long                              'Length of S1 + 1, Length of S1 + 2
Dim sss1() As String, sss2() As String                      'Character arrays for string S1 & S2

L1 = Len(s1): L2 = Len(s2)
L1p1 = L1 + 1
L1p2 = L1 + 2
LD = (((L1 + 1) * (L2 + 1))) - 1
ReDim D(0 To LD)
ss2 = L1 + 1

For i = 0 To L1 Step 1: D(i) = i: Next i                    'setup array positions 0,1,2,3,4,...
For j = 0 To LD Step ss2: D(j) = j / ss2: Next j            'setup array positions 0,1,2,3,4,...

ReDim sss1(1 To L1)                                         'Size character array S1
ReDim sss2(1 To L2)                                         'Size character array S2
For i = 1 To L1 Step 1: sss1(i) = Mid$(s1, i, 1): Next i    'Fill S1 character array
For i = 1 To L2 Step 1: sss2(i) = Mid$(s2, i, 1): Next i    'Fill S2 character array

For j = 1 To L2
    ssL = (L1 + 1) * j
    For i = (ssL + 1) To (ssL + L1)
        If sss1(i Mod ssL) <> sss2(j) Then cost = 1 Else cost = 0
        cI = D(i - 1) + 1
        cD = D(i - L1p1) + 1
        cS = D(i - L1p2) + cost
        If cI <= cD Then 'Insertion or Substitution
            If cI <= cS Then D(i) = cI Else D(i) = cS
        Else 'Deletion or Substitution
            If cD <= cS Then D(i) = cD Else D(i) = cS
        End If
    Next i
Next j

LevenshteinDistance = D(LD)
End Function

3

Commons-lang有一个非常快速的实现。请查看http://web.archive.org/web/20120526085419/http://www.merriampark.com/ldjava.htm

以下是我将其翻译为Scala的结果:

// The code below is based on code from the Apache Commons lang project.
/*
 * Licensed to the Apache Software Foundation (ASF) under one or more
 * contributor license agreements. See the NOTICE file distributed with this
 * work for additional information regarding copyright ownership. The ASF
 * licenses this file to You under the Apache License, Version 2.0 (the
 * "License"); you may not use this file except in compliance with the
 * License. You may obtain a copy of the License at
 * 
 * http://www.apache.org/licenses/LICENSE-2.0
 * 
 * Unless required by applicable law or agreed to in writing, software
 * distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
 * WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. See the
 * License for the specific language governing permissions and limitations
 * under the License.
 */
/**
* assert(levenshtein("algorithm", "altruistic")==6)
* assert(levenshtein("1638452297", "444488444")==9)
* assert(levenshtein("", "") == 0)
* assert(levenshtein("", "a") == 1)
* assert(levenshtein("aaapppp", "") == 7)
* assert(levenshtein("frog", "fog") == 1)
* assert(levenshtein("fly", "ant") == 3)
* assert(levenshtein("elephant", "hippo") == 7)
* assert(levenshtein("hippo", "elephant") == 7)
* assert(levenshtein("hippo", "zzzzzzzz") == 8)
* assert(levenshtein("hello", "hallo") == 1)
*
*/
def levenshtein(s: CharSequence, t: CharSequence, max: Int = Int.MaxValue) = {
import scala.annotation.tailrec
def impl(s: CharSequence, t: CharSequence, n: Int, m: Int) = {
  // Inside impl n <= m!
  val p = new Array[Int](n + 1) // 'previous' cost array, horizontally
  val d = new Array[Int](n + 1) // cost array, horizontally

  @tailrec def fillP(i: Int) {
    p(i) = i
    if (i < n) fillP(i + 1)
  }
  fillP(0)

  @tailrec def eachJ(j: Int, t_j: Char, d: Array[Int], p: Array[Int]): Int = {
    d(0) = j
    @tailrec def eachI(i: Int) {
      val a = d(i - 1) + 1
      val b = p(i) + 1
      d(i) = if (a < b) a else {
        val c = if (s.charAt(i - 1) == t_j) p(i - 1) else p(i - 1) + 1
        if (b < c) b else c
      }
      if (i < n)
        eachI(i + 1)
    }
    eachI(1)

    if (j < m)
      eachJ(j + 1, t.charAt(j), p, d)
    else
      d(n)
  }
  eachJ(1, t.charAt(0), d, p)
}

val n = s.length
val m = t.length
if (n == 0) m else if (m == 0) n else {
  if (n > m) impl(t, s, m, n) else impl(s, t, n, m)
}

}


3
维基百科文章讨论了您的算法以及各种改进。然而,至少在一般情况下,O(n^2) 是最好的结果。
但是,如果您可以限制问题的范围(例如,如果您只对小于d的距离感兴趣,则复杂度为O(dn) - 如果一个匹配的距离接近字符串长度,则可能不太有趣),则有一些改进方法。看看是否可以利用问题的具体特点...

2
我知道这已经很晚了,但它与讨论相关。
如其他人所述,如果您只想检查两个字符串之间的编辑距离是否在某个阈值k内,可以将时间复杂度降至O(kn)。更精确的表达是O((2k+1)n)。您需要获取一个跨越对角线单元格的k个单元格宽度的条形带(条形宽度为2k + 1),并计算位于此条形带上的单元格的值。
有趣的是,Li等人进行了改进,将其进一步降低到O((k + 1) n)。

请解释这是如何发生的。 - sudeepdino008
你可以查看这篇论文,我的回答中有链接。 - Shashwat Mishra
2
那篇论文代表了2012年的前沿技术。现在已经是2016年了。是否有任何实际的参考实现该工作,或者新的研究已经取代了它? - i336_

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接