Modifying the Levenshtein distance algorithm to avoid calculating all distances

8

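I'm working on a fuzzy search implementation, and as part of it we use Apache's StringUtils.getLevenshteinDistance. At the moment we're aiming for a specific maximum average response time for our fuzzy search. After various enhancements and some profiling, we found that the most time is spent calculating the Levenshtein distance. For search strings of three or more characters, it takes up roughly 80-90% of the total time.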

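Now, I know there are limits to what can be done here, but I've read in previous SO questions and on the Wikipedia page for Levenshtein distance that if one is willing to limit the threshold to a set maximum distance, that can help curb the time spent in the algorithm. I'm not sure exactly how to do this, though: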

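If we are only interested in the distance if it is smaller than a threshold k, then it suffices to compute a diagonal stripe of width 2k + 1 in the matrix. In this way, the algorithm can be run in O(kl) time, where l is the length of the shortest string.[3]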
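
To make the quoted bound concrete, here is a small sketch (the class and method names are made up for illustration) that counts how many cells of the (n+1) by (m+1) cost table satisfy |i - j| <= k. Only those cells can hold a value of at most k, because the distance between a prefix of length i and a prefix of length j is at least |i - j|.

```java
final class BandSize {
    /** Counts cells (i, j) with 0 <= i <= n, 0 <= j <= m and |i - j| <= k. */
    static long cellsInBand(int n, int m, int k) {
        long count = 0;
        for (int j = 0; j <= m; j++) {
            // long arithmetic so that a very large k cannot overflow j + k
            long lo = Math.max(0L, (long) j - k);
            long hi = Math.min((long) n, (long) j + k);
            if (lo <= hi) {
                count += hi - lo + 1;
            }
        }
        return count;
    }

    /** Number of cells the unrestricted algorithm fills in. */
    static long cellsInFullTable(int n, int m) {
        return (long) (n + 1) * (m + 1);
    }
}
```

For strings of length 5 and 7 with k = 1, this gives 17 cells in the band against 48 in the full table.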

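Below you will see the original Levenshtein distance code from StringUtils, followed by my modification. I'm basically trying to calculate only the distances within a fixed number of diagonals of the i,j diagonal (so, in my example, the two diagonals above and below the i,j diagonal). However, the way I've done it can't be correct. For example, on the highest diagonal, it will always choose the value of the cell directly above, which will be 0. If anyone could show me how to make this work as I've described, or offer some general advice, it would be greatly appreciated.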

public static int getLevenshteinDistance(String s, String t) {
        if (s == null || t == null) {
            throw new IllegalArgumentException("Strings must not be null");
        }

        int n = s.length(); // length of s
        int m = t.length(); // length of t

        if (n == 0) {
            return m;
        } else if (m == 0) {
            return n;
        }

        if (n > m) {
            // swap the input strings to consume less memory
            String tmp = s;
            s = t;
            t = tmp;
            n = m;
            m = t.length();
        }

        int p[] = new int[n+1]; //'previous' cost array, horizontally
        int d[] = new int[n+1]; // cost array, horizontally
        int _d[]; //placeholder to assist in swapping p and d

        // indexes into strings s and t
        int i; // iterates through s
        int j; // iterates through t

        char t_j; // jth character of t

        int cost; // cost

        for (i = 0; i<=n; i++) {
            p[i] = i;
        }

        for (j = 1; j<=m; j++) {
            t_j = t.charAt(j-1);
            d[0] = j;

            for (i=1; i<=n; i++) {
                cost = s.charAt(i-1)==t_j ? 0 : 1;
                // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
                d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);
            }

            // copy current distance counts to 'previous row' distance counts
            _d = p;
            p = d;
            d = _d;
        }

        // our last action in the above loop was to switch d and p, so p now 
        // actually has the most recent cost counts
        return p[n];
    }

My modification (only to the for loops):

  for (j = 1; j<=m; j++) {
        t_j = t.charAt(j-1);
        d[0] = j;

        int k = Math.max(j-2, 1);
        for (i = k; i <= Math.min(j+2, n); i++) {
            cost = s.charAt(i-1)==t_j ? 0 : 1;
            // minimum of cell to the left+1, to the top+1, diagonally left and up +cost
            d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  p[i-1]+cost);
        }

        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }

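Just thought of something: I could check whether the value is zero and then either ignore it or replace it with an arbitrarily high value. I probably need to think it through a bit more, though. - AHungerArtist
6 Answers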

5
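The difficulty with implementing the window is handling the value to the left of the first entry and the value above the last entry in each row. One way is to start the initial values at 1 instead of 0 and then ignore any 0s you encounter; you then have to subtract 1 from the final answer. Another way is to fill the entries to the left of the first entry and above the last entry with high values, so that the minimum check never picks them. That is the approach I chose when I had to implement this the other day: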
public static int levenshtein(String s, String t, int threshold) {
    int slen = s.length();
    int tlen = t.length();

    // swap so the smaller string is t; this reduces the memory usage
    // of our buffers
    if(tlen > slen) {
        String stmp = s;
        s = t;
        t = stmp;
        int itmp = slen;
        slen = tlen;
        tlen = itmp;
    }

    // p is the previous and d is the current distance array; dtmp is used in swaps
    int[] p = new int[tlen + 1];
    int[] d = new int[tlen + 1];
    int[] dtmp;

    // the values necessary for our threshold are written; the ones after
    // must be filled with large integers since the trailing member of the threshold
    // window in the bottom array will run min across them
    int n = 0;
    for(; n < Math.min(p.length, threshold + 1); ++n)
        p[n] = n;
    Arrays.fill(p, n, p.length, Integer.MAX_VALUE);
    Arrays.fill(d, Integer.MAX_VALUE);

    // this is the core of the Levenshtein edit distance algorithm
    // instead of actually building the matrix, two arrays are swapped back and forth
    // the threshold limits the amount of entries that need to be computed if we're 
    // looking for a match within a set distance
    for(int row = 1; row < s.length()+1; ++row) {
        char schar = s.charAt(row-1);
        d[0] = row;

        // set up our threshold window
        int min = Math.max(1, row - threshold);
        int max = Math.min(d.length, row + threshold + 1);

        // since we're reusing arrays, we need to be sure to wipe the value left of the
        // starting index; we don't have to worry about the value above the ending index
        // as the arrays were initially filled with large integers and we progress to the right
        if(min > 1)
            d[min-1] = Integer.MAX_VALUE;

        for(int col = min; col < max; ++col) {
            if(schar == t.charAt(col-1))
                d[col] = p[col-1];
            else 
                // min of: diagonal, left, up
                d[col] = Math.min(p[col-1], Math.min(d[col-1], p[col])) + 1;
        }
        // swap our arrays
        dtmp = p;
        p = d;
        d = dtmp;
    }

    if(p[tlen] == Integer.MAX_VALUE)
        return -1;
    return p[tlen];
}

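I don't need this anymore, but thanks for providing the solution. It's exactly what I was looking for. - AHungerArtist
I tried this code on 'abcde' and 'XXcde', and it correctly calculates a Levenshtein distance of 2. But if I set the threshold to 1, your method should answer -1, since the actual distance is larger than the threshold, shouldn't it? It keeps answering 2 anyway, unless I set the threshold to 0. Either way, it's much faster than the default implementation! - Lars Blumberg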
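
The behavior Lars describes follows from the code as posted: the band restricts which cells are computed, but a cell inside the band can still hold a value larger than the threshold (for 'abcde' and 'XXcde', both substitutions lie on the main diagonal). A minimal sketch of the missing final check, using a hypothetical helper name and assuming the convention that -1 means "no match within the threshold":

```java
final class ThresholdCheck {
    /**
     * Maps a value taken from a banded computation to the "within threshold" convention:
     * anything negative or above the threshold is reported as -1.
     */
    static int clampToThreshold(int bandedResult, int threshold) {
        if (bandedResult < 0 || bandedResult > threshold) {
            return -1;
        }
        return bandedResult;
    }
}
```

In the method above, this would amount to returning `clampToThreshold(p[tlen], threshold)` instead of `p[tlen]`.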

5

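I've written about Levenshtein automata, which are one way to do this sort of check in O(n) time, in an earlier article. The code samples there are in Python, but the explanation should be helpful, and the referenced papers provide more detail.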


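That looks useful, but for now I just want to see what difference using only the threshold makes, since I'm not sure how much time I'll have for this. - AHungerArtist
Also, we're very close to hitting our target, so a relatively small change is preferable to a big one. - AHungerArtist
As you said in your original question, if you have a threshold, it takes O(n) time instead of O(mn) time. Modifying the dynamic programming procedure is probably simpler in your case, though I'm not sure how you'd go about it. - Nick Johnson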

3
According to Gusfield, Dan (1997), Algorithms on Strings, Trees, and Sequences: Computer Science and Computational Biology (page 264), you should ignore the zeros.

2
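Here, someone answers a very similar question:
Quote:
I've done it a number of times. The way I do it is with a recursive depth-first tree walk of the game tree of possible changes. There is a budget k of changes that I use to prune the tree. With that routine in hand, I first run it with k=0, then k=1, then k=2, until I either get a hit or don't want to go any higher.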
char* a = /* string 1 */;
char* b = /* string 2 */;
int na = strlen(a);
int nb = strlen(b);
bool walk(int ia, int ib, int k){
  /* if the budget is exhausted, prune the search */
  if (k < 0) return false;
  /* if at end of both strings we have a match */ 
  if (ia == na && ib == nb) return true;
  /* if the first characters match, continue walking with no reduction in budget */
  if (ia < na && ib < nb && a[ia] == b[ib] && walk(ia+1, ib+1, k)) return true;
  /* if the first characters don't match, assume there is a 1-character replacement */
  if (ia < na && ib < nb && a[ia] != b[ib] && walk(ia+1, ib+1, k-1)) return true;
  /* try assuming there is an extra character in a */
  if (ia < na && walk(ia+1, ib, k-1)) return true;
  /* try assuming there is an extra character in b */
  if (ib < nb && walk(ia, ib+1, k-1)) return true;
  /* if none of those worked, I give up */
  return false;
}  

Only the main part is included here; the original post has more code.
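
For readers following the Java code elsewhere in this thread, here is a rough Java rendering of the same budgeted walk together with the k=0, k=1, k=2 driver loop the quote describes. It is a sketch, not the quoted author's code: the class name, `smallestBudget`, and the `maxBudget` cut-off are assumptions added for illustration. The running time grows quickly with the budget, so it only makes sense for small k.

```java
final class BudgetedWalk {
    /** Returns true if a[ia..] can be turned into b[ib..] using at most k edits. */
    static boolean walk(String a, String b, int ia, int ib, int k) {
        // budget exhausted: prune this branch
        if (k < 0) return false;
        // reached the end of both strings within budget
        if (ia == a.length() && ib == b.length()) return true;
        if (ia < a.length() && ib < b.length()) {
            // matching characters are free; a mismatch is a substitution costing 1
            int cost = a.charAt(ia) == b.charAt(ib) ? 0 : 1;
            if (walk(a, b, ia + 1, ib + 1, k - cost)) return true;
        }
        // assume there is an extra character in a
        if (ia < a.length() && walk(a, b, ia + 1, ib, k - 1)) return true;
        // assume there is an extra character in b
        if (ib < b.length() && walk(a, b, ia, ib + 1, k - 1)) return true;
        return false;
    }

    /** Tries budgets 0, 1, ..., maxBudget; returns the first that succeeds, or -1. */
    static int smallestBudget(String a, String b, int maxBudget) {
        for (int k = 0; k <= maxBudget; k++) {
            if (walk(a, b, 0, 0, k)) return k;
        }
        return -1;
    }
}
```

Because the budgets are tried in increasing order, the first budget that succeeds equals the edit distance, provided it does not exceed `maxBudget`.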


1

I used the original code and placed this just before the end of the j for loop:

    if (p[n] > s.length() + 5)
        break;

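The +5 is arbitrary, but for our purposes, if the distance is the query length plus five (or whatever number we settle on), it doesn't really matter what is returned, because we consider the match simply too different. It does cut down on things a bit. Still, I'm pretty sure this isn't the idea the Wikipedia statement was talking about, so if anyone understands that better, please let us know.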
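
A related cut-off that is safe to apply, offered as a sketch rather than as what the Wikipedia passage describes: the smallest value in each row of the cost table never decreases from one row to the next, so once an entire row exceeds the limit, the final distance must exceed it too. The class, method and parameter names below are made up for illustration, and returning -1 for "too far apart" is an assumed convention.

```java
final class EarlyExitLevenshtein {
    /** Returns the Levenshtein distance, or -1 as soon as it is known to exceed maxDistance. */
    static int distanceOrGiveUp(String s, String t, int maxDistance) {
        int n = s.length();
        int m = t.length();
        // the distance is at least the difference in lengths
        if (Math.abs(n - m) > maxDistance) {
            return -1;
        }
        int[] prev = new int[n + 1]; // previous row of the cost table
        int[] curr = new int[n + 1]; // row being filled in
        for (int i = 0; i <= n; i++) {
            prev[i] = i;
        }
        for (int j = 1; j <= m; j++) {
            char tj = t.charAt(j - 1);
            curr[0] = j;
            int rowMin = curr[0];
            for (int i = 1; i <= n; i++) {
                int cost = s.charAt(i - 1) == tj ? 0 : 1;
                // minimum of left + 1, up + 1, and diagonal + cost
                curr[i] = Math.min(Math.min(curr[i - 1] + 1, prev[i] + 1), prev[i - 1] + cost);
                rowMin = Math.min(rowMin, curr[i]);
            }
            // every later row can only be larger, so stop here
            if (rowMin > maxDistance) {
                return -1;
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[n] <= maxDistance ? prev[n] : -1;
    }
}
```

Unlike the banded approach, this still fills whole rows, so it saves time only on pairs that turn out to be far apart.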


0

Apache Commons Lang 3.4 has the following implementation:

/**
 * <p>Find the Levenshtein distance between two Strings if it's less than or equal to a given
 * threshold.</p>
 *
 * <p>This is the number of changes needed to change one String into
 * another, where each change is a single character modification (deletion,
 * insertion or substitution).</p>
 *
 * <p>This implementation follows from Algorithms on Strings, Trees and Sequences by Dan Gusfield
 * and Chas Emerick's implementation of the Levenshtein distance algorithm from
 * <a href="http://www.merriampark.com/ld.htm">http://www.merriampark.com/ld.htm</a></p>
 *
 * <pre>
 * StringUtils.getLevenshteinDistance(null, *, *)             = IllegalArgumentException
 * StringUtils.getLevenshteinDistance(*, null, *)             = IllegalArgumentException
 * StringUtils.getLevenshteinDistance(*, *, -1)               = IllegalArgumentException
 * StringUtils.getLevenshteinDistance("","", 0)               = 0
 * StringUtils.getLevenshteinDistance("aaapppp", "", 8)       = 7
 * StringUtils.getLevenshteinDistance("aaapppp", "", 7)       = 7
 * StringUtils.getLevenshteinDistance("aaapppp", "", 6)       = -1
 * StringUtils.getLevenshteinDistance("elephant", "hippo", 7) = 7
 * StringUtils.getLevenshteinDistance("elephant", "hippo", 6) = -1
 * StringUtils.getLevenshteinDistance("hippo", "elephant", 7) = 7
 * StringUtils.getLevenshteinDistance("hippo", "elephant", 6) = -1
 * </pre>
 *
 * @param s  the first String, must not be null
 * @param t  the second String, must not be null
 * @param threshold the target threshold, must not be negative
 * @return result distance, or {@code -1} if the distance would be greater than the threshold
 * @throws IllegalArgumentException if either String input {@code null} or negative threshold
 */
public static int getLevenshteinDistance(CharSequence s, CharSequence t, final int threshold) {
    if (s == null || t == null) {
        throw new IllegalArgumentException("Strings must not be null");
    }
    if (threshold < 0) {
        throw new IllegalArgumentException("Threshold must not be negative");
    }

    /*
    This implementation only computes the distance if it's less than or equal to the
    threshold value, returning -1 if it's greater.  The advantage is performance: unbounded
    distance is O(nm), but a bound of k allows us to reduce it to O(km) time by only
    computing a diagonal stripe of width 2k + 1 of the cost table.
    It is also possible to use this to compute the unbounded Levenshtein distance by starting
    the threshold at 1 and doubling each time until the distance is found; this is O(dm), where
    d is the distance.

    One subtlety comes from needing to ignore entries on the border of our stripe
    eg.
    p[] = |#|#|#|*
    d[] =  *|#|#|#|
    We must ignore the entry to the left of the leftmost member
    We must ignore the entry above the rightmost member

    Another subtlety comes from our stripe running off the matrix if the strings aren't
    of the same size.  Since string s is always swapped to be the shorter of the two,
    the stripe will always run off to the upper right instead of the lower left of the matrix.

    As a concrete example, suppose s is of length 5, t is of length 7, and our threshold is 1.
    In this case we're going to walk a stripe of length 3.  The matrix would look like so:

       1 2 3 4 5
    1 |#|#| | | |
    2 |#|#|#| | |
    3 | |#|#|#| |
    4 | | |#|#|#|
    5 | | | |#|#|
    6 | | | | |#|
    7 | | | | | |

    Note how the stripe leads off the table as there is no possible way to turn a string of length 5
    into one of length 7 in edit distance of 1.

    Additionally, this implementation decreases memory usage by using two
    single-dimensional arrays and swapping them back and forth instead of allocating
    an entire n by m matrix.  This requires a few minor changes, such as immediately returning
    when it's detected that the stripe has run off the matrix and initially filling the arrays with
    large values so that entries we don't compute are ignored.

    See Algorithms on Strings, Trees and Sequences by Dan Gusfield for some discussion.
     */

    int n = s.length(); // length of s
    int m = t.length(); // length of t

    // if one string is empty, the edit distance is necessarily the length of the other
    if (n == 0) {
        return m <= threshold ? m : -1;
    } else if (m == 0) {
        return n <= threshold ? n : -1;
    }

    if (n > m) {
        // swap the two strings to consume less memory
        final CharSequence tmp = s;
        s = t;
        t = tmp;
        n = m;
        m = t.length();
    }

    int p[] = new int[n + 1]; // 'previous' cost array, horizontally
    int d[] = new int[n + 1]; // cost array, horizontally
    int _d[]; // placeholder to assist in swapping p and d

    // fill in starting table values
    final int boundary = Math.min(n, threshold) + 1;
    for (int i = 0; i < boundary; i++) {
        p[i] = i;
    }
    // these fills ensure that the value above the rightmost entry of our
    // stripe will be ignored in following loop iterations
    Arrays.fill(p, boundary, p.length, Integer.MAX_VALUE);
    Arrays.fill(d, Integer.MAX_VALUE);

    // iterates through t
    for (int j = 1; j <= m; j++) {
        final char t_j = t.charAt(j - 1); // jth character of t
        d[0] = j;

        // compute stripe indices, constrain to array size
        final int min = Math.max(1, j - threshold);
        final int max = (j > Integer.MAX_VALUE - threshold) ? n : Math.min(n, j + threshold);

        // the stripe may lead off of the table if s and t are of different sizes
        if (min > max) {
            return -1;
        }

        // ignore entry left of leftmost
        if (min > 1) {
            d[min - 1] = Integer.MAX_VALUE;
        }

        // iterates through [min, max] in s
        for (int i = min; i <= max; i++) {
            if (s.charAt(i - 1) == t_j) {
                // diagonally left and up
                d[i] = p[i - 1];
            } else {
                // 1 + minimum of cell to the left, to the top, diagonally left and up
                d[i] = 1 + Math.min(Math.min(d[i - 1], p[i]), p[i - 1]);
            }
        }

        // copy current distance counts to 'previous row' distance counts
        _d = p;
        p = d;
        d = _d;
    }

    // if p[n] is greater than the threshold, there's no guarantee on it being the correct
    // distance
    if (p[n] <= threshold) {
        return p[n];
    }
    return -1;
}
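
The comment in the listing mentions recovering the unbounded distance by starting the threshold at 1 and doubling it each time. A rough sketch of that driver loop follows; `BoundedDistance` and `unboundedDistance` are hypothetical names, and the only assumption about the bounded routine is that it returns -1 when the distance exceeds the threshold.

```java
final class DoublingSearch {
    /** Hypothetical stand-in for any threshold-limited distance that returns -1 when over the bound. */
    interface BoundedDistance {
        int apply(CharSequence s, CharSequence t, int threshold);
    }

    /** Doubles the threshold until the bounded routine reports a distance. */
    static int unboundedDistance(CharSequence s, CharSequence t, BoundedDistance bounded) {
        // the distance can never exceed the longer length, so that is a safe final bound
        int limit = Math.max(s.length(), t.length());
        int threshold = 1;
        while (threshold < limit) {
            int d = bounded.apply(s, t, threshold);
            if (d >= 0) {
                return d;
            }
            // double, but jump straight to the limit rather than overshoot or overflow
            threshold = (threshold > limit / 2) ? limit : threshold * 2;
        }
        return bounded.apply(s, t, limit);
    }
}
```

With Commons Lang 3.4 on the classpath, the three-argument method shown above should fit this interface as the method reference `StringUtils::getLevenshteinDistance`.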
