Problems with the Levenshtein algorithm in Java

12
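I want to use the Levenshtein algorithm for the following task: when a user searches for a value on my website (typing characters into an input field), I want to check for suggestions instantly via AJAX, the way Google Instant does.
I have the impression that the Levenshtein algorithm is far too slow for such a task. To check its behavior, I first implemented it in Java, printing the two Strings on every recursive call of the method.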
public class Levenshtein {
    public static void main(String[] arg){
        String a = "Hallo Zusammen";
        String b = "jfdss Zusammen";

        int res = levenshtein(a, b);

        System.out.println(res);
    }

    public static int levenshtein(String s, String t){
        int len_s = s.length();
        int len_t = t.length();
        int cost = 0;

        System.out.println("s: " + s + ", t: " + t);

        if(len_s>0 && len_t>0){
            if(s.charAt(0) != t.charAt(0)) cost = 1;
        }

        if(len_s == 0){
            return len_t;
        }else{
            if(len_t == 0){
                return len_s;
            }else{
                String news = s.substring(0, s.length()-1);
                String newt = t.substring(0, t.length()-1);
                return min(levenshtein(news, t) + 1,
                            levenshtein(s, newt) + 1,
                            levenshtein(news, newt) + cost);
            }
        }
    }

    public static int min(int a, int b, int c) {
          return Math.min(Math.min(a, b), c);
    }
}

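However, a few points are worth noting:
  • I added the check if(len_s>0 && len_t>0) because I got a StringIndexOutOfBoundsException with the test values above.
  • With the test values above, the algorithm seems to compute forever.

Is there an optimization that would make the algorithm work for me, or should I use a completely different algorithm for this task?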

5 Answers

30

1) A few words on improving the Levenshtein distance algorithm

The recursive implementation of Levenshtein distance has exponential complexity.
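To make that growth concrete, here is a minimal sketch. The class name NaiveLevenshtein and the calls counter are illustrative and not part of the question's code. It assumes the recursion should compare the last characters, since the last character is the one being stripped; the code in the question compares charAt(0) instead. Every call on two non-empty strings spawns three more calls, so the count multiplies quickly as the inputs grow.

```java
// Hypothetical demo class: counts how many calls the plain recursion makes.
class NaiveLevenshtein {
    static long calls = 0;

    static int distance(String s, String t) {
        calls++;
        if (s.isEmpty()) return t.length();
        if (t.isEmpty()) return s.length();

        // Compare the LAST characters, because the recursion removes the last character.
        int cost = s.charAt(s.length() - 1) == t.charAt(t.length() - 1) ? 0 : 1;
        String news = s.substring(0, s.length() - 1);
        String newt = t.substring(0, t.length() - 1);

        return Math.min(
                Math.min(distance(news, t) + 1, distance(s, newt) + 1),
                distance(news, newt) + cost);
    }

    public static void main(String[] args) {
        String a = "abcdefgh";
        String b = "stuvwxyz";
        // Grow both inputs one character at a time and report the call count.
        for (int n = 1; n <= a.length(); n++) {
            calls = 0;
            int d = distance(a.substring(0, n), b.substring(0, n));
            System.out.println("length " + n + ": distance " + d + ", calls " + calls);
        }
    }
}
```

The call count depends only on the two lengths, not on the characters, so the 14-character strings from the question are already far out of reach for this approach.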

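I suggest using memoization to implement the Levenshtein distance without recursion (that is, bottom-up dynamic programming), which brings the complexity down to O(N^2) time, at the cost of O(N^2) memory.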

public static int levenshteinDistance( String s1, String s2 ) {
    return dist( s1.toCharArray(), s2.toCharArray() );
}

public static int dist( char[] s1, char[] s2 ) {

    // distance matrix - to memoize distances between substrings
    // needed to avoid recursion
    int[][] d = new int[ s1.length + 1 ][ s2.length + 1 ];

    // d[i][j] holds the distance between these prefixes:
    // s1.substring(0, i) and s2.substring(0, j)
 
    for( int i = 0; i < s1.length + 1; i++ ) {
        d[ i ][ 0 ] = i;
    }
 
    for(int j = 0; j < s2.length + 1; j++) {
        d[ 0 ][ j ] = j;
    }
 
    for( int i = 1; i < s1.length + 1; i++ ) {
        for( int j = 1; j < s2.length + 1; j++ ) {
            int d1 = d[ i - 1 ][ j ] + 1;
            int d2 = d[ i ][ j - 1 ] + 1;
            int d3 = d[ i - 1 ][ j - 1 ];
            if ( s1[ i - 1 ] != s2[ j - 1 ] ) {
                d3 += 1;
            }
            d[ i ][ j ] = Math.min( Math.min( d1, d2 ), d3 );
        }
    }
    return d[ s1.length ][ s2.length ];
}

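Or, even better: you may notice that each cell of the distance matrix only needs information from the previous row, so you can reduce the memory requirement to O(N).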

public static int dist( char[] s1, char[] s2 ) {

    // memoize only previous line of distance matrix     
    int[] prev = new int[ s2.length + 1 ];
 
    for( int j = 0; j < s2.length + 1; j++ ) {
        prev[ j ] = j;
    }
 
    for( int i = 1; i < s1.length + 1; i++ ) {

        // calculate current line of distance matrix     
        int[] curr = new int[ s2.length + 1 ];
        curr[0] = i;
 
        for( int j = 1; j < s2.length + 1; j++ ) {
            int d1 = prev[ j ] + 1;
            int d2 = curr[ j - 1 ] + 1;
            int d3 = prev[ j - 1 ];
            if ( s1[ i - 1 ] != s2[ j - 1 ] ) {
                d3 += 1;
            }
            curr[ j ] = Math.min( Math.min( d1, d2 ), d3 );
        }

        // define current line of distance matrix as previous     
        prev = curr;
    }
    return prev[ s2.length ];
}
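As a further variation on the row-based version, the sketch below reuses two arrays instead of allocating a new row per iteration, and puts the shorter string along the row so memory is proportional to the shorter input. The class name TwoRowLevenshtein is illustrative; this is one possible arrangement under those assumptions, not the only one.

```java
// Hypothetical helper: same recurrence as above, with two reusable rows.
class TwoRowLevenshtein {
    static int dist(String s1, String s2) {
        // The distance is symmetric, so keep the shorter string along the row.
        if (s1.length() < s2.length()) {
            String tmp = s1;
            s1 = s2;
            s2 = tmp;
        }

        int[] prev = new int[s2.length() + 1];
        int[] curr = new int[s2.length() + 1];
        for (int j = 0; j <= s2.length(); j++) {
            prev[j] = j;
        }

        for (int i = 1; i <= s1.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= s2.length(); j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(prev[j] + 1, curr[j - 1] + 1),
                        prev[j - 1] + cost);
            }
            // Swap the rows instead of allocating a new array.
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[s2.length()];
    }

    public static void main(String[] args) {
        // The test values from the question finish immediately.
        System.out.println(dist("Hallo Zusammen", "jfdss Zusammen"));
    }
}
```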

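2) A few words on autocompletion

If all you need is to find exact (whole-string) matches, Levenshtein distance is good enough.

But what if your keyword is apple and the user typed green apples? The Levenshtein distance will be large (7). And the Levenshtein distance between apple and the nonsense string bcdfghkapple is also 7.

I suggest using a full-text search engine (e.g. Lucene). The trick is that you have to use an n-gram model to represent each keyword.

In short:
1) Represent each keyword as a document containing its n-grams: apple -> [ap,pp,pl,le]

2) After converting every keyword into a set of n-grams, index each keyword-document by n-gram in your search engine. You need to create an index like this: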

...
ap -> apple, map, happy ...
pp -> apple ...
pl -> apple, place ...
...

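3) Now you have an n-gram index. When you receive a query, split it into n-grams as well. You then have the set of n-grams for the user's query, and all you need to do is retrieve the most similar documents from your search engine. For a first draft this is already enough.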

4) For better suggestions, you can rank the search engine's results by Levenshtein distance.
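The four steps above can be sketched with a small in-memory index standing in for the search engine. Everything here is an assumption made for illustration: the class name BigramSuggester, the choice of n = 2, and the ranking rule (number of shared bigrams first, Levenshtein distance as the tie-breaker). A real engine such as Lucene would do the indexing and scoring for you.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical in-memory stand-in for a search engine's n-gram index.
class BigramSuggester {
    // bigram -> keywords that contain it
    private final Map<String, Set<String>> index = new HashMap<>();

    // Step 1: split a string into its 2-grams, e.g. apple -> [ap, pp, pl, le].
    static Set<String> bigrams(String text) {
        Set<String> grams = new LinkedHashSet<>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            grams.add(text.substring(i, i + 2));
        }
        return grams;
    }

    // Step 2: index a keyword under each of its bigrams.
    void add(String keyword) {
        for (String gram : bigrams(keyword)) {
            index.computeIfAbsent(gram, k -> new TreeSet<>()).add(keyword);
        }
    }

    // Steps 3 and 4: collect keywords sharing bigrams with the query, then rank them.
    List<String> suggest(String query, int limit) {
        Map<String, Integer> shared = new HashMap<>();
        for (String gram : bigrams(query)) {
            for (String keyword : index.getOrDefault(gram, Collections.<String>emptySet())) {
                shared.merge(keyword, 1, Integer::sum);
            }
        }
        List<String> ranked = new ArrayList<>(shared.keySet());
        ranked.sort((x, y) -> {
            // More shared bigrams first, then smaller edit distance, then alphabetical.
            int bySharing = Integer.compare(shared.get(y), shared.get(x));
            if (bySharing != 0) return bySharing;
            int byDistance = Integer.compare(levenshtein(query, x), levenshtein(query, y));
            return byDistance != 0 ? byDistance : x.compareTo(y);
        });
        return new ArrayList<>(ranked.subList(0, Math.min(limit, ranked.size())));
    }

    // Row-based Levenshtein distance, as in part 1 of this answer.
    static int levenshtein(String s1, String s2) {
        int[] prev = new int[s2.length() + 1];
        for (int j = 0; j <= s2.length(); j++) prev[j] = j;
        for (int i = 1; i <= s1.length(); i++) {
            int[] curr = new int[s2.length() + 1];
            curr[0] = i;
            for (int j = 1; j <= s2.length(); j++) {
                int cost = s1.charAt(i - 1) == s2.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(prev[j] + 1, curr[j - 1] + 1), prev[j - 1] + cost);
            }
            prev = curr;
        }
        return prev[s2.length()];
    }

    public static void main(String[] args) {
        BigramSuggester suggester = new BigramSuggester();
        for (String keyword : new String[] {"apple", "map", "happy", "place", "grape"}) {
            suggester.add(keyword);
        }
        System.out.println(suggester.suggest("green apples", 3));
    }
}
```

With this ranking, apple comes first for the query green apples because it shares four bigrams with it, even though the Levenshtein distance between the two strings is 7.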

P.S. I recommend reading the book "Introduction to Information Retrieval".


7
You can use the StringUtils.getLevenshteinDistance() function from Apache Commons Lang3:

Find the Levenshtein distance between two Strings.

This is the number of changes needed to change one String into another, where each change is a single character modification (deletion, insertion or substitution).

The previous implementation of the Levenshtein distance algorithm was from http://www.merriampark.com/ld.htm

Chas Emerick has written an implementation in Java, which avoids an OutOfMemoryError which can occur when my Java implementation is used with very large strings.

This implementation of the Levenshtein distance algorithm is from http://www.merriampark.com/ldjava.htm

 StringUtils.getLevenshteinDistance(null, *)             = IllegalArgumentException
 StringUtils.getLevenshteinDistance(*, null)             = IllegalArgumentException
 StringUtils.getLevenshteinDistance("","")               = 0
 StringUtils.getLevenshteinDistance("","a")              = 1
 StringUtils.getLevenshteinDistance("aaapppp", "")       = 7
 StringUtils.getLevenshteinDistance("frog", "fog")       = 1
 StringUtils.getLevenshteinDistance("fly", "ant")        = 3
 StringUtils.getLevenshteinDistance("elephant", "hippo") = 7
 StringUtils.getLevenshteinDistance("hippo", "elephant") = 7
 StringUtils.getLevenshteinDistance("hippo", "zzzzzzzz") = 8
 StringUtils.getLevenshteinDistance("hello", "hallo")    = 1

0

0
import java.util.Scanner;

public class Algorithmm {
    public static void main(String args[])
    {
        Scanner sc= new Scanner(System.in);
        System.out.println("Enter the correct string ");
        String correct=sc.nextLine();
        System.out.println("Enter the incorrect string ");
        String incorrect=sc.nextLine();
        int i=correct.length(),j=incorrect.length();
        ++i ; ++j;
        int a[][] = new int[i][j];
        int b[] = new int[3];       
        for (int m = 0; m < i; m++) {
            for (int n = 0; n < j; n++) {
                if (m == 0 || n == 0) {
                    // first row / first column: distance from the empty prefix
                    a[0][n] = n;
                    a[m][0] = m;
                } else {
                    b[0] = a[m-1][n-1]; b[1] = a[m-1][n]; b[2] = a[m][n-1];
                    if (correct.charAt(m-1) == incorrect.charAt(n-1)) {
                        a[m][n] = a[m-1][n-1];
                    } else {
                        // shift the smallest of the three neighbours into b[0]
                        for (int t = 0; t < 2; t++)
                            for (int u = 0; u < 2 - t; u++)
                                if (b[u] > b[u+1])
                                    b[u] = b[u+1];
                        a[m][n] = b[0] + 1;
                    }
                }
            }
        }

        // print the full distance matrix
        for (int m = 0; m < i; m++) {
            for (int n = 0; n < j; n++)
                System.out.print(a[m][n] + "  ");
            System.out.print("\n");
        }

        System.out.println(" Levenshtein distance :  " + a[i-1][j-1]);

    }

}

0
import java.util.Scanner;

public class Algorithmm {
    public static void main(String args[])
    {
        Scanner sc= new Scanner(System.in);
        System.out.println("Enter the correct string ");
        String correct=sc.nextLine();
        System.out.println("Enter the incorrect string ");
        String incorrect=sc.nextLine();
        int i=correct.length(),j=incorrect.length();
        ++i ; ++j;
        int a[][] = new int[i][j];
        int b[] = new int[3];       
        for(int m=0;m<i;m++)
            for(int n=0;n<j;n++)
            {               
                        if(m==0 || n==0)
                        {
                           a[0][n]=n;
                           a[m][0]=m;
                        }
                        else
                        {
                            b[0]=a[m-1][n-1]; b[1]=a[m-1][n]; b[2]=a[m][n-1];    
                            if ( correct.charAt(m-1) == incorrect.charAt(n-1)  )                        
                                a[m][n]=a[m-1][n-1];                                                        
                            else
                            {
                                // Instead of looping to find the smallest number in the array 'b' (as in the earlier version), compare the three values directly to reduce the execution time.

                                if(  (b[0]<=b[1]) && (b[0])<=b[2]  )
                                    a[m][n]=b[0]+1;
                                else if(  (b[1]<=b[0]) && (b[1])<=b[2]  )
                                    a[m][n]=b[1]+1;
                                else
                                    a[m][n]=b[2]+1;    
                            }                            
                        }                
            }               
        for(int m=0;m<i;m++)
        {
            for(int n=0;n<j;n++)
                System.out.print( a[m][n] +"  ");  
            System.out.print("\n");                
        }       
        System.out.println("Levenshtein distance : " + a[i-1][j-1]);
    }
}
