Java library for free-text diff

Question

java string text comparison diff

20

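I need to match up two almost-identical long free-text strings; that is, to find index-to-index correspondences wherever possible.

Because this is free text, the comparison should not be line-based as in code diffing.

Any suggestions for Java libraries?

Here is a simple example (in real life, of course, there would be no extra whitespace to line things up, and there may be more complex challenges such as entire clauses being moved around).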

The quick brown  fox jumped over the  lazy     dog.
||||||||||      |||||||||||||||||||||         |||||
The quick yellow fox jumped over the well-bred dog.

- redder

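One problem you will face is that text diff tools and libraries usually work line by line: they only decide whether lines are the same or different and, where they differ, whether other lines have been inserted or deleted. - cletus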

1

Take a look at the pseudocode for the Needleman-Wunsch algorithm (http://en.wikipedia.org/wiki/Needleman-Wunsch_algorithm), which is used in bioinformatics to align two DNA sequences. - Pierre

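Are the multiple space characters only there for presentation? Or do you really only want to find correspondences between similar sequences at the same index? If the latter, the problem is trivial: no library is needed, just write the code yourself! - Christoph

@Christoph "Are the multiple space characters only there for presentation?" -- As stated above, they are only there to help illustrate the match. - Joshua Fox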
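
To make Pierre's pointer concrete, here is a minimal sketch of Needleman-Wunsch global alignment applied to characters. The class name, the scores (+1 match, -1 mismatch, -1 gap), and the '_' gap marker are assumptions chosen for illustration, not values from the thread. The table needs O(n*m) memory, so very long texts would need a more economical variant.

```java
// Sketch of Needleman-Wunsch global alignment on characters.
// Scores and the gap marker are illustrative assumptions.
class NeedlemanWunschSketch {
    static final int MATCH = 1, MISMATCH = -1, GAP = -1;
    static final char GAP_CHAR = '_';

    static int score(char x, char y) {
        return x == y ? MATCH : MISMATCH;
    }

    // Returns the two inputs padded with GAP_CHAR so that they have equal
    // length and aligned characters share the same index.
    static String[] align(String a, String b) {
        int n = a.length(), m = b.length();
        // s[i][j] = best score for aligning a[0..i) with b[0..j)
        int[][] s = new int[n + 1][m + 1];
        for (int i = 1; i <= n; i++) s[i][0] = i * GAP;
        for (int j = 1; j <= m; j++) s[0][j] = j * GAP;
        for (int i = 1; i <= n; i++) {
            for (int j = 1; j <= m; j++) {
                int diag = s[i - 1][j - 1] + score(a.charAt(i - 1), b.charAt(j - 1));
                int up = s[i - 1][j] + GAP;    // gap in b
                int left = s[i][j - 1] + GAP;  // gap in a
                s[i][j] = Math.max(diag, Math.max(up, left));
            }
        }
        // Trace back from the bottom-right corner, building both rows in reverse.
        StringBuilder top = new StringBuilder();
        StringBuilder bottom = new StringBuilder();
        int i = n, j = m;
        while (i > 0 || j > 0) {
            if (i > 0 && j > 0
                    && s[i][j] == s[i - 1][j - 1] + score(a.charAt(i - 1), b.charAt(j - 1))) {
                top.append(a.charAt(i - 1));
                bottom.append(b.charAt(j - 1));
                i--;
                j--;
            } else if (i > 0 && s[i][j] == s[i - 1][j] + GAP) {
                top.append(a.charAt(i - 1));
                bottom.append(GAP_CHAR);
                i--;
            } else {
                top.append(GAP_CHAR);
                bottom.append(b.charAt(j - 1));
                j--;
            }
        }
        return new String[] { top.reverse().toString(), bottom.reverse().toString() };
    }

    public static void main(String[] args) {
        String[] rows = align("The quick brown fox", "The quick yellow fox");
        System.out.println(rows[0]);
        System.out.println(rows[1]);
    }
}
```

Once both rows have the same length, position k in one row corresponds to position k in the other, which is the kind of lined-up view the question draws by hand.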

4 Answers

8

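Depending on your exact requirements, the StringUtils class of the Apache Commons Lang component might be helpful, for example:

StringUtils#difference: compares two strings and returns the remainder of the second string, starting from where it differs from the first
StringUtils#getLevenshteinDistance: finds the Levenshtein distance between two strings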

- Fabian Steeg
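
For readers without Commons Lang on the classpath, the following plain-Java sketch approximates what the two helpers compute. It is a stand-in written for illustration, not the library's source; the class and method names are hypothetical and null handling is omitted.

```java
// Plain-Java stand-ins for the two Commons Lang helpers named above.
// Illustrative sketch only; inputs are assumed to be non-null.
class StringUtilsSketch {

    // Returns the tail of b starting at the first index where a and b differ;
    // empty if b has nothing left after the common prefix.
    static String difference(String a, String b) {
        int i = 0;
        int limit = Math.min(a.length(), b.length());
        while (i < limit && a.charAt(i) == b.charAt(i)) {
            i++;
        }
        return b.substring(i);
    }

    // Classic dynamic-programming Levenshtein distance using two rows.
    static int levenshtein(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;  // turning "" into b[0..j) takes j insertions
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;  // turning a[0..i) into "" takes i deletions
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(
                        Math.min(curr[j - 1] + 1, prev[j] + 1),
                        prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    public static void main(String[] args) {
        System.out.println(difference("i am a machine", "i am a robot"));
        System.out.println(levenshtein("kitten", "sitting"));
    }
}
```

Note that a difference-style helper only reports the tail after the first mismatch, and the Levenshtein distance is a single number, so neither yields an index-to-index alignment by itself.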

1

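Here is a (lightly tested) version of code that does what you asked. You can easily walk the result in parallel with the inputs to locate the insertions and deletions.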

public class StringDiff {

    private static int   length(String s) { return s == null ? 0 : s.length(); }
    private static char[] chars(String s) { return s == null ? new char[0] : s.toCharArray(); }

    private final String left;
    private final String right;

    private final char[] lccs;
    private final String lcs;

    public StringDiff(String left, String right) {
        this.left = left;
        this.right = right;
        lccs = init();
        lcs = new String(lccs);
    }

    public String getLcs()  { return lcs; }
    public char[] getLccs() { return lccs.clone(); }

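    // Computes one longest common subsequence (LCS) of left and right.
    private char[] init() {
        int lLength = length(left);
        int rLength = length(right);
        char[] lChars = chars(left);
        char[] rChars = chars(right);
        // t[i][j] holds the LCS length of the suffixes starting at lChars[i] and rChars[j];
        // the extra row and column stay 0 and stand for the empty suffix.
        int [][] t = new int [lLength + 1][rLength + 1];
        for (int i = lLength - 1; i >= 0; --i) {
            for (int j = rLength - 1; j >= 0; --j) {
                if (lChars[i] == rChars[j]) {
                    t[i][j] = t[i + 1][j + 1] + 1;
                } else {
                    t[i][j] = Math.max(t[i + 1][j], t[i][j + 1]);
                }
            }
        }
        // Walk forward from the start of both inputs, collecting matched characters.
        char[] result = new char[t[0][0]];
        int l = 0, r = 0, p = 0;
        while (l < lLength && r < rLength) {
            if (lChars[l] == rChars[r]) {
                result[p++] = lChars[l++];
                r++;
            } else {
                // Skip the character on the side that keeps the longer remaining LCS.
                if (t[l + 1][r] > t[l][r + 1]) {
                    ++l;
                } else {
                    ++r;
                }
            }
        }
        return result;
    }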

}

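According to it, the actual longest common subsequence of your original inputs:

The quick brown  fox jumped over the  lazy     dog.
The quick yellow fox jumped over the well-bred dog.

is:

The quick ow fox jumped over the l dog.

because "brown" and "yellow" both contain "ow", which therefore becomes part of the common subsequence.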
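
The parallel walk mentioned earlier in this answer can also record which positions match. The sketch below is one possible way to do that, not the answerer's code: it rebuilds the same suffix-based LCS table and returns {leftIndex, rightIndex} pairs. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: index-to-index correspondences derived from an LCS table.
class LcsAlignment {

    // Returns {leftIndex, rightIndex} pairs for every character that belongs
    // to one longest common subsequence of left and right.
    static List<int[]> align(String left, String right) {
        int n = left.length(), m = right.length();
        // t[i][j] = LCS length of left.substring(i) and right.substring(j)
        int[][] t = new int[n + 1][m + 1];
        for (int i = n - 1; i >= 0; i--) {
            for (int j = m - 1; j >= 0; j--) {
                t[i][j] = left.charAt(i) == right.charAt(j)
                        ? t[i + 1][j + 1] + 1
                        : Math.max(t[i + 1][j], t[i][j + 1]);
            }
        }
        List<int[]> pairs = new ArrayList<>();
        int i = 0, j = 0;
        while (i < n && j < m) {
            if (left.charAt(i) == right.charAt(j)) {
                pairs.add(new int[] { i, j });  // these two indices correspond
                i++;
                j++;
            } else if (t[i + 1][j] > t[i][j + 1]) {
                i++;  // left.charAt(i) has no partner: treated as deleted
            } else {
                j++;  // right.charAt(j) has no partner: treated as inserted
            }
        }
        return pairs;
    }

    public static void main(String[] args) {
        for (int[] p : align("The quick brown fox", "The quick yellow fox")) {
            System.out.println(p[0] + " <-> " + p[1]);
        }
    }
}
```

Indices that never appear in a pair are the insertions and deletions; everything else is the index-to-index correspondence.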

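With a small modification to the code above, splitting on whitespace (instead of into a char array) and replacing == with String#equals, you get a version that finds the longest common subsequence of words. For your example above, this change produces the obvious result: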

found 7 words
    'The'
    'quick'
    'fox'
    'jumped'
    'over'
    'the'
    'dog.'
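
The word-level variant described above might look like the following sketch. It is a reconstruction under assumptions, not the answerer's actual modified code: words are separated by runs of whitespace, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch: longest common subsequence over words instead of characters.
class WordLcs {

    // Splits on runs of whitespace; an empty or blank string has no words.
    static String[] words(String s) {
        String trimmed = s.trim();
        return trimmed.isEmpty() ? new String[0] : trimmed.split("\\s+");
    }

    static List<String> lcs(String left, String right) {
        String[] l = words(left);
        String[] r = words(right);
        // t[i][j] = LCS length of the word suffixes starting at l[i] and r[j]
        int[][] t = new int[l.length + 1][r.length + 1];
        for (int i = l.length - 1; i >= 0; i--) {
            for (int j = r.length - 1; j >= 0; j--) {
                t[i][j] = l[i].equals(r[j])
                        ? t[i + 1][j + 1] + 1
                        : Math.max(t[i + 1][j], t[i][j + 1]);
            }
        }
        List<String> result = new ArrayList<>();
        int i = 0, j = 0;
        while (i < l.length && j < r.length) {
            if (l[i].equals(r[j])) {
                result.add(l[i]);
                i++;
                j++;
            } else if (t[i + 1][j] > t[i][j + 1]) {
                i++;
            } else {
                j++;
            }
        }
        return result;
    }

    public static void main(String[] args) {
        List<String> common = lcs(
                "The quick brown  fox jumped over the  lazy     dog.",
                "The quick yellow fox jumped over the well-bred dog.");
        System.out.println("found " + common.size() + " words");
        for (String w : common) {
            System.out.println("    '" + w + "'");
        }
    }
}
```

Because punctuation stays attached to its word under this split, 'dog.' is compared as a single token.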

Your question implied a character-level comparison, because you matched the spaces between words.

- joel.neely

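That's not what he asked for, is it? It should be index to index, so it would not return a match for the ow in yellow and brown, because they are not at the same index. - willcodejavaforfood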

0

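If your example really is what you want to do, i.e. subsequences only match if they start at the same index (which is different from how diffs normally operate), then this is all you need to do: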

import java.util.*;

class StringDiff {
    public static List<int[]> from(String s1, String s2) {
        int start = -1;
        int pos = 0;
        LinkedList<int[]> list = new LinkedList<int[]>();

        for(; pos < s1.length() && pos < s2.length(); ++pos) {
            if(s1.charAt(pos) == s2.charAt(pos)) {
                if(start < 0) start = pos;
            }
            else {
                if(start >= 0) list.add(new int[] { start, pos });
                start = -1;
            }
        }

        if(start >= 0) list.add(new int[] { start, pos });

        return list;
    }

    public static void main(String[] args) {
        for(int[] idx : from(args[0], args[1]))
            System.out.println(args[0].substring(idx[0], idx[1]));
    }
}

A real diff implementation would be considerably more sophisticated.

- Christoph

Content provided by Stack Overflow.

- redder · Accepted Answer

This one might be a good fit: Diff Match Patch.