不分割单词的最长公共子串

Question

不分割单词的最长公共子串

3

我正在尝试实现修改版本的最长公共子串算法，其中我不想将单词分开。

例如：

String s1 = "hi there how are you";
String s2 = "hi there how ar";

输出应该是：hi there how。

我首先计算最长公共子串，然后过滤任何被分开的单词，以下是相应的代码：

private static void longestCommonSubstring(String S1, String S2) {
        int Start = 0;
        int Max = 0;
        for (int i = 0; i < S1.length(); i++) {
            for (int j = 0; j < S2.length(); j++) {
                int x = 0;
                while (S1.charAt(i + x) == S2.charAt(j + x)) {
                    x++;
                    if (((i + x) >= S1.length()) || ((j + x) >= S2.length()))
                        break;
                }
                if (x > Max) {
                    Max = x;
                    Start = i;
                }
            }
        }
        System.out.println("S1 " + S1 + " S2 " + S2 + " " + Start + " " + Max);
        System.out.println("ans " + S1.substring(Start, (Start+Max)));


        if(Start != 0){
            if((S1.charAt(Start-1) == ' ')){
                if(Start+Max != S1.length()){
                    if((S1.charAt(Start+Max) == ' ')){
                        System.out.println(S1.substring(Start, (Start + Max)));
                    }
                }
            }
            else if((S1.charAt(Start+Max) == ' ')){
                int index = S1.indexOf(' ', Start);
                System.out.println(S1.substring(index+1, (Start + Max)));
            }
            else{
                int index = S1.indexOf(' ', Start);
                int lastIndex = S1.lastIndexOf(' ', (Start+Max));
                if(index+1 != lastIndex+1){
                    System.out.println(S1.substring(index+1, lastIndex));
                }
                else{
                    System.out.println(" ");
                }
            }
        }
        else if(Start == 0 && Start+Max != S1.length() && (S1.charAt(Start+Max) == ' ')){
            System.out.println(S1.substring(Start, (Start + Max)));
        }
        else if(Start+Max != S1.length()){
            int index = S1.lastIndexOf(' ', (Start+Max));
            System.out.println(S1.substring(Start, index));
        }
        else{
            System.out.println(S1.substring(Start, (Start + Max)));
        }


    }

但是对于以下情况，它会失败：

String s1 = "hi there how are you";
String s2 = "i ere ho ar you";

输出应该是"you"，但是却是空的，因为最长公共子串是"ere ho"。

感谢您的帮助。

- bna

2

它不会失败，它会完全按照你目前编写的代码执行。如果你只想要完整的单词，那就检查完整的单词而不仅仅是 charAt。 - Murat Karagöz

维基百科上的一个例子看起来比你的例子更清晰，所以分享一下 https://en.wikipedia.org/wiki/Longest_common_substring_problem#Example - Srinath Ganesh

可能是Java实现n个字符串最长公共子串的问题的重复。 - austin wernli

3个回答

0

你想要做的是将文本分割成单词，然后比较整个单词，而不是单个字母。

- Dakkaron

@austinwernli “不要拆分单词”并不意味着不将字符串拆分为单词，而是比较完整的单词。老实说，这很容易实现，只需将原始字符串拆分为由完整单词组成的字符串数组即可。 - crigore

@austinwernli：你读了问题吗？OP想要找到最长的公共子串，而不需要分割单词。也就是说，他不希望在匹配中出现任何部分单词。没有部分单词=只有完整单词。拆分字符串并不意味着拆分单词。更具体地说，我说“将文本拆分为单词”=>不要拆分单词，“然后比较完整的单词”=>不要比较部分单词。 - Dakkaron

也许 @bna 应该澄清他是如何定义“我不想拆分单词”的，而不是让其他人就此争论。 - Srinath Ganesh

@SrinathGanesh 我认为从bna的问题所提供的示例中可以很清楚地看出意思。我同意“without splitting words”这个短语用法不太恰当，但是目前我没有更好的建议。 - ReyCharles

1

抱歉造成困惑：@Dakkaron是正确的。答案中不应包含部分单词。 - bna

0

这将有助于找到长度，我认为找到实际字符串也并不那么困难。

使用您的输入创建单词数组

String s1 = "hi there how are you" => X[] ={"hi","there","how","are","you"}

String s1 = "hi there how are you" => Y[] ={"hi","there","how","ar"}

现在在您的数组中查找LCS

     LCS(X, Y, m, n) = LCS(X, Y, m-1, n-1) + 1 if X[m-1].equals(Y[n-1])
                       0  Otherwise (if !X[m-1].equals(Y[n-1]))

- Karthik

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- tucuxi · Accepted Answer

根据卡尔蒂克的观点，并在BNA指出任何基于字符的答案都存在缺陷后：

public static ArrayList<String> stringToTokens(String s) {
    ArrayList<String> al = new ArrayList<String>();
    StringBuilder word = new StringBuilder();
    boolean inWord = !s.isEmpty() && Character.isLetter(s.charAt(0));
    for (int i=0; i<s.length(); i++) {
        char c = s.charAt(i);
        if (Character.isLetter(c)) {
            word.append(c);
        } else if (inWord) {
            if (word.length() > 0) {
                al.add(word.toString());
                word.setLength(0);
            }
            al.add(""+c);
        } else {
            al.add(""+c);
        }
    }
    if (word.length() > 0) {
        al.add(word.toString());
    }
    return al;
}

public static String longestCommonWithWords(String sa, String sb) {
    ArrayList<String> a = stringToTokens(sa);
    ArrayList<String> b = stringToTokens(sb);
    int m[][] = new int[a.size()][b.size()];
    int bestLength = 0;
    HashSet<Integer> bestSet = new HashSet<Integer>();
    for (int i=0; i<a.size(); i++) {
        for (int j=0; j<b.size(); j++) {
            if (a.get(i).equals(b.get(j))) {
                m[i][j] = (i==0 || j==0) ? 1 : m[i-1][j-1]+1;
                if (m[i][j] > bestLength) {
                    bestLength = m[i][j];
                    bestSet.clear();
                    bestSet.add(i);
                } else if (m[i][j] == bestLength) {
                    bestSet.add(i);
                }
            } else {
                m[i][j] = 0;
            }
        }
    }
    // return first result (or empty string if none)
    StringBuilder result = new StringBuilder();
    if (bestLength > 0) {
        int end = bestSet.iterator().next();
        int start = end - bestLength;
        for (int i=start; i<end; i++) {
            result.append(a.get(i+1));
        }
    }
    return result.toString();
}

标记化很简单（StringTokenizer 也可以工作，但这个版本处理奇怪字符更好）。LCS 算法直接来自于 https://en.wikipedia.org/wiki/Longest_common_substring_problem#Pseudocode 中的伪代码。