How do I split a string using a list of words?

Question

4

How can I split a string using a predefined list of strings, adding spaces between the pieces?

For example:

List of strings: words = {"hello", "how", "are", "you"}; string to split: text = "hellohowareyou"

public static String separateText(String text, List<String> words) {
    String new_text;

    for (String word : words) {
        if (text.startsWith(word)) {
            String suffix = text.substring(word.length());  //'suffix' is the 'text' without it's first word
            new_text += " " + word;  //add the first word of the 'string'
            separateString(suffix, words);
        }
    }
    
    return new_text;
}

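And new_text should come back as "hello how are you".

Note that the words list may be in a different order and may contain more words, like a dictionary.

How would I do this recursively, if that is needed?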

- Afonso Hipólito

1

According to Oracle's Java naming conventions, you should use camelCase for local variable names. - Ewan Brown

3

Do you have a rule for resolving ambiguity when the text is "ihoweverywhere" and the dictionary contains words such as ["ever", "every", "how", "however", "where", "everywhere"] in no defined order? - dshelya

1

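If the word list is ["now", "here", "no", "where"], how should "nowhere" be split? As "no" and "where", because the shortest word is found first? And if choosing the shortest word leads to a dead end, e.g. the input is "nownow" and choosing "no" leaves "wnow", does backtracking need to be supported? Or is choosing the shortest word (if that is the behavior) guaranteed never to need backtracking? - Bohemian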

1

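@Ma3x Without a clear rule set from the OP, sorting does not help much (for example, "rangeofanorange" cannot be separated correctly no matter what order we supply [an, of, or, range, orange] in). Apart from that, your solution has O(n * m) time complexity (where m is the text length and n is the dictionary size), and I agree it is at least a good starting point. - dshelya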

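That is why I said my solution goes through the word list in the order given, which lets the OP implement some rule sets. If the OP has a different specification (one that cannot be supported by ordering the word list alone and needs a specific implementation, or even backtracking), they will tell us. With list priority, "nowhere" is split into "now here", and in the "rangeofanorange" case the last word is wrongly identified as an unknown word. I am still waiting for any specification update from the OP. - Ma3x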
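
The backtracking question raised in these comments can be made concrete with a small sketch. It is not taken from any answer on this page: the class name BacktrackingSplit and its methods are hypothetical, and it assumes that every character of the text must be covered by a dictionary word. Words are tried in list order, a choice is undone when it leads to a dead end, and results are memoized by start index so each position is solved only once.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class BacktrackingSplit {

    // Returns the text with spaces between dictionary words, or the text
    // unchanged if no complete split exists (an assumption of this sketch).
    static String separateText(String text, List<String> words) {
        List<String> parts = split(text, words, 0, new HashMap<>());
        return parts == null ? text : String.join(" ", parts);
    }

    // Returns the words of one complete split of text starting at index start,
    // or null when the rest of the text cannot be covered by dictionary words.
    private static List<String> split(String text, List<String> words, int start,
                                      Map<Integer, List<String>> memo) {
        if (start == text.length()) {
            return new ArrayList<>();      // nothing left: successful end
        }
        if (memo.containsKey(start)) {
            return memo.get(start);        // already solved (may be a cached dead end)
        }
        List<String> result = null;
        for (String word : words) {        // list order acts as priority
            if (!word.isEmpty() && text.startsWith(word, start)) {
                List<String> rest = split(text, words, start + word.length(), memo);
                if (rest != null) {        // this choice works, keep it
                    result = new ArrayList<>();
                    result.add(word);
                    result.addAll(rest);
                    break;
                }
                // otherwise backtrack and try the next word
            }
        }
        memo.put(start, result);           // null is cached too, marking a dead end
        return result;
    }

    public static void main(String[] args) {
        System.out.println(separateText("nownow", List.of("no", "now")));
        System.out.println(separateText("hellohowareyou", List.of("you", "are", "how", "hello")));
    }
}
```

With the "nownow" example, taking "no" first leaves "wnow", which is a dead end, so the sketch falls back to "now" and prints "now now". The second call prints "hello how are you" even though the list is in reverse order.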

5 Answers

1

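How can I split a string using a predefined list of strings, with spaces in between?

Basically, you have already started. Check whether the remaining text starts with any word from the list, remove that leading word, and keep the suffix.

You do all of that, but you decided to try calling separateText recursively instead of just keeping the suffix and continuing to iterate.

That is also a possibility, but plain iteration until the suffix (the remaining text) is empty is enough.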

With a loop like while (index < text.length()), you can handle longer inputs even when the words come in a different order.

public String separateText(String text, List<String> words){
    if (text == null) return "";
    if (words == null || words.isEmpty()) return text;

    StringBuilder sb = new StringBuilder();

    boolean unknownWord = false;
    int index = 0;
    while (index < text.length()) {
        boolean wordFound = false;
        for (String word : words) {
            if (!word.isEmpty() && text.startsWith(word, index)) {
                wordFound = true;
                // move the index ahead just past the last letter of the word found
                index += word.length();
                if (unknownWord) {
                    unknownWord = false;
                    sb.append(" ");
                }
                sb.append(word);
                sb.append(" ");
                break;
            }
        }
        if (!wordFound) {
            unknownWord = true;
            sb.append(text.charAt(index));
            index++;
        }
    }

    return sb.toString();
}

- Ma3x

1

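This code is not viable because it assumes that a) the words dictionary "knows" every word that can occur in the text, and b) the words in the dictionary appear in the same order as in the text. As a result, separateText("iamremoteserver", asList("server", "am", "i")) or separateText("iamremoteserver", asList("extra")) both end in an infinite loop. The code could also be improved by replacing + with StringBuilder and renaming the local variables to camelCase. - dshelya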

1

This does not work for me, it loops forever. - Afonso Hipólito

1

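Sorry (to both commenters), but the question says: "Note that the order of the words list may differ, and it may contain more words, like a dictionary." It does not say the input may contain other words. If that is the case, please edit the question to specify it. - Ma3x

@dshelya Regarding a): I went by the description in the original post; if the specification is different, the OP will edit the question. Regarding b): that is not an assumption at all, and it works for any order. As for StringBuilder, I fully agree. - Ma3x

@Ma3x For b): yes, my mistake, it will work (although it puts a space as the first character). - dshelya

@AfonsoHipólito I have switched to StringBuilder, made it cope with arbitrary input, added support for unknown words, and optimized the iteration to avoid creating substrings. Please let me know whether it does the job now. - Ma3x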

1

This solution is simple, but not memory-optimal, because it creates many new Strings.

public static String separate(String str, Set<String> words) {
    for (String word : words)
        str = str.replace(word, word + ' ');

    return str.trim();
}

Demo

Set<String> words = Set.of("hello", "how", "are", "you");
System.out.println(separate("wow hellohowareyouhellohowareyou", words));
// wow hello how are you hello how are you

Another solution uses a StringBuilder, which is better for performance.

public static String separate(String str, Set<String> words) {
    List<String> res = new LinkedList<>();
    StringBuilder buf = new StringBuilder();

    for (int i = 0; i < str.length(); i++) {
        buf.append(str.charAt(i));

        if (str.charAt(i) == ' ' || words.contains(buf.toString())) {
            res.add(buf.toString().trim());
            buf.delete(0, buf.length());
        }
    }

    return String.join(" ", res);
}

- oleg.cherednik

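Simple and easy to follow (but not recursive). If a word can occur more than once, it will not work as expected. - c0der

1

@c0der Afonso Hipólito asked whether recursion is necessary. It is not needed here. str.replace() replaces all occurrences, so a word that occurs several times in the given string is replaced every time. - oleg.cherednik

You are right. It is tagged recursion, so that needs explaining. Overall, the question is not well defined. - c0der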
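
The repeated str.replace() calls discussed in these comments scan the string once per word and build a new String each time. A possible single-pass variant, offered as a sketch and not as part of the answer, joins the words into one regular-expression alternation and inserts a space after every match. The class name RegexSeparate is hypothetical, and the sketch assumes that the list order expresses priority, because the regex engine tries the alternatives from left to right at each position.

```java
import java.util.List;
import java.util.regex.Pattern;
import java.util.stream.Collectors;

class RegexSeparate {

    // Inserts a space after every occurrence of a dictionary word,
    // scanning the input once from left to right.
    static String separate(String str, List<String> words) {
        String alternation = words.stream()
                .filter(w -> !w.isEmpty())   // an empty alternative would match everywhere
                .map(Pattern::quote)         // treat each word literally, not as regex syntax
                .collect(Collectors.joining("|"));
        if (alternation.isEmpty()) {
            return str;                      // no usable words: nothing to separate
        }
        // $0 is the whole match, so each matched word is followed by one space
        return Pattern.compile(alternation).matcher(str).replaceAll("$0 ").trim();
    }

    public static void main(String[] args) {
        List<String> words = List.of("hello", "how", "are", "you");
        System.out.println(separate("wow hellohowareyouhellohowareyou", words));
        // wow hello how are you hello how are you
    }
}
```

Text that matches no word, such as "wow " here, is copied through unchanged. When words overlap, the earlier list entry wins, so putting "now" before "no" splits "nownow" into "now now".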

0

For a recursive approach, try the following:

public static String separateText(String text, List<String> words){
    return separateText(text, words, new StringBuilder());
}

public static String separateText(String text, List<String> words, StringBuilder result){

    for(String word : words){
        if (text.startsWith(word)){
           result.append(word).append(" ");
           text = text.substring(word.length());
           ArrayList<String> newList = new ArrayList<>(words);
           newList.remove(word);
           separateText(text, newList, result);
           break;
        }
    }

    return result.toString().trim();
}

- c0der

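By the way, ArrayList<String> newList = new ArrayList<>(words); should be List<String> newList = new ArrayList<>(words);. It is also inefficient: imagine you have 1 million words and copy the list on every iteration. Another collection type would be better here. - oleg.cherednik

List<String> newList = new ArrayList<>(words); could be used, but I do not see much benefit. Which other collection would you use? - c0der

A queue: remove a word before the recursive call, then add it back afterwards. - oleg.cherednik

I would not do that, nor would I use recursion on a problem with 1M words. - c0der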

2

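This solution assumes that all words are known and present in the dictionary. If at least one word is unknown, the result is empty :( Not sure whether that is a problem. - dshelya

The solution does assume that all words are in the dictionary. Yes, if a word is not in the dictionary, it cannot be split into anything meaningful. The dictionary may contain many other words, and may contain multiple entries of the same word. - c0der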

0

import java.util.*;

public class Main {
    public static void main(String[] args) throws Exception {
        // You must sort this list by word length, or the result may be wrong,
        // because a shorter word could match before a longer one.
        // In this example the list is already in a suitable order.
        List<String> words = Arrays.asList("hello", "how", "are", "you");
        List<String> detectedWords = new ArrayList<>();
        String text = "hellohowareyou";
        int i = 0;
        while (i < text.length()) {
            Optional<String> wordOpt = Optional.empty();

            for (String word : words) {
                // match the word at position i only; indexOf would also accept a later occurrence
                if (text.startsWith(word, i)) {
                    wordOpt = Optional.of(word);
                    break;
                }
            }
            if (wordOpt.isPresent()) {
                String wordFound = wordOpt.get();
                i += wordFound.length();
                detectedWords.add(wordFound);
            }
        }
        String result = String.join(" ", detectedWords);
        System.out.println(result);
    }
}

I assume:

Your text is never null
Your text matches the regular expression ^(hello|how|are|you)+$
Your words must be sorted
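
The sorting requirement is stated but not shown. A minimal sketch of one way to meet it follows, assuming "sorted" means longest word first so that a longer word is tried before any shorter word it starts with. The class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

class LongestFirst {

    // Returns a copy of the word list ordered from longest to shortest.
    // The sort is stable, so words of equal length keep their original order.
    static List<String> sortLongestFirst(List<String> words) {
        List<String> sorted = new ArrayList<>(words);  // copy: the input may be immutable
        sorted.sort(Comparator.comparingInt(String::length).reversed());
        return sorted;
    }

    public static void main(String[] args) {
        System.out.println(sortLongestFirst(List.of("no", "now", "where", "here")));
        // [where, here, now, no]
    }
}
```

Longest-first ordering only decides which word wins at a given position. A greedy match can still run into a dead end on some inputs, so it does not replace a clear rule set from the asker.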

- user-id-14900042

- Ewan Brown · Accepted Answer

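This should do what you are asking for

If you find yourself appending to a string repeatedly, use a StringBuilder
Use a while loop to walk through text, removing one word at a time, and stop when text is empty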

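public static String separateText(String text, List<String> words){
    StringBuilder newTextBuilder = new StringBuilder();

    outerLoop:
    while(text.length() > 0){
        for(String word : words){
            // skip empty words, which would match without consuming any text
            if(!word.isEmpty() && text.startsWith(word)){
                newTextBuilder.append(word).append(" ");
                text = text.substring(word.length());
                continue outerLoop;
            }
        }
        // no word matches the remaining text: keep it as-is and stop,
        // otherwise this loop would never terminate
        newTextBuilder.append(text);
        break;
    }

    return newTextBuilder.toString();
}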