Java regex: removing repeated substrings from a string

Question

5

I'm trying to build a regular expression in Java that "reduces" repeated consecutive substrings in a string. For example, for the following input:

The big black dog big black dog is a friendly friendly dog who lives nearby nearby.

I'd like to get the following output:

The big black dog is a friendly dog who lives nearby.

This is the code I have so far:

String input = "The big black dog big black dog is a friendly friendly dog who lives nearby nearby.";

Pattern dupPattern = Pattern.compile("((\\b\\w+\\b\\s)+)\\1+", Pattern.CASE_INSENSITIVE);
Matcher matcher = dupPattern.matcher(input);

while (matcher.find()) {
    input = input.replace(matcher.group(), matcher.group(1));
}

This works well for all of the repeated substrings except the one at the end of the sentence:

The big black dog is a friendly dog who lives nearby nearby.

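I understand that my regex requires whitespace after every word in the substring, which means it can't catch cases where a word is followed by a period instead of a space. I can't seem to find a workaround. I've tried adjusting the capture groups and changing the regex to look for a space or a period instead of just a space, but that only works when every repetition of the substring is followed by a period ("nearby.nearby.").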

Can someone point me in the right direction? Ideally, the input to this method will be short paragraphs, not just a single line.
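
The diagnosis above can be checked directly. The following is a minimal sketch that lists everything the pattern from the question matches; the class name `TrailingSpaceDiagnosis` and the helper `findMatches` are made-up names for illustration. Because every word in the group must be followed by whitespace, the final `nearby nearby.` never appears among the matches:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class TrailingSpaceDiagnosis {
    // The pattern from the question, unchanged.
    static final Pattern DUP =
            Pattern.compile("((\\b\\w+\\b\\s)+)\\1+", Pattern.CASE_INSENSITIVE);

    // Collects the full text of every match found in the input.
    static List<String> findMatches(String text) {
        List<String> matches = new ArrayList<>();
        Matcher m = DUP.matcher(text);
        while (m.find()) {
            matches.add(m.group());
        }
        return matches;
    }

    public static void main(String[] args) {
        String input = "The big black dog big black dog is a friendly friendly dog who lives nearby nearby.";
        // Brackets make the trailing space of each match visible.
        for (String match : findMatches(input)) {
            System.out.println("[" + match + "]");
        }
        // Prints:
        // [big black dog big black dog ]
        // [friendly friendly ]
    }
}
```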

- ak_charlie

1

Do you have to use regex, or are you just interested in an efficient solution? - Jan B.

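I don't actually have to use regex; I just thought regex could easily find repeated phrases rather than only repeated words. Any other solution is welcome too! - ak_charlie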

2 Answers

2

Combining the comments from @Thomas Ayoub and @Matt.

public class Test2 {
    public static void main(String args[]){
        String input = "The big big black dog big black dog is a friendly friendly dog who lives nearby nearby.";
        String result = input.replaceAll("\\b([ \\w]+)\\1", "$1");
        while(!input.equals(result)){
            input = result;
            result = input.replaceAll("\\b([ \\w]+)\\1", "$1");
        }
        System.out.println(result);
    }
}
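
To illustrate why the code above repeats the replacement until nothing changes, here is a minimal sketch using the same pattern; the class name `PassByPass` and the helpers `collapseOnce` and `collapseAll` are hypothetical names chosen for this illustration. On this input a single pass collapses `big big` first, which consumes text that the longer repetition `big black dog big black dog` would need, so that repetition is only caught on the following pass:

```java
// Hypothetical helper class, for illustration only.
class PassByPass {
    // Same pattern as in the answer above.
    static final String DUP = "\\b([ \\w]+)\\1";

    // Applies the replacement exactly once over the whole string.
    static String collapseOnce(String text) {
        return text.replaceAll(DUP, "$1");
    }

    // Re-applies the replacement until the string stops changing.
    // Every successful replacement shortens the string, so this terminates.
    static String collapseAll(String text) {
        String previous;
        do {
            previous = text;
            text = collapseOnce(text);
        } while (!text.equals(previous));
        return text;
    }

    public static void main(String[] args) {
        String input = "The big big black dog big black dog is a friendly friendly dog who lives nearby nearby.";
        // After one pass "big black dog big black dog" is still present.
        System.out.println(collapseOnce(input));
        // After repeating until stable, all repetitions are gone.
        System.out.println(collapseAll(input));
    }
}
```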

- Eugene

Why introduce the result variable? - Thomas Ayoub

@ThomasAyoub Well, perhaps for better readability. What do you think? - Eugene

- Thomas Ayoub · Accepted Answer

You can use

input.replaceAll("([ \\w]+)\\1", "$1");

See the live demo:

import java.io.*;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class Ideone
{
    public static void main (String[] args) throws java.lang.Exception
    {
        String input = "The big black dog big black dog is a friendly friendly dog who lives nearby nearby.";

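        Pattern dupPattern = Pattern.compile("([ \\w]+)\\1", Pattern.CASE_INSENSITIVE);

        // Re-apply the replacement until the string stops changing, so that
        // repetitions exposed by an earlier pass are collapsed as well.
        String previous;
        do {
            previous = input;
            input = dupPattern.matcher(input).replaceAll("$1");
        } while (!input.equals(previous));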
        System.out.println(input);

    }
}
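
One caveat with `([ \\w]+)\\1` is that it works on characters rather than whole words, so it also collapses doubled letters inside a word (`"hello"` becomes `"helo"`). A possible word-level variant is sketched below. It is an assumption layered on top of the answer, not part of it, and the class name `WordLevelDedup`, the constant `REPEATED_PHRASE` and the method `collapseRepeats` are hypothetical. The pattern captures one or more whole words and requires the repetition to end at a word boundary, so trailing punctuation such as the final period no longer gets in the way:

```java
import java.util.regex.Pattern;

// Hypothetical helper class, for illustration only.
class WordLevelDedup {
    // Group 1: one or more whole words separated by whitespace.
    // Then: one or more repetitions of group 1, each preceded by whitespace
    // and ending at a word boundary (so "nearby nearby." still matches).
    static final Pattern REPEATED_PHRASE = Pattern.compile(
            "\\b(\\w+(?:\\s+\\w+)*)(?:\\s+\\1\\b)+",
            Pattern.CASE_INSENSITIVE);

    // Collapses immediately repeated words or phrases, repeating the
    // replacement until the text stops changing.
    static String collapseRepeats(String text) {
        String previous;
        do {
            previous = text;
            text = REPEATED_PHRASE.matcher(text).replaceAll("$1");
        } while (!text.equals(previous));
        return text;
    }

    public static void main(String[] args) {
        String input = "The big black dog big black dog is a friendly friendly dog who lives nearby nearby.";
        System.out.println(collapseRepeats(input));
        // Doubled letters inside a word are left alone by the word-level pattern.
        System.out.println(collapseRepeats("hello bookkeeper"));
        // For comparison, the character-level pattern prints "helo".
        System.out.println("hello".replaceAll("([ \\w]+)\\1", "$1"));
    }
}
```

This sketch only handles repetitions separated by whitespace; a phrase repeated across punctuation (for example `nearby. nearby.`) is left untouched. The pattern backtracks heavily, so it is meant for short paragraphs like those described in the question rather than large documents.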