Regular expression for finding duplicate consecutive words in Java

Question


20

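I found an answer on Stack Overflow for finding duplicate words in a string. But when I use it, it treats "This" and "is" as the same word and removes "is".

Regex

"\\b(\\w+)\\b\\s+\\1"

Any idea why this happens?

Here is the code I use to remove the duplicates.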

public static String RemoveDuplicateWords(String input)
{
    String originalText = input;
    String output = "";
    Pattern p = Pattern.compile("\b(\w+)\b\s+\b\1\b", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE); 
    //Pattern p = Pattern.compile("\\b(\\w+)\\b\\s+\\1", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(input);
    if (!m.find())
        output = "No duplicates found, no changes made to data";
    else
    {
        while (m.find())
        {
            if (output == "")
                output = input.replaceFirst(m.group(), m.group(1));
            else
                output = output.replaceAll(m.group(), m.group(1));
        }
        input = output;
        m = p.matcher(input);
        while (m.find())
        {
            output = "";
            if (output == "")
                output = input.replaceAll(m.group(), m.group(1));
            else
                output = output.replaceAll(m.group(), m.group(1));
        }
    }
    return output;
}

- user1190265

1

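I think it should be \b(\w+)\b\s+\1\b, otherwise it will treat 'ice' and 'icecream' as duplicates. - Niall Byrne

http://rubular.com/r/Qr3twc03RR (I tweaked it again; it looks like a word-boundary problem... \b(\w+)\b\s+\b\1\b) - Niall Byrne

Adding another word boundary at the end works fine for me. But even without it, your regex should not match "This is". Your problem is probably somewhere else, though I can't imagine where. - Alan Moore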

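Although you already have an answer, you might consider changing your approach. A basic tokenizer and a set-like structure would be easier to understand and probably more efficient too. - M Platvoet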
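A rough sketch of the tokenizer idea from the comment above. This is an interpretation, not the commenter's code: because the question is about consecutive duplicates, it compares each token with the previous one instead of keeping a set. The class and method names are hypothetical. It also assumes that whitespace may be normalized to single spaces and that punctuation stays attached to its token, so "example example." would not be collapsed.

```java
import java.util.ArrayList;
import java.util.List;

class TokenDedupe {
    // Splits on whitespace and drops any token that equals the token
    // immediately before it, ignoring case.
    static String removeConsecutiveDuplicates(String input) {
        List<String> kept = new ArrayList<>();
        String previous = null;
        for (String token : input.trim().split("\\s+")) {
            if (previous == null || !token.equalsIgnoreCase(previous)) {
                kept.add(token);
            }
            previous = token;
        }
        // Note: the original spacing is lost; tokens are rejoined with one space.
        return String.join(" ", kept);
    }

    public static void main(String[] args) {
        System.out.println(removeConsecutiveDuplicates("This is is an example example of duplicate"));
    }
}
```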

2

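The regex is correct now, but you need to double all those backslashes again. As it stands, the code won't even compile. Also, you're doing a lot of unnecessary work. The whole method can be written as return input.replaceAll("(?i)\\b(\\w+)\\s+\\1\\b", "$1");. - Alan Moore

@user1190265: I hope the problem is solved now... - Fahim Parkar

7 Answers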

10

The pattern below matches repeated words, no matter how many times they occur in a row.

Pattern.compile("\\b(\\w+)(\\b\\W+\\b\\1\\b)*", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE);

For example, "This is my my my friend friend friend friend" will output "This is my friend".

Also, a single pass of "while (m.find())" is enough with this pattern.
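A minimal sketch of how a single `while (m.find())` pass could be written with this pattern. The class name `RepeatedRunDedupe`, the method name `collapse`, and the use of `appendReplacement`/`appendTail` are assumptions made for illustration; the answer does not show its loop body.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedRunDedupe {
    private static final Pattern REPEATS = Pattern.compile(
            "\\b(\\w+)(\\b\\W+\\b\\1\\b)*",
            Pattern.MULTILINE + Pattern.CASE_INSENSITIVE);

    // One pass over the input: every match is a word plus any immediate
    // repeats of it, and the whole match is replaced by the first word.
    static String collapse(String input) {
        Matcher m = REPEATS.matcher(input);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            m.appendReplacement(out, Matcher.quoteReplacement(m.group(1)));
        }
        m.appendTail(out); // copy whatever follows the last match
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(collapse("This is my my my friend friend friend friend"));
    }
}
```

Because the trailing group uses `*`, the pattern also matches every non-repeated word; those matches are simply replaced by themselves.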

- user5393067

9

You should use \b(\w+)\b\s+\b\1\b...

I hope this is what you want...

Update 1

Well, the output you get is

After removing duplicate the final string is

import java.util.regex.*;

public class MyDup {
    public static void main (String args[]) {
    String input="This This is text text another another";
    String originalText = input;
    String output = "";
    Pattern p = Pattern.compile("\\b(\\w+)\\b\\s+\\b\\1\\b", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE);
    Matcher m = p.matcher(input);
    System.out.println(m);
    if (!m.find())
        output = "No duplicates found, no changes made to data";
    else
    {
        while (m.find())
        {
            if (output == "") {
                output = input.replaceFirst(m.group(), m.group(1));
            } else {
                output = output.replaceAll(m.group(), m.group(1));
            }
        }
        input = output;
        m = p.matcher(input);
        while (m.find())
        {
            output = "";
            if (output == "") {
                output = input.replaceAll(m.group(), m.group(1));
            } else {
                output = output.replaceAll(m.group(), m.group(1));
            }
        }
    }
    System.out.println("After removing duplicate the final string is " + output);
    }
}

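Run the code above and look at the output... that should answer your question...

Note

In the output, you are replacing the duplicated words with a single word... aren't you?

When I put System.out.println(m.group() + " : " + m.group(1)); in the first if condition, the output I get is text text : text, i.e. the duplicated word is being replaced by a single word.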

else
    {
        while (m.find())
        {
            if (output == "") {
                System.out.println(m.group() + " : " + m.group(1));
                output = input.replaceFirst(m.group(), m.group(1));
            } else {

I hope you now understand what is happening... :)

Good luck!!! Cheers!!!

- Fahim Parkar

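Thanks, I'll try it... regular expressions always trip me up. - user1190265

It still doesn't work; the "is" in "This is" still gets removed for "\nThis is is an example example of duplicate." with the following code: Pattern p = Pattern.compile("\b(\w+)\b\s+\b\1\b", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE); //Pattern p = Pattern.compile("\b(\w+)\b\s+\1", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE); Matcher m = p.matcher(input); - user1190265

If I don't use double backslashes, I get the error: 54: illegal escape character. Pattern p = Pattern.compile("\b(\w+)\b\s+\b\1\b", Pattern.MULTILINE+Pattern.CASE_INSENSITIVE); - user1190265

The double backslashes are necessary because the regex is written as a Java string literal. Please don't try to take the discussion off-site. Any source code we need should be included in the question, where everyone can see it. @OP, code snippets don't belong in comments either. Please edit your question and add the code there. - Alan Moore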
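To illustrate the point about string literals, here is a small sketch. The class name `EscapeDemo` and its members are hypothetical. In Java source, "\\b" is the two characters backslash and b, which the regex engine reads as a word boundary, while "\b" is a single backspace character and "\w" is not a legal escape at all, which is what the compiler complains about.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // Two characters: a backslash followed by 'b' (regex word boundary).
    static final String WORD_BOUNDARY = "\\b";
    // One character: the backspace control character.
    static final String BACKSPACE = "\b";

    // The question's regex with doubled backslashes so that it compiles.
    static final String DUPLICATE_WORD = "\\b(\\w+)\\b\\s+\\b\\1\\b";

    static boolean findsDuplicate(String input) {
        return Pattern.compile(DUPLICATE_WORD, Pattern.CASE_INSENSITIVE)
                .matcher(input).find();
    }

    public static void main(String[] args) {
        System.out.println(WORD_BOUNDARY.length()); // 2
        System.out.println(BACKSPACE.length());     // 1
        System.out.println(DUPLICATE_WORD);         // \b(\w+)\b\s+\b\1\b
        System.out.println(findsDuplicate("This is is fine")); // true
    }
}
```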

It works for a single line, but when I read in a file containing several lines of text, it strips out too much. When the input is "This is is is is is is is an example example example of duplicate.\nThis is is another another example example.", the output is: "This an example of duplicate. This another example." - user1190265


5

\b(\w+)(\b\W+\1\b)*

Explanation:

\b : Any word boundary
(\w+) : Select one or more word characters (letters, digits, underscore)

Once the word has been selected, it is time to select its repeats.

( : Grouping starts
\b : Any word boundary
\W+ : One or more non-word characters
\1 : Select the repeated word
\b : Do not match if the repeated word is joined to another word
) : Grouping ends

Reference : Example
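The breakdown above can be checked with a short sketch that prints what each group captures. The class name `GroupInspector` and the sample input are assumptions made for illustration. No flags are set, so matching is case-sensitive here.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class GroupInspector {
    // Returns "full match -> group 1" for each match that contains a repeat.
    static List<String> describeRepeats(String input) {
        Pattern p = Pattern.compile("\\b(\\w+)(\\b\\W+\\1\\b)*");
        Matcher m = p.matcher(input);
        List<String> found = new ArrayList<>();
        while (m.find()) {
            // Group 2 is null when the word was matched with zero repeats.
            if (m.group(2) != null) {
                found.add(m.group() + " -> " + m.group(1));
            }
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(describeRepeats("hello hello world world world"));
    }
}
```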

- imbond

1

This should be the accepted answer, because it explains everything in detail. - Nam G VU

2

If Unicode matters, then you should use this:

 Pattern.compile("\\b(\\w+)(\\b\\W+\\b\\1\\b)*",
        Pattern.MULTILINE + Pattern.CASE_INSENSITIVE + Pattern.UNICODE_CHARACTER_CLASS)
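A small sketch of the flag in use, assuming a helper class `UnicodeDedupe` that is not part of the answer. Without `Pattern.UNICODE_CHARACTER_CLASS`, `\w` covers only ASCII letters, digits, and the underscore, so a word such as "café" is not seen as a single word. The accented letter is written as a `\u00e9` escape to keep the source ASCII-only.

```java
import java.util.regex.Pattern;

class UnicodeDedupe {
    private static final Pattern REPEATS = Pattern.compile(
            "\\b(\\w+)(\\b\\W+\\b\\1\\b)*",
            Pattern.MULTILINE + Pattern.CASE_INSENSITIVE + Pattern.UNICODE_CHARACTER_CLASS);

    // Replaces each word plus its immediate repeats with the first occurrence.
    static String collapse(String input) {
        return REPEATS.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        // "caf\u00e9" is "café"
        System.out.println(collapse("caf\u00e9 caf\u00e9 caf\u00e9 au lait"));
    }
}
```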

- András

2

I think this is the regex you should use to detect 2 consecutive words separated by any number of non-word characters:

Pattern p = Pattern.compile("\\b(\\w+)\\b\\W+\\b\\1\\b", Pattern.CASE_INSENSITIVE);
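The practical difference from the `\s+` version in the question shows up when punctuation sits between the two words. A hedged sketch, with a hypothetical class name and sample inputs:

```java
import java.util.regex.Pattern;

class SeparatorComparison {
    // Separator is whitespace only, as in the question.
    static final Pattern WHITESPACE_ONLY =
            Pattern.compile("\\b(\\w+)\\b\\s+\\b\\1\\b", Pattern.CASE_INSENSITIVE);
    // Separator is any run of non-word characters, as in this answer.
    static final Pattern ANY_NON_WORD =
            Pattern.compile("\\b(\\w+)\\b\\W+\\b\\1\\b", Pattern.CASE_INSENSITIVE);

    static boolean hasDuplicate(Pattern p, String input) {
        return p.matcher(input).find();
    }

    public static void main(String[] args) {
        String input = "Yes, yes it works";
        System.out.println(hasDuplicate(WHITESPACE_ONLY, input)); // false
        System.out.println(hasDuplicate(ANY_NON_WORD, input));    // true
    }
}
```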

- anubhava

0

You can also try this regex, which finds only the repeated words

(?i)\\b(\\w+)(\\b\\W+\\b\\1\\b){1,}
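Because the trailing group must occur at least once ({1,}), `find()` skips words that are not repeated. A minimal sketch listing the matches; the class name `RepeatedOnly` and the sample input are assumptions.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class RepeatedOnly {
    // Collects every run of a word followed by one or more repeats of itself.
    static List<String> findRepeats(String input) {
        Matcher m = Pattern.compile("(?i)\\b(\\w+)(\\b\\W+\\b\\1\\b){1,}").matcher(input);
        List<String> hits = new ArrayList<>();
        while (m.find()) {
            hits.add(m.group());
        }
        return hits;
    }

    public static void main(String[] args) {
        System.out.println(findRepeats("This is is a test test test"));
    }
}
```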

- Ryan Berrio Cardona


- Mina Wissa · Accepted Answer

Try this:

String pattern = "(?i)\\b([a-z]+)\\b(?:\\s+\\1\\b)+";
Pattern r = Pattern.compile(pattern, Pattern.CASE_INSENSITIVE);

String input = "your string";
Matcher m = r.matcher(input);
while (m.find()) {
    input = input.replaceAll(m.group(), m.group(1));
}
System.out.println(input);

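Java regular expressions are explained well in the API documentation of the Pattern class. Here is the regex with some spaces added to separate its parts: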

"(?i) \\b ([a-z]+) \\b (?: \\s+ \\1 \\b )+"

\b       match a word boundary
[a-z]+   match a word with one or more characters;
         the parentheses capture the word as a group    
\b       match a word boundary
(?:      indicates a non-capturing group (which starts here)
\s+      match one or more white space characters
\1       is a back reference to the first (captured) group;
         so the word is repeated here
\b       match a word boundary
)+       indicates the end of the non-capturing group and
         allows it to occur one or more times
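As a possible variation on the loop above, the same pattern can be applied in one call through `Matcher.replaceAll` with a `$1` back-reference. This is a sketch, not the answer's own code; the class name `AcceptedPatternSketch` is hypothetical. Since the word is matched with `[a-z]+`, repeated numbers are left alone.

```java
import java.util.regex.Pattern;

class AcceptedPatternSketch {
    private static final Pattern REPEATED_WORD =
            Pattern.compile("\\b([a-z]+)\\b(?:\\s+\\1\\b)+", Pattern.CASE_INSENSITIVE);

    // Replaces each word and its whitespace-separated repeats with group 1,
    // i.e. the first occurrence of the word.
    static String collapse(String input) {
        return REPEATED_WORD.matcher(input).replaceAll("$1");
    }

    public static void main(String[] args) {
        System.out.println(collapse("Goodbye goodbye GooDbYe world world"));
        System.out.println(collapse("42 42")); // unchanged: digits are not in [a-z]
    }
}
```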