Backreferences inside a character class intersection in Java regular expressions

3
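I am trying to create a regular expression in Java that captures the letter pattern of a particular word, so that I can find other words with the same pattern. For example, the word "tooth" has the pattern 12213, because both 't' and 'o' are repeated. I would like the regex to match words such as "teeth".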
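For illustration, the numbering described above can be computed by assigning each distinct letter the index of its first appearance. This is only a minimal sketch of that idea; the class name `WordPattern` and the method `patternOf` are hypothetical and not part of the question, and the single-digit output is only unambiguous for words with at most nine distinct letters.

```java
import java.util.HashMap;
import java.util.Map;

class WordPattern {
    // Hypothetical helper: replaces each letter with the 1-based index
    // of that letter's first appearance in the word.
    static String patternOf(String word) {
        Map<Character, Integer> firstSeen = new HashMap<>();
        StringBuilder sb = new StringBuilder();
        for (char c : word.toCharArray()) {
            Integer id = firstSeen.get(c);
            if (id == null) {
                id = firstSeen.size() + 1;
                firstSeen.put(c, id);
            }
            sb.append(id);
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        System.out.println(patternOf("tooth")); // 12213
        System.out.println(patternOf("teeth")); // 12213
        System.out.println(patternOf("tooto")); // 12212
    }
}
```

Two words have the same shape exactly when this helper returns the same string for both, which is the property the regex below is meant to test.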
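Below is my attempt, using backreferences. In this particular example, the match should fail if the second letter is the same as the first. In addition, the last letter should be different from all of the other letters.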
String regex = "([a-z])([a-z&&[^\1]])\\2\\1([a-z&&[^\1\2]])";
Pattern p = Pattern.compile(regex);
Matcher m = p.matcher("tooth");

//This works as expected
assertTrue(m.matches());

m.reset("tooto");
//This should return false, but instead returns true
assertFalse(m.matches());

I have verified that if I remove the last group (i.e. the following), it works on examples such as "toot", so I know the backreferences work up to that point:

String regex = "([a-z])([a-z&&[^\1]])\\2\\1";

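But if I add the last group back to the end of the pattern, it is as if the backreferences inside the square brackets are no longer recognized.

Am I doing something wrong, or is this a bug?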

2 Answers

4

Try this:

(?i)\b(([a-z])(?!\2)([a-z])\3\2(?!\3)[a-z]+)\b

Explanation

(?i)           # Match the remainder of the regex with the options: case insensitive (i)
\b             # Assert position at a word boundary
(              # Match the regular expression below and capture its match into backreference number 1
   (              # Match the regular expression below and capture its match into backreference number 2
      [a-z]          # Match a single character in the range between “a” and “z”
   )
   (?!            # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
      \2             # Match the same text as most recently matched by capturing group number 2
   )
   (              # Match the regular expression below and capture its match into backreference number 3
      [a-z]          # Match a single character in the range between “a” and “z”
   )
   \3             # Match the same text as most recently matched by capturing group number 3
   \2             # Match the same text as most recently matched by capturing group number 2
   (?!            # Assert that it is impossible to match the regex below starting at this position (negative lookahead)
      \3             # Match the same text as most recently matched by capturing group number 3
   )
   [a-z]          # Match a single character in the range between “a” and “z”
      +              # Between one and unlimited times, as many times as possible, giving back as needed (greedy)
)
\b             # Assert position at a word boundary

Code

try {
    Pattern regex = Pattern.compile("(?i)\\b(([a-z])(?!\\2)([a-z])\\3\\2(?!\\3)[a-z]+)\\b");
    Matcher regexMatcher = regex.matcher(subjectString);
    while (regexMatcher.find()) {
        for (int i = 1; i <= regexMatcher.groupCount(); i++) {
            // matched text: regexMatcher.group(i)
            // match start: regexMatcher.start(i)
            // match end: regexMatcher.end(i)
        }
    } 
} catch (PatternSyntaxException ex) {
    // Syntax error in the regular expression
}
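As a self-contained way to try the snippet above, the following sketch collects the whole-word matches from a sample sentence. The class name `FindDemo`, the method `findWords`, and the sample text are assumptions made for illustration; only the pattern itself comes from the answer.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class FindDemo {
    // The pattern from the answer, written as a Java string literal.
    static final Pattern WORD = Pattern.compile(
            "(?i)\\b(([a-z])(?!\\2)([a-z])\\3\\2(?!\\3)[a-z]+)\\b");

    // Hypothetical helper: returns every whole word matched in the text.
    static List<String> findWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher m = WORD.matcher(text);
        while (m.find()) {
            words.add(m.group(1)); // group 1 spans the entire word
        }
        return words;
    }

    public static void main(String[] args) {
        // "tooto" is skipped because its fifth letter repeats the middle letter.
        System.out.println(findWords("tooth teeth tooto deeds")); // [tooth, teeth, deeds]
    }
}
```

Note that this pattern only checks that the fifth letter differs from the repeated middle letter, and it allows any number of trailing letters, so a word such as "toott" is also reported. If the last letter must differ from every earlier letter, a second lookahead against group 2 would be needed.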

Hope this helps.


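You're right, backreferences don't work inside character classes. But there's no need to shout about it. ;) - Alan Moore
Sorry, I was just eager to know whether there was something real that I didn't know about! - Cylian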

4
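If you print your regex, you get a hint: the single-backslash backreferences inside your character classes are consumed by Java's string-literal escaping (\1 and \2 are octal escapes) and become odd control characters, so the pattern does not work as expected. For example: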
m.reset("oooto");
System.out.println(m.matches());

also prints

true
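The escaping issue can be checked directly. The sketch below (the class name `EscapeDemo` and its members are hypothetical) shows that "\1" in a Java string literal is a single control character, so the character classes in the original pattern exclude only U+0001 and U+0002 and still accept any lowercase letter.

```java
import java.util.regex.Pattern;

class EscapeDemo {
    // "\1" is an octal escape: one character with code point 1.
    static final String OCTAL = "\1";
    // "\\1" is two characters: a backslash followed by the digit 1.
    static final String ESCAPED = "\\1";

    // The original pattern from the question, unchanged.
    static final Pattern ORIGINAL = Pattern.compile(
            "([a-z])([a-z&&[^\1]])\\2\\1([a-z&&[^\1\2]])");

    static boolean matchesOriginal(String word) {
        return ORIGINAL.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(OCTAL.length());            // 1
        System.out.println((int) OCTAL.charAt(0));     // 1
        System.out.println(ESCAPED.length());          // 2
        // The last class never sees a backreference, so the final 'o' is accepted.
        System.out.println(matchesOriginal("tooto"));  // true
    }
}
```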

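Also, backreferences are not recognized inside a character class, so the && intersection cannot do this job; you have to use a negative lookahead instead. The following expression works for the example above: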

String regex = "([a-z])(?!\\1)([a-z])\\2\\1(?!(\\1|\\2))[a-z]";

The expression (?!\\1) looks ahead to make sure the next character is not the one captured by the first group, without moving the regex cursor forward.
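To see the lookahead version in action, here is a minimal sketch that runs it against the words discussed in this thread. The class name `LookaheadDemo` and the method `sameShape` are hypothetical; the pattern is the one given in this answer.

```java
import java.util.regex.Pattern;

class LookaheadDemo {
    // Lookahead-based pattern from the answer, for the shape 12213.
    static final Pattern SHAPE = Pattern.compile(
            "([a-z])(?!\\1)([a-z])\\2\\1(?!(\\1|\\2))[a-z]");

    // Hypothetical helper: true if the whole word has the shape 12213.
    static boolean sameShape(String word) {
        return SHAPE.matcher(word).matches();
    }

    public static void main(String[] args) {
        System.out.println(sameShape("tooth")); // true
        System.out.println(sameShape("teeth")); // true
        System.out.println(sameShape("tooto")); // false: last letter repeats 'o'
        System.out.println(sameShape("oooto")); // false: second letter repeats the first
    }
}
```

Each negative lookahead is placed immediately before the letter it constrains, which is what the intersection syntax in the question was trying to express.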

When I run my unit test on "oooto" with the original regex, it returns false, not true as you said. However, the regex you suggested seems to work the way I need. Thanks. :) - beldenge
