How can I add missing features to Java's regex implementation?

30

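I'm new to Java. As a .NET developer, I'm very used to the Regex class in .NET. Java's regex implementation is not bad, but it lacks some key features.

I wanted to create my own helper class for Java, but I thought one might already exist. So is there a free and easy-to-use regex product available for Java, or should I create one myself?

If I do write my own class, where do you think I should share it so that others can use it?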


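[Edit]

There were complaints that I had not described the problems with the current Regex classes. I'll try to clarify my question.

In .NET, using regular expressions is easier than in Java. Since both languages are object-oriented and very similar in many respects, I expected a similar experience with regexes in both. Unfortunately, that is not the case.


Here is some code compared in Java and C#:

In C#: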

string source = "The colour of my bag matches the color of my shirt!";
string pattern = "colou?r";

foreach(Match match in Regex.Matches(source, pattern))
{
    Console.WriteLine(match.Value);
}

In Java:

String source = "The colour of my bag matches the color of my shirt!";
String pattern = "colou?r";
Pattern p = Pattern.compile(pattern);
Matcher m = p.matcher(source);

while(m.find())
{
    System.out.println(source.substring(m.start(), m.end()));
}

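I tried to be fair to both languages in the sample code above.

The first thing to notice here is the .Value member of the Match class (compared to using .start() and .end() in Java).

Why should I have to create two objects when I could call a single static function such as Regex.Matches or Regex.Match?

In more advanced usage the difference shows even more. Look at the Groups method, dictionary length, Capture, Index, Length, Success, and so on. These are all very necessary features that, in my opinion, should be available in Java too.

Of course, all of these features can be added manually with a custom proxy (helper) class. That is the main reason I asked this question. We don't have the ease of regexes in Perl, but at least we can use the .NET approach to regexes, which I think is very cleverly designed.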
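For reference, much of what .NET's Match exposes has a rough counterpart on java.util.regex.Matcher and MatchResult. The following is only a minimal sketch, assuming a hypothetical helper class named MatchDetails (it is not part of the JDK); it collects one MatchResult snapshot per match so that the value, index, length, and capture groups can be read afterwards.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.MatchResult;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper, not part of the JDK: gathers every match as a MatchResult.
public class MatchDetails {

    public static List<MatchResult> allMatches(String source, String regex) {
        List<MatchResult> results = new ArrayList<>();
        Matcher m = Pattern.compile(regex).matcher(source);
        while (m.find()) {
            // toMatchResult() takes a snapshot that later find() calls do not affect.
            results.add(m.toMatchResult());
        }
        return results;
    }

    public static void main(String[] args) {
        String source = "The colour of my bag matches the color of my shirt!";
        for (MatchResult r : allMatches(source, "col(ou?)r")) {
            System.out.println(r.group()                   // roughly Match.Value
                    + " index=" + r.start()                // roughly Match.Index
                    + " length=" + (r.end() - r.start())   // roughly Match.Length
                    + " group1=" + r.group(1));            // roughly Match.Groups[1].Value
        }
    }
}
```

There is no direct counterpart to Match.Success here; in this sketch an empty list plays that role.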


2
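What kind of "helper" do you have in mind? This may already be addressed in the current version, which has a lot of new features. - tchrist
There is no specific list here of the problems you are concerned about. - bmargulies
Can you give an example to make the question more specific? - Tim Post
7
source.substring(m.start(), m.end()) should be the same as m.group(). - Robert
5 Answers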

124

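From your edited example, I can now see what you want, and you have my sympathy. Java's regexes are a long way from the convenience of those in Ruby or Perl, and they pretty much always will be. This cannot be fixed, so in Java we are stuck with this mess forever. Other JVM languages do better here, Groovy especially. But they still suffer from some of the inherent flaws and can only go so far.

Where to begin? There are the so-called convenience methods of the String class: matches, replaceAll, replaceFirst, and split. These can sometimes be OK in small programs, depending on how you use them. However, they do have several problems, which it appears you have discovered. Here is a partial list of those problems, along with what can and cannot be done about them.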

  1. The inconvenience method is very bizarrely named “matches” but it requires you to pad your regex on both sides to match the entire string. This counter-intuitive sense is contrary to any sense of the word match as used in any previous language, and constantly bites people. Patterns passed into the other 3 inconvenience methods work very unlike this one, because in the other 3, they work like normal patterns work everywhere else; just not in matches. This means you can’t just copy your patterns around, even within methods in the same darned class for goodness’ sake! And there is no find convenience method to do what every other matcher in the world does. The matches method should have been called something like FullMatch, and there should have been a PartialMatch or find method added to the String class.

  2. There is no API that allows you to pass in Pattern.compile flags along with the strings you use for the 4 pattern-related convenience methods of the String class. That means you have to rely on string versions like (?i) and (?x), but those do not exist for all possible Pattern compilation flags. This is highly inconvenient to say the least.

  3. The split method does not return the same result in edge cases as split returns in the languages that Java borrowed split from. This is a sneaky little gotcha. How many elements do you think you should get back in the return list if you split the empty string, eh? Java manufactures a fake return element where there should be none, which means you can’t distinguish between legit results and bogus ones. It is a serious design flaw that splitting on a ":", you cannot tell the difference between inputs of "" vs of ":". Aw, gee! Don’t people ever test this stuff? And again, the broken and fundamentally unreliable behavior is unfixable: you must never change things, even broken things. It’s not ok to break broken things in Java the way it is anywhere else. Broken is forever here.

  4. The backslash notation of regexes conflicts with the backslash notation used in strings. This makes it superduper awkward, and error-prone, too, because you have to constantly add lots of backslashes to everything, and it’s too easy to forget one and get neither warning nor success. Simple patterns like \b\w+\b become nightmares in typographical excess: "\\b\\w+\\b". Good luck with reading that. Some people use a slash-inverter function on their patterns so that they can write that as "/b/w+/b" instead. Other than reading in your patterns from a string, there is no way to construct your pattern in a WYSIWYG literal fashion; it’s always heavy-laden with backslashes. Did you get them all, and enough, and in the right places? If so, it makes it really really hard to read. If it isn’t, you probably haven’t gotten them all. At least JVM languages like Groovy have figured out the right answer here: give people 1st-class regexes so you don’t go nuts. Here’s a fair collection of Groovy regex examples showing how simple it can and should be.

  5. The (?x) mode is deeply flawed. It doesn’t take comments in the Java style of // COMMENT but rather in the shell style of # COMMENT. It doesn’t work with multiline strings. It doesn’t accept literals as literals, forcing the backslash problems listed above, which fundamentally compromises any attempt at lining things up, like having all comments begin on the same column. Because of the backslashes, you either make them begin on the same column in the source code string and screw them up if you print them out, or vice versa. So much for legibility!

  6. It is incredibly difficult, and indeed fundamentally unfixably broken, to enter Unicode characters in a regex. There is no support for symbolically named characters like \N{QUOTATION MARK}, \N{LATIN SMALL LETTER E WITH GRAVE}, or \N{MATHEMATICAL BOLD CAPITAL C}. That means you’re stuck with unmaintainable magic numbers. And you cannot even enter them by code point, either. You cannot use \u0022 for the first one because the Java preprocessor makes that a syntax error. So then you move to \\u0022 instead, which works until you get to the next one, \\u00E8, which cannot be entered that way or it will break the CANON_EQ flag. And the last one is a pure nightmare: its code point is U+1D402, but Java does not support the full Unicode set using their code point numbers in regexes, forcing you to get out your calculator to figure out that that is \uD835\uDC02 or \\uD835\\uDC02 (but not \\uD835\uDC02), madly enough. But you cannot use those in character classes due to a design bug, making it impossible to match say, [\N{MATHEMATICAL BOLD CAPITAL A}-\N{MATHEMATICAL BOLD CAPITAL Z}] because the regex compiler screws up on the UTF-16. Again, this can never be fixed or it will change old programs. You cannot even get around the bug by using the normal workaround to Java’s Unicode-in-source-code troubles by compiling with javac -encoding UTF-8, because the stupid thing stores the strings as nasty UTF-16, which necessarily breaks them in character classes. OOPS!

  7. Many of the regex things we’ve come to rely on in other languages are missing from Java. There are no named groups, for example, nor even relatively-numbered ones. This makes constructing larger patterns out of smaller ones fundamentally error prone. There is a front-end library that allows you to have simple named groups, and indeed this will finally arrive in production JDK7. But even so there is no mechanism for what to do with more than one group by the same name. And you still don’t have relatively numbered buffers, either. We’re back to the Bad Old Days again, stuff that was solved aeons ago.

  8. There is no support for a linebreak sequence, which is one of the only two “Strongly Recommended” parts of the standard, which suggests that \R be used for such. This is awkward to emulate because of its variable-length nature and Java’s lack of support for graphemes.

  9. The character class escapes do not work on Java’s native character set! Yes, that’s right: routine stuff like \w and \s (or rather, "\\w" and "\\b") does not work on Unicode in Java! This is not the cool sort of retro. To make matters worse, Java’s \b (make that "\\b", which isn’t the same as "\b") does have some Unicode sensibility, although not what the standard says it must have. So for example a string like "élève" will never in Java match the pattern \b\w+\b, and not merely in entirety per Pattern.matches, but indeed at no point whatsoever as you might get from Pattern.find. This is just so screwed up as to beggar belief. They’ve broken the inherent connection between \w and \b, then misdefined them to boot!! It doesn’t even know what Unicode Alphabetic code points are. This is supremely broken, and they can never fix it because that would change the behavior of existing code, which is strictly forbidden in the Java Universe. The best you can do is create a rewrite library that acts as a front end before it gets to the compile phase; that way you can forcibly migrate your patterns from the 1960s into the 21st century of text processing.

  10. The only two Unicode properties supported are the General Categories and the Block properties. The general category properties only support the abbreviations like \p{Sk}, contrary to the standard’s Strong Recommendation to also allow \p{Modifier Symbol}, \p{Modifier_Symbol}, etc. You don’t even get the required aliases the standard says you should. That makes your code even more unreadable and unmaintainable. You will finally get support for the Script property in production JDK7, but that is still seriously short of the minimum set of 11 essential properties that the Standard says you must provide for even the minimal level of Unicode support.

  11. Some of the meagre properties that Java does provide are faux amis: they have the same names as official Unicode property names, but they do something altogether different. For example, Unicode requires that \p{alpha} be the same as \p{Alphabetic}, but Java makes it the archaic and no-longer-quaint 7-bit alphabetics only, which is more than 4 orders of magnitude too few. Whitespace is another flaw: if you use the Java version that masquerades as Unicode whitespace, your UTF-8 parsers will break because of their NO-BREAK SPACE code points, which Unicode normatively requires be deemed whitespace, but Java ignores that requirement, so breaks your parser.

  12. There is no support for graphemes, the way \X normally provides. That renders impossible innumerably many common tasks that you need and want to do with regexes. Not only are extended grapheme clusters out of your reach, because Java supports almost none of the Unicode properties, you cannot even approximate the old legacy grapheme clusters using the standard (?:\p{Grapheme_Base}\p{Grapheme_Extend}*). Not being able to work with graphemes makes even the simplest sorts of Unicode text processing impossible. For example, you cannot match a vowel irrespective of diacritic in Java. The way you do this in a language with grapheme support varies, but at the very least you should be able to throw the thing into NFD and match (?:(?=[aeiou])\X). In Java, you cannot do even that much: graphemes are beyond your reach. And that means Java cannot even handle its own native character set. It gives you Unicode and then makes it impossible to work with it.

  13. The convenience methods in the String class do not cache the compiled regex. In fact, there is no such thing as a compile-time pattern that gets syntax-checked at compile time, which is when syntax checking is supposed to occur. That means your program, which uses nothing but constant regexes fully understood at compile time, will bomb out with an exception in the middle of its run if you forget a little backslash here or there as one is wont to do due to the flaws previously discussed. Even Groovy gets this part right. Regexes are far too high-level a construct to be dealt with by Java’s unpleasant after-the-fact, bolted-on-the-side model, and they are far too important to routine text processing to be ignored. Java is much too low-level a language for this stuff, and it fails to provide the simple mechanics out of which you might yourself build what you need: you can’t get there from here.

  14. The String and Pattern classes are marked final in Java. That completely kills any possibility of using proper OO design to extend those classes. You can’t create a better version of a matches method by subclassing and replacement. Heck, you can’t even subclass! Final is not a solution; final is a death sentence from which there is no appeal.
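Several of the points in this list are easy to check for yourself. The following is only a minimal sketch, assuming a hypothetical class name StringRegexPitfalls; it uses nothing beyond documented String and java.util.regex methods and touches on item 1 (matches versus find), item 2 (flags require Pattern.compile), item 3 (split on the empty string), and item 13 (compile once, reuse).

```java
import java.util.regex.Pattern;

public class StringRegexPitfalls {

    // Item 13: compile once and reuse, rather than recompiling on every String.matches call.
    // Item 2: the flag is passed explicitly, which the String convenience methods cannot do.
    private static final Pattern COLOUR =
            Pattern.compile("colou?r", Pattern.CASE_INSENSITIVE);

    // Item 1: a partial ("find") match, for which String offers no convenience method.
    public static boolean containsMatch(String input) {
        return COLOUR.matcher(input).find();
    }

    public static void main(String[] args) {
        String text = "My COLOUR";

        // String.matches must match the whole string and takes no flags argument.
        System.out.println(text.matches("colou?r"));
        // Emulating find: embedded (?i) plus padding on both sides.
        System.out.println(text.matches("(?i).*colou?r.*"));
        // The precompiled pattern with an explicit flag.
        System.out.println(containsMatch(text));

        // Item 3: how many elements come back for the empty string and a lone delimiter.
        System.out.println("".split(":").length);
        System.out.println(":".split(":").length);
    }
}
```

Printing the array lengths shows exactly what split hands back for these two inputs on your own JDK, which is the edge case item 3 is about.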

Finally, to show you just how fragile Java's regexes really are, consider this multiline pattern, which exhibits many of the flaws already described:

   String rx =
          "(?= ^ \\p{Lu} [_\\pL\\pM\\d\\-] + $)\n"
        + "   # next is a big can't-have set    \n"
        + "(?! ^ .*                             \n"
        + "    (?: ^     \\d+              $    \n"
        + "      | ^ \\p{Lu} - \\p{Lu}     $    \n"
        + "      | Invitrogen                   \n"
        + "      | Clontech                     \n"
        + "      | L-L-X-X    # dashes ok       \n"
        + "      | Sarstedt                     \n"
        + "      | Roche                        \n"
        + "      | Beckman                      \n"
        + "      | Bayer                        \n"
        + "    )      # end alternatives        \n"
        + "    \\b    # only on a word boundary \n"
        + ")          # end negated lookahead   \n"
        ;

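Do you see how unnatural that is? You have to put literal newlines in your strings; you have to use non-Java comments; you cannot make anything line up because of the extra backslashes; you have to use definitions of things that don't work right on Unicode. There are many more problems beyond that.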
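To see those constraints on a smaller scale, here is a minimal sketch, assuming a hypothetical year-month pattern and class name (neither comes from the pattern above). It compiles a commented pattern with Pattern.COMMENTS, the flag form of (?x), and shows why every fragment needs its own "\n": a # comment runs to the end of the line inside the regex string, not the line in the Java source.

```java
import java.util.regex.Pattern;

public class CommentedPatternDemo {

    // Hypothetical pattern written in (?x) style.
    // Each fragment ends in \n so that a '#' comment stops at the end of its own line;
    // without it, the comment would swallow the rest of the pattern.
    private static final Pattern YEAR_MONTH = Pattern.compile(
              "(\\d{4})   # four-digit year       \n"
            + "-          # literal separator     \n"
            + "(\\d{2})   # two-digit month       \n",
            Pattern.COMMENTS);

    public static boolean isYearMonth(String input) {
        return YEAR_MONTH.matcher(input).matches();
    }

    public static void main(String[] args) {
        System.out.println(isYearMonth("2011-07"));
        // Whitespace is ignored in the pattern, not in the input.
        System.out.println(isYearMonth("2011 - 07"));
    }
}
```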

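Not only are there no plans to fix almost any of these grievous flaws, it is all but impossible to fix any of them, because doing so would change old programs. Even the normal tools of OO design are forbidden to you, because everything is locked down with the finality of a death sentence and cannot be fixed.

So if you think Java's clumsy regexes are too broken for reliable and convenient regex processing, I cannot gainsay you. Sorry, but that is just the way it is.

“Fixed in the Next Release!”

Just because some things can never be fixed does not mean that nothing can ever be fixed. It just has to be done very carefully. Here are the things I know of that are already fixed in current JDK7 or proposed JDK8 builds: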

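  1. The Unicode Script property is now supported. You may use any of the equivalent forms \p{Script=Greek}, \p{sc=Greek}, \p{IsGreek}, or \p{Greek}. This is inherently superior to the old clunky block properties. It means you can do things like [\p{Latin}\p{Common}\p{Inherited}], which is quite important.

  2. There is a workaround for the UTF-16 bug. You can now specify any Unicode code point by its number using the \x{⋯} notation, such as \x{1D402}. This works even inside character classes, finally allowing [\x{1D400}-\x{1D419}] to work properly. You must still double-backslash it, though, and it only works in regexes, not in strings in general.

  3. Named groups are now supported via the standard notation (?<NAME>⋯) to create one and \k<NAME> to backreference it. These still contribute to numeric group numbers, too. However, you cannot have more than one of them with the same name in the same pattern, nor can you use them for recursion.

  4. A new Pattern compile flag, Pattern.UNICODE_CHARACTER_CLASSES, and its associated embeddable switch, (?U), will now swap around all the definitions of things like \w, \b, \p{alpha}, and \p{punct}, so that they conform to the definitions of those things required by the Unicode Standard.

  5. The missing or misdefined binary properties \p{IsLowercase}, \p{IsUppercase}, and \p{IsAlphabetic} are now supported, and these correspond to methods in the Character class. This is important because Unicode makes a significant and pervasive distinction between mere letters and cased or alphabetic code points. These key properties are among those 11 essential properties that are absolutely required for Level 1 compliance with UTS#18, “Unicode Regular Expressions”, without which you really cannot work with Unicode.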
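The fixes listed above can be tried out directly on Java 7 or later. This is only a minimal sketch, assuming a hypothetical class name Jdk7RegexFeatures and made-up sample inputs; the non-ASCII test strings are written with escapes so the file compiles regardless of source encoding.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Jdk7RegexFeatures {

    // Item 3: named groups. Returns the "year" group, or null when nothing matches.
    public static String year(String input) {
        Matcher m = Pattern.compile("(?<year>\\d{4})-(?<month>\\d{2})").matcher(input);
        return m.find() ? m.group("year") : null;
    }

    // Item 4: the same \w+ pattern with and without UNICODE_CHARACTER_CLASSES.
    public static boolean isWord(String input, boolean unicode) {
        int flags = unicode ? Pattern.UNICODE_CHARACTER_CLASSES : 0;
        return Pattern.compile("\\w+", flags).matcher(input).matches();
    }

    // Item 2: supplementary code points given by number, inside a character class.
    public static boolean isMathBoldCapital(String input) {
        return Pattern.compile("[\\x{1D400}-\\x{1D419}]").matcher(input).matches();
    }

    // Item 1: the Script property, using the documented Is-prefixed form.
    public static boolean isGreek(String input) {
        return Pattern.compile("\\p{IsGreek}+").matcher(input).matches();
    }

    public static void main(String[] args) {
        String eleve = "\u00e9l\u00e8ve";                        // the French word from item 9
        String boldC = new String(Character.toChars(0x1D402));  // MATHEMATICAL BOLD CAPITAL C
        System.out.println(year("released 2011-07"));
        System.out.println(isWord(eleve, false)); // ASCII-only definition of \w
        System.out.println(isWord(eleve, true));  // Unicode definition of \w
        System.out.println(isMathBoldCapital(boldC));
        System.out.println(isGreek("\u03b1\u03b2\u03b3"));
    }
}
```

The sketch deliberately sticks to the \p{IsGreek} spelling, which the Pattern documentation lists; it does not exercise the other spellings mentioned in item 1.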

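These enhancements and fixes are very important to finally have, and I am glad, even excited, to have them. But for industrial-strength, state-of-the-art regex and/or Unicode work, I will not be using Java. There is still too much missing from Java's Unicode model after 20 years to get real work done if you dare to use the character set that Java gives you. And Java's regexes are a bolted-on model that has never worked; you have to start from scratch, the way Groovy did. Sure, it might work for very limited applications whose small customer base is confined to English-language monoglots in rural Iowa with no need for any characters beyond what an old-style telegraph could send. But for how many projects is that really true? Fewer even than you think, it turns out. It is for this very reason that a certain (and obvious) multi-billion-dollar international deployment was just recently cancelled. Java's Unicode support, not just in regexes but throughout, proved too weak for the needed internationalization to be done reliably in Java. Because of this, they were forced to scale back from their originally planned worldwide deployment to a merely U.S. one. It is positively parochial. And no, they are not happy; would you be?

Java has had 20 years to get it right, and they demonstrably have not done so thus far, so I would not hold my breath. Or throw good money after bad; the lesson here is to ignore the hype and instead apply due diligence to make very sure that all the necessary infrastructure support is there before you invest too much. Otherwise you may find yourself stuck without any real options once you are too far in to rescue your project.

Caveat emptor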


5
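+1 Wow, now I know why I never took the time to learn Java. Now go get on with writing the next edition of Programming Perl (http://www.amazon.com/Programming-Perl-3rd-Larry-Wall/dp/0596000278/) so I can buy it, read it, and finally learn Perl! (So many languages, so little time...) - ridgerunner
4
At first this post did not seem to answer my question, but the more I read of your answer, the more I enjoyed it. A very knowledgeable answer, and yes, it does answer my question. Thank you for all the information. Before reading your answer (or should I say article), I did not know half of these problems. - Alireza Noori
3
Like everyone who uses Java regexes for non-toy applications, I have written my own Java front-end library for some of these problems. But the trouble is that many of them can only be solved in the compiler itself, which is what Groovy does. I suggest you get what mileage you can out of your front-end class (too bad you can't use OO!). Good luck! - tchrist
1
Yes, that is the answer you gave me. Thank you for your time and your excellent answer. I ran into some similar Unicode problems in my small project, but I thought they were my own fault. Good to know my programming skills were not to blame! In the end I solved them with a few "tricks". Your post helped me in many ways. Thanks again. - Alireza Noori
1
This may be off-topic, but I have been working with regexes in Groovy anyway, and it seems JRuby must have solved these problems, because the patterns I want to use work fine in Ruby. Anyway, just my two cents ;-) - will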

12

One can rant, or one can simply write:

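import java.util.Iterator;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class Regex {

    /**
     * @param source
     *        the string to scan
     * @param pattern
     *        the regular expression to scan for
     * @return the matched substrings, in the order they occur in source
     */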
    public static Iterable<String> matches(final String source, final String pattern) {
        final Pattern p = Pattern.compile(pattern);
        final Matcher m = p.matcher(source);
        return new Iterable<String>() {
            @Override
            public Iterator<String> iterator() {
                return new Iterator<String>() {
                    @Override
                    public boolean hasNext() {
                        return m.find();
                    }
                    @Override
                    public String next() {
                        return source.substring(m.start(), m.end());
                    }    
                    @Override
                    public void remove() {
                        throw new UnsupportedOperationException();
                    }
                };
            }
        };
    }

}

Use it as you wish:

public class RegexTest {

    @Test
    public void test() {
       String source = "The colour of my bag matches the color of my shirt!";
       String pattern = "colou?r";
       for (String match : Regex.matches(source, pattern)) {
           System.out.println(match);
       }
    }
}

1
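In my opinion (and I think this is strongly supported by the definition of "iterator"), hasNext() should not change the state of the iterator, whereas next() should (for example by caching the result and state?). - Maarten Bodewes
Very good point. I think this can be done by adding a boolean member hasNext. A call to hasNext() simply returns the existing value. Then next(), after calling source.substring(), sets hasNext = m.find(). Finally, an Iterator constructor must be written to set hasNext = m.find() the first time. - Alistair A. Israel
Not so sure. I think you need to search again, and if hasNext has already been called you may need to cache the search result. Without searching you do not know whether there is a next value, and the actual next() should certainly advance the search. - Maarten Bodewes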
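The caching approach discussed in these comments could look roughly like the following. It is only a sketch under stated assumptions: the class name CachingRegex and the two boolean fields are hypothetical, and it is one possible reading of the suggestions, not the answer author's code. hasNext() runs find() at most once per element and remembers the result, so calling it repeatedly no longer skips matches.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical variant of the Regex helper with an idempotent hasNext().
public class CachingRegex {

    public static Iterable<String> matches(final String source, final String pattern) {
        final Pattern p = Pattern.compile(pattern);
        return new Iterable<String>() {
            @Override
            public Iterator<String> iterator() {
                // A fresh Matcher per iterator, so the Iterable can be traversed more than once.
                final Matcher m = p.matcher(source);
                return new Iterator<String>() {
                    private boolean searched = false; // has find() run for the pending element?
                    private boolean found = false;    // cached result of that find()

                    @Override
                    public boolean hasNext() {
                        if (!searched) {
                            found = m.find();
                            searched = true;
                        }
                        return found;
                    }

                    @Override
                    public String next() {
                        if (!hasNext()) {
                            throw new NoSuchElementException();
                        }
                        searched = false; // the following call must search again
                        return m.group();
                    }

                    @Override
                    public void remove() {
                        throw new UnsupportedOperationException();
                    }
                };
            }
        };
    }
}
```

It is used the same way as the original helper, for example for (String match : CachingRegex.matches(source, pattern)) { ... }.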

2

Some of the API flaws mentioned by @tchrist have been fixed in Kotlin.


1

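Alireza, I know exactly how you feel! Regexes are confusing enough as it is, and all these syntax variations between implementations make them even more so. I also program in C# more than in Java and ran into the same problem.

I found this link very helpful: http://www.tusker.org/regex/regex_benchmark.html - it is a list of alternative regex implementations for Java, with benchmarks.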


1
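Please explain the link in a little more detail. If the link dies, your answer loses its context. - Tim Post
I'm not sure myself why this comment was downvoted. It looks like exactly the kind of thing I would want to see for all basic software libraries and modules: benchmarks and choices. - will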
