Java regex: matching strings between two colons

Question

4

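I'm trying to write a Java regex that finds every string between two : characters. If the string between the colons contains a space, a line break, or a tab, it should be ignored. Empty strings are ignored too. _ is fine! The group may or may not include the enclosing :.

Here are some tests and the expected result groups: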

"test :candidate: test" => ":candidate:"
"test :candidate: test:" => ":candidate:"
"test :candidate:_test:" => ":candidate:", ":_test:"
"test :candidate::test" => ":candidate:"
"test ::candidate: test" => ":candidate:"
"test :candidate_: :candidate: test" => ":candidate_:", ":candidate:"
"test :candidate_:candidate: test" => ":candidate_:", ":candidate:"

I have tested many regexes, and these almost work:

":(\\w+):"
":[^:]+:"

I still have a problem when two groups share a colon:

"test :candidate_: :candidate: test" => ":candidate_:", ":candidate:" // OK
"test :candidate_:candidate: test" => ":candidate_:" // ERROR! :(

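It looks like the first group "consumes" the second colon, so the matcher cannot find the second string I expect.

Can someone point me in the right direction to solve this? And could you explain in detail why the matcher "consumes" the colon?

Thanks.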
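
To make the "consuming" behaviour visible, here is a minimal sketch that records where each match of ":(\\w+):" starts and ends. The class name ConsumedColonDemo and the helper traceMatches are hypothetical names chosen for this illustration. Each call to Matcher.find() resumes at the end offset of the previous match, so a colon that closed one match is no longer available to open the next.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class ConsumedColonDemo {
    // Returns every match of ":(\\w+):" together with the half-open
    // offset range [start,end) that the match covers in the input.
    public static List<String> traceMatches(String input) {
        List<String> trace = new ArrayList<>();
        Matcher m = Pattern.compile(":(\\w+):").matcher(input);
        while (m.find()) {
            // The next find() starts scanning at m.end(), which lies
            // just past the closing colon of the current match.
            trace.add(m.group() + " [" + m.start() + "," + m.end() + ")");
        }
        return trace;
    }

    public static void main(String[] args) {
        for (String line : traceMatches("test :candidate_:candidate: test")) {
            System.out.println(line);
        }
    }
}
```

For the failing input this prints only ":candidate_: [5,17)". The shared colon sits at offset 16, inside the first match, so the search for the next match begins at offset 17 with "candidate: test", which has no opening colon in front of "candidate".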

- Vincent Durmont

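Have you considered handling the string differently? Instead of using a Matcher and the like, most of the work can be done with String#split. - Valentin Ruano

Yes, I have considered split(), but I would like to understand how to do it with a regex. If no regex solution turns up, I think I will go that way. - Vincent Durmont

Which regex method are you using? You need to capture each :-delimited region in a match group, right? - Patrick Collins

@PatrickCollins I need either both : or neither. - Vincent Durmont

3 Answers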

4

How about String.split()?

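import java.util.ArrayList;
import java.util.List;

// Matches any part that contains whitespace; (?s) lets . match line breaks too
String invalidChars = "(?s).*\\s.*";

String testStr = "test :candidate:_test:";
// limit -1 keeps trailing empty strings, so the first and last parts are
// always the text outside the outermost colons and can be skipped
String[] parts = testStr.split(":", -1);
List<String> results = new ArrayList<String>();
for (int i = 1; i < parts.length - 1; i++)
{
    String part = parts[i];
    if (part.isEmpty() || part.matches(invalidChars)) continue;
    results.add(part);
}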

results should contain candidate and _test.

- Jashaszun

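@VincentDurmont You should take a look at hwnd's answer. I haven't tested it, but it looks correct too. - Jashaszun

You could use String#matches to simplify the code (avoiding the inner for loop) when checking for invalid characters. - Valentin Ruano

I would replace .equals("") with .isEmpty(). - Valentin Ruano

@PatrickCollins Why do you say regex is not the right tool? (Just trying to understand how it works.) - Vincent Durmont

Although your question was addressed to Patrick Collins, I'll answer it too: I don't think a regex is needed here, because there is (in my opinion) a better solution: split. I would rather use something other than a regex because, in my experience, regexes can get very complicated, whereas the code in my answer stays easy to understand. - Jashaszun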

1

Clean the input with a regex replacement, then split; the whole job fits in one line:

String[] terms = input.replaceAll("(?s)^.*?:|:[^:]*$", "").split("(?s):([^:]*\\s[^:]*:)?");

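This handles all of your edge cases by:

stripping the head and tail of the input (including the leading/trailing colons)
splitting on a colon, optionally followed by junk and another colon
using the "dotall" flag (?s) so that it works across multiple lines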

Here's some test code:

import java.util.Arrays;

String[] inputs = {
        "foo:target1:bar",
        "foo:target1:target2:bar",
        "foo:target1:target2:target3:bar",
        "foo:target1:junk junk:target2:bar",
};
for (String input : inputs) {
    String[] terms = input.replaceAll("(?s)^.*?:|:[^:]*$", "").split("(?s):([^:]*\\s[^:]*:)?");
    System.out.println(Arrays.toString(terms));
}

Output:

[target1]
[target1, target2]
[target1, target2, target3]
[target1, target2]
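
To see what each half of the one-liner contributes, the two steps can be run separately. This is only a sketch: the class name OneLinerSteps and the helpers trim and terms are hypothetical names introduced here, while the two regexes are copied unchanged from the answer.

```java
import java.util.Arrays;

public class OneLinerSteps {
    // Step 1: remove everything up to and including the first colon,
    // and everything from the last colon to the end of the input.
    public static String trim(String input) {
        return input.replaceAll("(?s)^.*?:|:[^:]*$", "");
    }

    // Step 2: split on a colon; the optional group also swallows a
    // following term that contains whitespace, plus its closing colon.
    public static String[] terms(String input) {
        return trim(input).split("(?s):([^:]*\\s[^:]*:)?");
    }

    public static void main(String[] args) {
        String input = "foo:target1:junk junk:target2:bar";
        System.out.println(trim(input));
        System.out.println(Arrays.toString(terms(input)));
    }
}
```

For the last test input, step 1 leaves "target1:junk junk:target2", and step 2 treats ":junk junk:" as a single delimiter, which gives [target1, target2].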

- Bohemian

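I'm not sure this does what the OP wants, because target2 is ignored (or rather, it is part of .split(":(.*:)?"), so it gets consumed, leaving only target1 and target3). - Pshemo

@Pshemo Yes, that is what it does. The sample input is something this code doesn't handle. It does handle "foo:target1:junk:target2:bar" and "foo:target1:target2:bar". - Bohemian

If I understand the OP correctly, junkxxx is actually a valid value because it contains no whitespace, so in my view it should be included in the result (but I may be wrong; the OP has to judge this solution). - Pshemo

@Pshemo I knew this could be done in one line (see the edit). Thanks for the pointers :) - Bohemian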

- hwnd · Accepted Answer

5

Use a positive lookahead to capture the overlapping matches.

(?=(:\\w+:))

Note: you can access your match by referring to capture group #1.
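
As a minimal sketch of how this pattern might be used with java.util.regex: the class name OverlappingMatches and the helper findTokens are hypothetical, and only the regex comes from the answer. The lookahead itself matches the empty string, so no characters are consumed and the closing colon of one token can also open the next; the token text is read from group 1.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class OverlappingMatches {
    private static final Pattern TOKEN = Pattern.compile("(?=(:\\w+:))");

    // Collects every ":word:" token, including tokens that share a colon.
    public static List<String> findTokens(String input) {
        List<String> tokens = new ArrayList<>();
        Matcher m = TOKEN.matcher(input);
        while (m.find()) {
            // group(0) is empty because the lookahead is zero-width;
            // the captured token lives in group 1.
            tokens.add(m.group(1));
        }
        return tokens;
    }

    public static void main(String[] args) {
        System.out.println(findTokens("test :candidate_:candidate: test"));
        System.out.println(findTokens("test ::candidate: test"));
    }
}
```

For the input that failed in the question, this yields both :candidate_: and :candidate:.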

- hwnd

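I would prefer :\\w+(?=(:)), but this works too. +1, but you may want to state more explicitly that the matched part will be in group 1. - Pshemo

@VincentDurmont My solution requires combining the matches from group 0 and group 1 (if you also want the final : in the match). With this solution you only need group 1. However, if you don't want the : in the match, you can use (?<=:)\\w+(?=:) and take the result from group 0. - Pshemo

@pshemo Or there's always the one-line ninja version (below) :) - Bohemian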
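
The two variants mentioned in this comment could be sketched as follows. The class and method names (LookaroundVariants, withColons, withoutColons) are hypothetical; the patterns are the ones quoted in the comment. In both cases the closing colon is only looked at, never consumed, so it remains available as the opening colon of the next token.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class LookaroundVariants {
    // :\w+(?=(:)) : group 0 holds ":word", group 1 holds the closing colon.
    public static List<String> withColons(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile(":\\w+(?=(:))").matcher(input);
        while (m.find()) {
            out.add(m.group() + m.group(1)); // join group 0 and group 1
        }
        return out;
    }

    // (?<=:)\w+(?=:) : both colons are only asserted, so group 0 is the bare word.
    public static List<String> withoutColons(String input) {
        List<String> out = new ArrayList<>();
        Matcher m = Pattern.compile("(?<=:)\\w+(?=:)").matcher(input);
        while (m.find()) {
            out.add(m.group());
        }
        return out;
    }

    public static void main(String[] args) {
        String input = "test :candidate_:candidate: test";
        System.out.println(withColons(input));
        System.out.println(withoutColons(input));
    }
}
```

Which variant fits depends on whether the enclosing colons are wanted in the result; the question allows either.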