How to make balancing group capturing?

5

Suppose I have this text input:

 tes{}tR{R{abc}aD{mnoR{xyz}}}

I want to extract the following output:

 R{abc}
 R{xyz}
 D{mnoR{xyz}}
 R{R{abc}aD{mnoR{xyz}}}

Currently, I can only extract what is inside the { } groups, using the balancing group approach described on MSDN. Here is the pattern:

 ^[^{}]*(((?'Open'{)[^{}]*)+((?'Target-Open'})[^{}]*)+)*(?(Open)(?!))$

Does anyone know how to include the R{} and D{} in the output?

Which programming language are you using? - sp00m
I will be using it in C#, but I am testing with Expresso. So far it returns the same results. - sessionista
2
"R{abc}aD{mnoR{xyz}}" should be "R{R{abc}aD{mnoR{xyz}}}", shouldn't it? - Jerry
3 Answers

3
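I think a different approach is needed here. Once you match the first, larger group R{R{abc}aD{mnoR{xyz}}} (see my comment about the possible typo), you won't be able to get the subgroups inside it, because the regex has already consumed that text and cannot capture the individual R{ ... } groups.
So there has to be a way to capture without consuming, and the obvious one is a positive lookahead. From there, you can reuse the expression you had, with some changes to fit the new focus. I came up with: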
(?=([A-Z](?:(?:(?'O'{)[^{}]*)+(?:(?'-O'})[^{}]*?)+)+(?(O)(?!))))

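I renamed "Open" to "O" and removed the named capture for the closing brace, to make the pattern shorter and avoid noise in the matches. On regexhero.net (the only free .NET regex tester I know of so far), I got the following capture groups: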
1: R{R{abc}aD{mnoR{xyz}}}
1: R{abc}
1: D{mnoR{xyz}}
1: R{xyz}

Breakdown of the regex:

(?=                         # Opening positive lookahead
    ([A-Z]                  # Opening capture group and any uppercase letter (to match R & D)
        (?:                 # First non-capture group opening
            (?:             # Second non-capture group opening
                (?'O'{)     # Get the named opening brace
                [^{}]*      # Any non-brace
            )+              # Close of second non-capture group and repeat over as many times as necessary
            (?:             # Third non-capture group opening
                (?'-O'})    # Removal of named opening brace when encountered
                [^{}]*?     # Any other non-brace characters in case there are more nested braces
            )+              # Close of third non-capture group and repeat over as many times as necessary
        )+                  # Close of first non-capture group and repeat as many times as necessary for multiple side by side nested braces
        (?(O)(?!))          # Condition to prevent unbalanced braces
    )                       # Close capture group
)                           # Close positive lookahead
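The "capture without consuming" idea can be seen in isolation with a small Java sketch. This is not the .NET pattern above: java.util.regex has no balancing groups, so the hypothetical ONE_LEVEL pattern below only handles a single level of nesting, and the class and method names are made up for illustration. It compares an ordinary consuming match with the same pattern wrapped in a lookahead.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LookaheadCaptureDemo {
    // Assumption for this sketch: a letter-prefixed group with at most one nested group.
    static final String ONE_LEVEL = "[A-Z]\\{(?:[^{}]|[A-Z]\\{[^{}]*\\})*\\}";

    // Ordinary matching: the outer match consumes the inner group, so it is never reported.
    static List<String> consuming(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile(ONE_LEVEL).matcher(input);
        while (m.find()) {
            found.add(m.group());
        }
        return found;
    }

    // Lookahead matching: each match is zero-width, so the engine retries at every
    // position and group 1 can overlap with earlier captures.
    static List<String> viaLookahead(String input) {
        List<String> found = new ArrayList<>();
        Matcher m = Pattern.compile("(?=(" + ONE_LEVEL + "))").matcher(input);
        while (m.find()) {
            found.add(m.group(1));
        }
        return found;
    }

    public static void main(String[] args) {
        System.out.println(consuming("D{mnoR{xyz}}"));    // [D{mnoR{xyz}}]
        System.out.println(viaLookahead("D{mnoR{xyz}}")); // [D{mnoR{xyz}}, R{xyz}]
    }
}
```

The nested R{xyz} only shows up in the lookahead version, which is the same reason the .NET pattern above is wrapped in (?=( ... )).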

The following does not work in C#

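I actually wanted to see how this would work on the PCRE engine, since it supports recursive regex. I find that easier because I'm more familiar with it, and it gives a shorter regex :)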

(?=([A-Z]{(?:[^{}]|(?1))+}))

regex101 demo

(?=                    # Opening positive lookahead
    ([A-Z]             # Opening capture group and any uppercase letter (to match R & D)
        {              # Opening brace
            (?:        # Opening non-capture group
                [^{}]  # Matches non braces
            |          # OR
                (?1)   # Recurse first capture group
            )+         # Close non-capture group and repeat as many times as necessary
        }              # Closing brace
    )                  # Close of capture group
)                      # Close of positive lookahead
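On an engine with neither recursion nor balancing groups, the same recursive idea can be written by hand. The Java sketch below is only an illustration under that assumption; RecursiveGroupScan, matchGroup and findAll are hypothetical names. matchGroup plays the role of capture group 1 (a letter, an opening brace, one or more non-brace characters or nested groups, a closing brace), and findAll tries every start position the way the lookahead does.

```java
import java.util.ArrayList;
import java.util.List;

class RecursiveGroupScan {
    // Mirrors [A-Z]{(?:[^{}]|(?1))+} : returns the index just past the group
    // that starts at pos, or -1 if no balanced group starts there.
    static int matchGroup(String s, int pos) {
        if (pos + 1 >= s.length()) return -1;
        char letter = s.charAt(pos);
        if (letter < 'A' || letter > 'Z' || s.charAt(pos + 1) != '{') return -1;

        int i = pos + 2;
        int items = 0; // the '+' quantifier requires at least one item in the body
        while (i < s.length()) {
            int nestedEnd = matchGroup(s, i);          // the (?1) alternative
            if (nestedEnd != -1) {
                i = nestedEnd;
                items++;
            } else if (s.charAt(i) != '{' && s.charAt(i) != '}') {
                i++;                                   // the [^{}] alternative
                items++;
            } else {
                break;
            }
        }
        boolean closed = i < s.length() && s.charAt(i) == '}';
        return (items > 0 && closed) ? i + 1 : -1;
    }

    // Tries every start position, like the zero-width lookahead does.
    static List<String> findAll(String s) {
        List<String> found = new ArrayList<>();
        for (int pos = 0; pos < s.length(); pos++) {
            int end = matchGroup(s, pos);
            if (end != -1) found.add(s.substring(pos, end));
        }
        return found;
    }

    public static void main(String[] args) {
        // Prints the four groups in order of their start position.
        for (String g : findAll("tes{}tR{R{abc}aD{mnoR{xyz}}}")) {
            System.out.println(g);
        }
    }
}
```

For the input from the question this yields the groups in the same order as the capture list above: the outer R{...} first, then R{abc}, D{mnoR{xyz}} and R{xyz}.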

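Awesome! It works. In C#'s Regex class, the Matches method returns a collection of Match objects, and the result can be found in each of them via match.Groups[1].Value. - sessionista
@sessionista Yup. Glad I could help ^^ - Jerry
By the way, what reference did you use to break the expression down step by step? The comments are great. When I try to analyze it myself I still get lost, which means some concepts are still not clear to me. - sessionista
@sessionista I'm new to this myself and not yet familiar with how .NET handles nesting ^^; I just read the section and the documentation on the MSDN page you linked. It took me quite a while, and a lot of trial and error, to get to what I know now! - Jerry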

0
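In .NET regex, balancing groups give you precise control over what gets captured, and the .NET regex engine keeps the full capture history of each group (unlike most other flavors, which keep only the last capture of each group).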
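The contrast with other flavors can be seen in Java, for example, where a repeated group exposes only its final capture. A minimal sketch (the class and method names are hypothetical):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class LastCaptureOnly {
    // Group 1 is repeated by the '+', but java.util.regex keeps only its last value.
    static String lastLetter(String input) {
        Matcher m = Pattern.compile("(?:([A-Z])\\{)+").matcher(input);
        return m.find() ? m.group(1) : null;
    }

    public static void main(String[] args) {
        // The match covers "R{D{", yet only the final capture survives.
        System.out.println(lastLetter("R{D{x}}")); // D
    }
}
```

In .NET, the equivalent group would expose both values through its Captures collection, which is what makes the stack-like behaviour of balancing groups possible.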
The MSDN example is a bit too complicated. A simpler way to match nested structures is:
(?>
    (?<O>)\p{Lu}\{   # Push to the O stack, and match an upper-case letter and {
    |                # OR
    \}(?<-O>)        # Match } and pop from the stack
    |                # OR
    \p{Ll}           # Match a lower-case letter
)+
(?(O)(?!))        # Make sure the stack is empty

Or in one line:
(?>(?<O>)\p{Lu}\{|\}(?<-O>)|\p{Ll})+(?(O)(?!))

Working example on Regex Storm

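In your example it also matches the "tes" at the start of the string, but don't worry, we're not done yet.

With a small correction, we can also capture the occurrences between the R{...} pairs: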

(?>(?<O>)\p{Lu}\{|\}(?<Target-O>)|\p{Ll})+(?(O)(?!))

Each Match will have a Group called "Target", and each such Group will have one Capture per occurrence. These captures are all you need to look at.

Working example on Regex Storm - click the Table tab and examine the 4 captures of ${Target}
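The push/pop bookkeeping that the O stack performs can be imitated with an explicit stack, which may help when reading the pattern. The Java sketch below is an approximation under stated assumptions, not a port of the regex: BalancedStackScan and targets are hypothetical names, a '{' that is not preceded by an upper-case letter is tracked with a -1 marker so its '}' does not close a labeled group (the regex would simply stop matching there), and unclosed groups are silently dropped.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

class BalancedStackScan {
    static List<String> targets(String s) {
        Deque<Integer> open = new ArrayDeque<>(); // start indexes of open groups
        List<String> found = new ArrayList<>();
        for (int i = 0; i < s.length(); i++) {
            char c = s.charAt(i);
            if (c == '{') {
                // Like (?<O>)\p{Lu}\{ : remember where the letter-prefixed group began.
                boolean labeled = i > 0 && Character.isUpperCase(s.charAt(i - 1));
                open.push(labeled ? i - 1 : -1);
            } else if (c == '}' && !open.isEmpty()) {
                // Like \}(?<Target-O>) : pop, and capture from the remembered start to here.
                int start = open.pop();
                if (start >= 0) {
                    found.add(s.substring(start, i + 1));
                }
            }
        }
        return found;
    }

    public static void main(String[] args) {
        // Groups are reported in the order their closing brace is reached.
        for (String t : targets("tes{}tR{R{abc}aD{mnoR{xyz}}}")) {
            System.out.println(t);
        }
    }
}
```

Because a group is recorded when its closing brace is reached, the innermost groups come out first: R{abc}, R{xyz}, D{mnoR{xyz}}, R{R{abc}aD{mnoR{xyz}}}, which is the order asked for in the question.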



0
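I'm not sure a single regex can do what you need: these nested substrings always get in the way.
One solution could be the following algorithm (written in Java, but I guess translating it to C# shouldn't be hard):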
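import java.util.LinkedHashMap;
import java.util.LinkedList;
import java.util.List;
import java.util.Map;
import java.util.Map.Entry;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// The method below is meant to be placed inside a class of your choice.

/**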
 * Finds all matches (i.e. including sub/nested matches) of the regex in the input string.
 * 
 * @param input
 *          The input string.
 * @param regex
 *          The regex pattern. It has to target the most nested substrings. For example, given the following input string
 *          <code>A{01B{23}45C{67}89}</code>, if you want to catch every <code>X{*}</code> substrings (where <code>X</code> is a capital letter),
 *          you have to use <code>[A-Z][{][^{]+?[}]</code> or <code>[A-Z][{][^{}]+[}]</code> instead of <code>[A-Z][{].+?[}]</code>.
 * @param format
 *          The format must follow the <a href= "http://docs.oracle.com/javase/7/docs/api/java/util/Formatter.html#syntax" >format string
 *          syntax</a>. It will be given one single integer as argument, so it has to contain (and to contain only) a <code>%d</code> flag. The
 *          formatted placeholder must not occur anywhere in the input string. If <code>null</code>, <code>ééé%dèèè</code> will be used.
 * @return The list of all the matches of the regex in the input string.
 */
public static List<String> findAllMatches(String input, String regex, String format) {

    if (format == null) {
        format = "ééé%dèèè";
    }
    int counter = 0;
    Map<String, String> matches = new LinkedHashMap<String, String>();
    Pattern pattern = Pattern.compile(regex);
    Matcher matcher = pattern.matcher(input);

    // if a substring has been found
    while (matcher.find()) {
        // create a unique replacement string using the counter
        String replace = String.format(format, counter++);
        // store the relation "replacement string --> initial substring" in a queue
        matches.put(replace, matcher.group());
        String end = input.substring(matcher.end(), input.length());
        String start = input.substring(0, matcher.start());
        // replace the found substring by the created unique replacement string
        input = start + replace + end;
        // reiterate on the new input string (faking the original matcher.find() implementation)
        matcher = pattern.matcher(input);
    }

    List<Entry<String, String>> entries = new LinkedList<Entry<String, String>>(matches.entrySet());

    // for each relation "replacement string --> initial substring" of the queue
    for (int i = 0; i < entries.size(); i++) {
        Entry<String, String> current = entries.get(i);
        // for each relation that could have been found before the current one (i.e. more nested)
        for (int j = 0; j < i; j++) {
            Entry<String, String> previous = entries.get(j);
            // if the current initial substring contains the previous replacement string
            if (current.getValue().contains(previous.getKey())) {
                // replace the previous replacement string by the previous initial substring in the current initial substring
                current.setValue(current.getValue().replace(previous.getKey(), previous.getValue()));
            }
        }
    }

    return new LinkedList<String>(matches.values());
}

So, in your case:
String input = "tes{}tR{R{abc}aD{mnoR{xyz}}}";
String regex = "[A-Z][{][^{}]+[}]";
findAllMatches(input, regex, null);

Returns:
R{abc}
R{xyz}
D{mnoR{xyz}}
R{R{abc}aD{mnoR{xyz}}}
