Match all identifiers in a string

Question

3

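Problem:

I am looking for a way to match specific identifiers in a given line that starts with certain words. An ID consists of characters, possibly followed by digits, then a dash, then more digits. An ID should only be matched on a line whose starting word is one of the following: Closes, Fixes, Resolves. If a line contains more than one ID, the IDs are separated by the string and. Any number of IDs can be present on a line.

Example test strings: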

Closes PD-1                                           # Match: PD-1

Related to PD-2                                       # No match, line doesn't start with an allowed word

Closes                                                
NPD-1                                                 # No match, as the identifier is in a new line

Fixes PD-21 and PD-22                                 # Match: PD-21, PD-22

Closes PD-31, also PD-32 and PD-33                    # Match: PD-31 - the rest is not captured because of ", also"
Resolves PD4-41 and PD4-42 and PD4-43 and PD4-44      # Match: PD4-41, PD4-42, PD4-43, PD4-44

Resolves something related to N-2                     # No match, the identifier is not directly after 'Resolves'

What I tried:

Getting all the matches with a regex always falls short in one way or another. For example, this is one of the regexes I tried:

^(?:Closes|Fixes|Resolves) (\w+-\d+)(?:(?: and )(\w+-\d+))*

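I intended to use a non-capturing group to match lines that start with one of the allowed words, followed by a space: ^(?:Closes|Fixes|Resolves)
Then at least one ID has to follow the starting word, and I want to capture it: (\w+-\d+)
Finally, zero or more IDs can follow the first one, separated by the string and, but here I only want to capture the IDs, not the separator: (?:(?: and )(\w+-\d+))*

The result of this regex in Python: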

import re

test_string = """
Closes PD-1                                           # Match: PD-1
Related to PD-2                                       # No match, line doesn't start with an allowed word
Closes                                                
NPD-1                                                 # No match, as the identifier is in a new line
Fixes PD-21 and PD-22                                 # Match: PD-21, PD-22
Closes PD-31, also PD-32 and PD-33                    # Match: PD-31 - the rest is not captured because of ", also"
Resolves PD4-41 and PD4-42 and PD4-43 and PD4-44      # Match: PD4-41, PD4-42, PD4-43, PD4-44
Resolves something related to N-2                     # No match, the identifier is not directly after 'Resolves'
"""

ids = []

# Raw string, so that \w and \d reach the regex engine unchanged
for match in re.findall(r"^(?:Closes|Fixes|Resolves) (\w+-\d+)(?:(?: and )(\w+-\d+))*", test_string, re.M):
    for group in match:
        if group:
            ids.append(group)

print(ids)
['PD-1', 'PD-21', 'PD-22', 'PD-31', 'PD4-41', 'PD4-44']

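The result is also available on regex101.com. Unfortunately, when several IDs follow the first one, only the last of them is captured instead of all of them. As far as I understand, a repeated capture group only keeps its last iteration, and I should put a capture group around the repeated group to capture all iterations, but I could not get that to work.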
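The last-iteration behaviour described above can be reproduced on a single line. The following is a minimal sketch using the same pattern written in verbose mode; the name PATTERN and the sample line are illustrative assumptions, not part of the original post.

```python
import re

# Same pattern as in the question, spelled out with re.VERBOSE.
# In verbose mode a literal space has to be written as [ ].
PATTERN = re.compile(
    r"""
    ^(?:Closes|Fixes|Resolves)[ ]   # allowed starting word plus one space
    (\w+-\d+)                       # group 1: the first ID
    (?:[ ]and[ ](\w+-\d+))*         # group 2: repeated, so only the last iteration is kept
    """,
    re.VERBOSE | re.MULTILINE,
)

m = PATTERN.search("Resolves PD4-41 and PD4-42 and PD4-43")
print(m.groups())  # ('PD4-41', 'PD4-43')
```

PD4-42 is matched but overwritten in group 2 by the next iteration, which is why it never shows up in the result.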

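Summary:

Is there a way to solve this with a regex, similar to what I tried, but capturing every occurrence of the IDs? Or is there a better way to parse the IDs out of this string in Python?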

- Bence

3 Answers

1

A two-stage approach might be easier, for example:

import re

def get_matches(test):  # assume test is a list of strings
    regex1 = re.compile(r'^(?:Closes|Fixes|Resolves) \w+-\d+')
    regex2 = re.compile(r'\w+-\d+')
    results = []
    for line in test:
        if regex1.search(line):
            results.extend(regex2.findall(line))
    return results

This gives:

['PD-1','PD-21','PD-22','PD-31','PD-32', 
'PD-33','PD4-41','PD4-42','PD4-43','PD4-44']
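
For reference, here is a usage sketch that should reproduce the list above. It assumes the sample lines are passed in without their trailing # comments (otherwise the second pattern would also pick up the IDs mentioned inside the comments); the variable names lines and result are illustrative only.

```python
import re

def get_matches(test):  # assume test is a list of strings
    # Stage 1: the line must start with an allowed word directly followed by an ID
    regex1 = re.compile(r'^(?:Closes|Fixes|Resolves) \w+-\d+')
    # Stage 2: collect every ID on a line that passed stage 1
    regex2 = re.compile(r'\w+-\d+')
    results = []
    for line in test:
        if regex1.search(line):
            results.extend(regex2.findall(line))
    return results

# Sample lines from the question, comments stripped (an assumption of this sketch)
lines = [
    "Closes PD-1",
    "Related to PD-2",
    "Closes",
    "NPD-1",
    "Fixes PD-21 and PD-22",
    "Closes PD-31, also PD-32 and PD-33",
    "Resolves PD4-41 and PD4-42 and PD4-43 and PD4-44",
    "Resolves something related to N-2",
]

result = get_matches(lines)
print(result)
```

Note that PD-32 and PD-33 are included here, because the second stage collects every ID on a qualifying line regardless of the separator.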

- neutrino_logic

1

If you need repeated capture groups, install the PyPI regex module with pip install regex.

import regex

test_string = "your string here"
ids = []
for match in regex.finditer(r"^(?:Closes|Fixes|Resolves) (?P<id>\w+-\d+)(?:(?: and )(?P<id>\w+-\d+))*", test_string, regex.M):
    ids.extend(match.captures("id"))
print(ids)
# => ['PD-1', 'PD-21', 'PD-22', 'PD-31', 'PD4-41', 'PD4-42', 'PD4-43', 'PD4-44']

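See the Python demo.

The capture stack of each group can be accessed via match.captures(X).

The regex you currently have would work as it is, but using a named capture group here is more user-friendly.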

- Wiktor Stribiżew

- The fourth bird · Accepted Answer

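You can use a single capture group: inside it, match the first occurrence, then repeat the same pattern zero or more times, each repetition preceded by a space, the word and, and another space.

The values are in group 1.

To get the separate values, split on " and ".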

^(?:Closes|Fixes|Resolves) (\w+-\d+(?: and \w+-\d+)*)

Regex demo
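
The capture-then-split idea can be sketched in Python as follows. This is a minimal illustration that assumes the standard re module; the names ID_RUN, extract_ids and text are hypothetical and do not come from the answer.

```python
import re

# Group 1 captures the whole run of IDs joined by " and " (hypothetical name ID_RUN)
ID_RUN = re.compile(r"^(?:Closes|Fixes|Resolves) (\w+-\d+(?: and \w+-\d+)*)", re.M)

def extract_ids(text):
    """Return every ID from lines that start with an allowed word."""
    ids = []
    for run in ID_RUN.findall(text):
        # run looks like "PD-21 and PD-22"; splitting yields the separate IDs
        ids.extend(run.split(" and "))
    return ids

text = """\
Closes PD-1
Related to PD-2
Closes
NPD-1
Fixes PD-21 and PD-22
Closes PD-31, also PD-32 and PD-33
Resolves PD4-41 and PD4-42 and PD4-43 and PD4-44
Resolves something related to N-2
"""

ids = extract_ids(text)
print(ids)
# ['PD-1', 'PD-21', 'PD-22', 'PD-31', 'PD4-41', 'PD4-42', 'PD4-43', 'PD4-44']
```

Because the run stops at the first thing that is not " and " followed by an ID, the IDs after ", also" are left out, as the question requires.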