Match all identifiers in a string

3

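Question:

I am looking for a way to match specific identifiers on a line that starts with certain words. An ID consists of characters, possibly followed by digits, then a dash, then more digits. An ID should only be matched if the line starts with one of these words: Closes, Fixes, Resolves. If a line contains more than one ID, the IDs are separated by the string and. A line can contain any number of IDs.

Example test strings: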

Closes PD-1                                           # Match: PD-1

Related to PD-2                                       # No match, line doesn't start with an allowed word

Closes                                                
NPD-1                                                 # No match, as the identifier is in a new line

Fixes PD-21 and PD-22                                 # Match: PD-21, PD-22

Closes PD-31, also PD-32 and PD-33                    # Match: PD-31 - the rest is not captured because of ", also"
Resolves PD4-41 and PD4-42 and PD4-43 and PD4-44      # Match: PD4-41, PD4-42, PD4-43, PD4-44

Resolves something related to N-2                     # No match, the identifier is not directly after 'Resolves'

What I tried:

Getting all the matches with a regex always fell short in one way or another. For example, this is one of the regexes I tried:

^(?:Closes|Fixes|Resolves) (\w+-\d+)(?:(?: and )(\w+-\d+))*

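  1. I intend to use a non-capturing group to match lines that start with one of the allowed words, followed by a space: ^(?:Closes|Fixes|Resolves)
  2. Then at least one ID has to follow the starting word, and I intend to capture it: (\w+-\d+)
  3. Finally, zero or more IDs may follow the first one, separated by the string and, but I only want to capture the IDs here, not the separator: (?:(?: and )(\w+-\d+))*

The result of this regex in Python: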

import re

test_string = """
Closes PD-1                                           # Match: PD-1
Related to PD-2                                       # No match, line doesn't start with an allowed word
Closes                                                
NPD-1                                                 # No match, as the identifier is in a new line
Fixes PD-21 and PD-22                                 # Match: PD-21, PD-22
Closes PD-31, also PD-32 and PD-33                    # Match: PD-31 - the rest is not captured because of ", also"
Resolves PD4-41 and PD4-42 and PD4-43 and PD4-44      # Match: PD4-41, PD4-42, PD4-43, PD4-44
Resolves something related to N-2                     # No match, the identifier is not directly after 'Resolves'
"""

ids = []

for match in re.findall(r"^(?:Closes|Fixes|Resolves) (\w+-\d+)(?:(?: and )(\w+-\d+))*", test_string, re.M):
    for group in match:
        if group:
            ids.append(group)

print(ids)
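['PD-1', 'PD-21', 'PD-22', 'PD-31', 'PD4-41', 'PD4-44']

The result is also available on regex101.com. Unfortunately, when more than one ID follows the first ID, only the last of them is captured rather than all of them. As far as I understand, a repeated capture group only captures its last iteration, and I should put a capture group around the repeated group to capture all iterations, but I could not get that to work.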
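
The last-iteration behaviour described above can be reproduced in isolation with the standard re module. This is a minimal sketch that applies the pattern from the question to a single line; the variable names are only illustrative.

```python
import re

line = "Resolves PD4-41 and PD4-42 and PD4-43 and PD4-44"
pattern = r"^(?:Closes|Fixes|Resolves) (\w+-\d+)(?: and (\w+-\d+))*"

match = re.match(pattern, line)

# Group 2 sits inside a repeated group, so every iteration overwrites the
# previous capture and only the final iteration is kept.
first_id = match.group(1)
last_repeated_id = match.group(2)
print(first_id, last_repeated_id)  # PD4-41 PD4-44
```

The two IDs in the middle are matched but never stored, which is why they are missing from the findall result.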

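Summary:

Is there a way to solve this with a regex, similar to what I tried, but capturing every occurrence of an ID? Or is there a better way to parse the IDs out of this string in Python?

3 Answers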

2
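You can use a single capture group that matches the first occurrence and then repeats the same pattern zero or more times, each time preceded by a space, and, and another space.
The value is in group 1.
To get the separate values, split on and surrounded by spaces.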
^(?:Closes|Fixes|Resolves) (\w+-\d+(?: and \w+-\d+)*)

Regex demo
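
A minimal sketch of how the pattern above could be used from Python with the standard re module. The function name extract_ids and the sample text are illustrative assumptions, not part of the answer.

```python
import re

# Pattern from the answer: group 1 holds the whole "ID and ID and ..." run.
PATTERN = re.compile(
    r"^(?:Closes|Fixes|Resolves) (\w+-\d+(?: and \w+-\d+)*)", re.M
)

def extract_ids(text):
    """Return every ID from lines that start with an allowed word."""
    ids = []
    for run in PATTERN.findall(text):
        # Each run looks like "PD-21 and PD-22"; split it into single IDs.
        ids.extend(run.split(" and "))
    return ids

sample = (
    "Closes PD-1\n"
    "Related to PD-2\n"
    "Fixes PD-21 and PD-22\n"
    "Closes PD-31, also PD-32 and PD-33\n"
    "Resolves PD4-41 and PD4-42 and PD4-43 and PD4-44\n"
)
print(extract_ids(sample))
# ['PD-1', 'PD-21', 'PD-22', 'PD-31', 'PD4-41', 'PD4-42', 'PD4-43', 'PD4-44']
```

Because the pattern has a single capture group, findall returns plain strings instead of tuples, so no filtering of empty groups is needed.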


1
@Bence If you want to get the matches in a capture group, you can use the PyPI regex module https://pypi.org/project/regex/ with the pattern (?:^(?:Closes|Fixes|Resolves)(?= \w+-\d+)|\G(?!^) (\w+-\d+)(?: and)?) https://regex101.com/r/DeVNCZ/1 - The fourth bird
1
@Bence A Python demo that uses the regex module: https://rextester.com/VJESC20804 - The fourth bird

1
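A two-stage approach may be easier, for example:
import re

def get_matches(test):  # assume test is a list of strings
    # Stage 1: the line must start with an allowed word followed by an ID
    regex1 = re.compile(r'^(?:Closes|Fixes|Resolves) \w+-\d+')
    # Stage 2: collect every ID anywhere on a qualifying line
    regex2 = re.compile(r'\w+-\d+')
    results = []
    for line in test:
        if regex1.search(line):
            results.extend(regex2.findall(line))
    return results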

This gives:
['PD-1','PD-21','PD-22','PD-31','PD-32', 
'PD-33','PD4-41','PD4-42','PD4-43','PD4-44']
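
Note that this output includes PD-32 and PD-33, which the question wants excluded because of ", also". A hedged variant of the same two-stage idea restricts the second stage to the run of IDs directly after the starting word. The helper name get_leading_ids is hypothetical, and the input lines are assumed to carry no trailing comments.

```python
import re

# Stage 1: an allowed word followed by a run of IDs joined by " and ".
LEADING_RUN = re.compile(
    r"^(?:Closes|Fixes|Resolves) (\w+-\d+(?: and \w+-\d+)*)"
)
# Stage 2: a single ID inside that run.
SINGLE_ID = re.compile(r"\w+-\d+")

def get_leading_ids(lines):
    """Collect IDs only from the run right after the starting word."""
    results = []
    for line in lines:
        run = LEADING_RUN.match(line)
        if run:
            results.extend(SINGLE_ID.findall(run.group(1)))
    return results

lines = [
    "Closes PD-1",
    "Closes PD-31, also PD-32 and PD-33",
    "Resolves something related to N-2",
]
print(get_leading_ids(lines))  # ['PD-1', 'PD-31']
```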

1
If you need to use repeated capture groups, install the PyPI regex module with pip install regex and use it like this:
import regex

test_string = "your string here"
ids = []
for match in regex.finditer(r"^(?:Closes|Fixes|Resolves) (?P<id>\w+-\d+)(?:(?: and )(?P<id>\w+-\d+))*", test_string, regex.M):
    ids.extend(match.captures("id"))
print(ids)
# => ['PD-1', 'PD-21', 'PD-22', 'PD-31', 'PD4-41', 'PD4-42', 'PD4-43', 'PD4-44']

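See the Python demo.

The capture stack of each group can be accessed via match.captures(X).

The regex you already have works as it is, but using a named capture group here is more user-friendly.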

