Python re: finding the start and end indices of a group match

Question

5

Python's re match objects have .start() and .end() methods. I want to find the start and end indices of a group match. How can I do this? For example:

>>> import re
>>> REGEX = re.compile(r'h(?P<num>[0-9]{3})p')
>>> test = "hello h889p something"
>>> match = REGEX.search(test)
>>> match.group('num')
'889'
>>> match.start()
6
>>> match.end()
11
>>> match.group('num').start()                  # just trying this. Didn't work
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
AttributeError: 'str' object has no attribute 'start'
>>> REGEX.groupindex
mappingproxy({'num': 1})                        # this is the index of the group in the regex, not the index of the group match, so not what I'm looking for.

The expected output is (7, 10).

- Neil

5 Answers

3

One workaround for the given example is to use lookarounds:

import re
REGEX = re.compile(r'(?<=h)[0-9]{3}(?=p)')
test = "hello h889p something"
match = REGEX.search(test)
print(match)

Output

<re.Match object; span=(7, 10), match='889'>

- The fourth bird

2

You can use string indexing and the index() method:

>>> import re
>>> REGEX = re.compile(r'h(?P<num>[0-9]{3})p')
>>> test = "hello h889p something"
>>> match = REGEX.search(test)
>>> test.index(match.group('num')[0])
7
>>> test.index(match.group('num')[-1])
9

If you want the result as a tuple:

>>> str_match = match.group("num")
>>> results = (test.index(str_match[0]), test.index(str_match[-1]))
>>> results
(7, 9)

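Note: as Tom points out, you may want to use results = (test.index(str_match), test.index(str_match) + len(str_match)) to avoid errors caused by repeated characters in the string. For example, if the number were 899, results would be (7, 8), because the first 9 is at index 8.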
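
The repeated-character problem described in the note can be reproduced with a small sketch. The input string with 899 is a made-up variation of the question's example, and the names per_char and whole are only illustrative:

```python
import re

REGEX = re.compile(r'h(?P<num>[0-9]{3})p')
test = "hello h899p something"   # hypothetical input: 899 instead of 889
match = REGEX.search(test)
str_match = match.group("num")

# Per-character lookup: finds the first '8' and the first '9' in the whole string
per_char = (test.index(str_match[0]), test.index(str_match[-1]))

# Whole-substring lookup: start of '899', plus its length for an exclusive end
start = test.index(str_match)
whole = (start, start + len(str_match))

print(per_char)  # (7, 8)
print(whole)     # (7, 10)
```

The whole-substring version returns an exclusive end index, which matches the (7, 10) the question asks for.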

- Jacob Lee

2

You can pass the group name to Match.start (and Match.end) to get the group's start (and end) position:

>>> import re
>>> REGEX = re.compile(r'h(?P<num>[0-9]{3})p')
>>> test = "hello h889p something"
>>> match = REGEX.search(test)
>>> match.start('num')
7
>>> match.end('num')
10

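The advantage of this approach over the str.index approach suggested in other answers is that you won't run into problems if the group's string occurs more than once.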
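
As a rough illustration of that difference, the sketch below uses a made-up input in which "889" also appears before the real match. The variable names by_index and by_match are only for illustration:

```python
import re

REGEX = re.compile(r'h(?P<num>[0-9]{3})p')
test = "889 hello h889p something"   # hypothetical input: '889' occurs twice
match = REGEX.search(test)

# str.index returns the first occurrence of '889', which is not the matched group
by_index = test.index(match.group('num'))

# Match.start / Match.end report where the group actually matched
by_match = (match.start('num'), match.end('num'))

print(by_index)  # 0
print(by_match)  # (11, 14)
```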

- jfschaefer

1

You are looking for re.Match.span.

>>> import re
>>> m = re.match("a(?P<num>[0-9]{3})a", "a123a")
>>> m.span("num")
(1, 4)
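
A slightly longer sketch of span, assuming a made-up pattern with an optional group: span accepts either a group name or a group number, and for a group that did not take part in the match it returns (-1, -1).

```python
import re

# Hypothetical pattern: a lowercase word optionally followed by digits
pattern = re.compile(r'(?P<word>[a-z]+)(?P<num>[0-9]+)?')
text = "abc123 xyz"

# Collect the (start, end) span of each named group for every match
spans = [(m.span('word'), m.span('num')) for m in pattern.finditer(text)]
print(spans)  # [((0, 3), (3, 6)), ((7, 10), (-1, -1))]

# A group can also be addressed by its number instead of its name
first = pattern.search(text)
print(first.span(1) == first.span('word'))  # True
```

Checking for (-1, -1) before slicing avoids silently using a wrong range when an optional group is absent.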

- kenballus

- Tom · Accepted Answer

A slight modification of an existing answer: use index to find the whole group rather than the group's first and last characters:

import re
REGEX = re.compile(r'h(?P<num>[0-9]{3})p')
test = "hello h889p something"
match = REGEX.search(test)
group = match.group('num')

# modification here to find the start point
idx = test.index(group)

# find the end point using len of group
output = (idx, idx + len(group)) #(7, 10)

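This checks for the whole string "889" when determining the index, so it is somewhat less error-prone than checking only for the first 8 and the first 9. It is still not perfect, though: for example, it gives the wrong result if "889" appears earlier in the string without being surrounded by "h" and "p".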