Multiline regex matching: retrieving line numbers and matches

Question

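I am trying to iterate over all the lines in a file to match a pattern that may:

occur anywhere in the file
occur multiple times in the same file
occur multiple times on the same line
span multiple lines for a single match of the regex pattern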

An example input would be:

new File()
new
File()
there is a new File()
new
    
    
    
File()
there is not a matching pattern here File() new
new File() test new File() occurs twice on this line

An example of the desired output would be:

new File() Found on line 1  
new File() Found on lines 2 & 3 
new File() Found on line 4 
new File() Found on lines 5 & 9 
new File() Found on line 11
new File() Found on line 11 
6 occurrences of new File() pattern in test.txt (Filename)

The regex pattern could look like this:

pattern = r'new\s+File\s*\({1}\s*\){1}'

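Looking at the documentation, I can see that match, findall and finditer all return matches at the beginning of the string, but I don't see how to use the search function, which looks for the regex anywhere in the string, when the text we are searching for spans multiple lines (the fourth requirement above).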
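As a quick check of the behaviour discussed above, here is a minimal sketch (the `sample` string is a made-up input, not taken from the question). In Python's `re` module only `re.match` is anchored at position 0; `re.search`, `re.findall` and `re.finditer` scan the whole string, and `\s` also matches newline characters, so a single match can span lines without any flags.

```python
import re

pattern = r'new\s+File\s*\(\s*\)'
sample = "x = new\n   File()\ny = new File()"  # hypothetical sample text

# re.match only tries the pattern at position 0, so it finds nothing here
at_start = re.match(pattern, sample)

# re.search returns the first match anywhere in the string
first = re.search(pattern, sample)

# re.finditer yields every non-overlapping match; \s+ also consumes newlines
spans = [m.span() for m in re.finditer(pattern, sample)]

print(at_start)
print(repr(first.group()))
print(spans)
```

The first match covers the newline and the indentation between `new` and `File()`, which is the multi-line case from the fourth requirement.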

It is easy to match multiple occurrences of the regex on a single line like this:

Example input:

line = "new File() new File()"

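Code:

import re

matches = []
while line:
    matchObj = re.search(r"new\s+File\s*\({1}\s*\){1}", line, re.MULTILINE | re.DOTALL)
    if not matchObj:
        # stop once nothing is left to match; otherwise the loop never ends
        break
    # continue searching in the text after the current match
    line = line[matchObj.end():]
    matches.append(matchObj.group())

print(matches)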

This prints the following matches (without line numbers etc. for now):

['new File()', 'new File()']

Is there a way to use Python's regex to achieve what I want?

- Michael Heneghan

2 Answers

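You could first find all \n characters in the text and their character indices. Since each \n... well... starts a new line, the position of each value in this list is the number of the line that the \n terminates. Then search for all occurrences of the pattern and use that list to look up the lines on which each match starts and ends...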

import re
import bisect

text = """new 
File()
aa new File()
new
File()
there is a new File() and new
File() again
new
    
    
    
File()
there is not a matching pattern here File() new
new File() test new File() occurs twice on this line
"""

# character indices of all \n characters in text
nl = [m.start() for m in re.finditer("\n", text, re.MULTILINE|re.DOTALL)]

matches = list(re.finditer(r"(new\s+File\(\))", text, re.MULTILINE|re.DOTALL))
match_count = 0
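for m in matches:
    match_count += 1
    # 0-based line numbers of the first and last character of the match
    r = range(bisect.bisect(nl, m.start()-1), bisect.bisect(nl, m.end()-1)+1)
    # the fourth positional argument of re.sub is count, not flags, so re.DOTALL is dropped here
    print(re.sub(r"\s+", " ", m.group(1)), "found on line(s)", *r)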
print(f"{match_count} occurrences of new File() found in file....")

Output (line numbers are 0-based):

new File() found on line(s) 0 1
new File() found on line(s) 2
new File() found on line(s) 3 4
new File() found on line(s) 5
new File() found on line(s) 5 6
new File() found on line(s) 7 8 9 10 11
new File() found on line(s) 13
new File() found on line(s) 13
8 occurrences of new File() found in file....
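The numbers above count from 0, whereas the question counts from 1. The following is a minimal sketch of the same bisect lookup shifted to 1-based numbering; the helper name `line_span` and the shortened sample text are assumptions made for illustration, not part of the answer.

```python
import bisect
import re

def line_span(newline_offsets, start, end):
    """Return (first_line, last_line), 1-based, for the half-open span [start, end)."""
    # number of newlines strictly before an index = 0-based line of that index
    first = bisect.bisect_left(newline_offsets, start) + 1
    last = bisect.bisect_left(newline_offsets, end - 1) + 1
    return first, last

text = "new File()\nnew\nFile()\nthere is a new File()\n"  # shortened sample
newline_offsets = [i for i, ch in enumerate(text) if ch == "\n"]

results = [line_span(newline_offsets, m.start(), m.end())
           for m in re.finditer(r"new\s+File\(\)", text)]
print(results)
```

Because the newline offsets are collected once and each lookup is a binary search, this approach should stay cheap even when a large file contains many matches.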

- mrxra

Note that re.MULTILINE|re.DOTALL is redundant here, since the pattern contains no ., ^ or $ whose behavior these flags would modify. - Wiktor Stribiżew
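A small sketch of the point made in this comment, using a made-up sample string: `re.DOTALL` only changes what `.` matches and `re.MULTILINE` only changes where `^` and `$` match, so a pattern that uses none of them should give identical results with or without the flags.

```python
import re

pattern = r"new\s+File\(\)"
text = "new\nFile()\naa new File()\n"  # hypothetical sample text

# same search, once without flags and once with both flags set
plain = [m.span() for m in re.finditer(pattern, text)]
flagged = [m.span() for m in re.finditer(pattern, text, re.MULTILINE | re.DOTALL)]

print(plain == flagged)
```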

- Wiktor Stribiżew · Accepted Answer

You can count the newlines before the match, then count the newlines inside the matched text, and combine the two to get the line numbers:

import re
s='new File()\nnew\nFile()\nthere is a new File()\nnew\n \n \n \nFile()\nthere is not a matching pattern here File() new\nnew File() test new File() occurs twice on this line'
pattern = r'new\s+File\s*\(\s*\)'
for m in re.finditer(pattern, s):
    linenums = [s[:m.start()].count('\n') + 1]
    for _ in range(m.group().count('\n')):
        linenums.append(linenums[-1] + 1)
    print('{} Found on line {}'.format(re.sub(r'\s+', ' ', m.group()), ", ".join(map(str,linenums))))

See the online Python demo.

Output:

new File() Found on line 1
new File() Found on line 2, 3
new File() Found on line 4
new File() Found on line 5, 6, 7, 8, 9
new File() Found on line 11
new File() Found on line 11
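To get closer to the report format requested in the question (first and last line plus a total count), the same newline-counting idea could be wrapped as sketched below. The names `find_with_lines` and `report` are hypothetical, and the running line counter is a variation on the answer's approach that avoids recounting from the start of the string for every match.

```python
import re

PATTERN = re.compile(r'new\s+File\s*\(\s*\)')

def find_with_lines(text, pattern=PATTERN):
    """Yield (normalized_match, first_line, last_line) with 1-based line numbers."""
    line = 1  # line number of the character at index `pos`
    pos = 0
    for m in pattern.finditer(text):
        # only count the newlines between the previous match start and this one
        line += text.count('\n', pos, m.start())
        pos = m.start()
        last = line + m.group().count('\n')
        yield re.sub(r'\s+', ' ', m.group()), line, last

def report(text, name):
    """Print one line per match plus a total; return the number of matches."""
    count = 0
    for found, first, last in find_with_lines(text):
        count += 1
        where = f"line {first}" if first == last else f"lines {first} & {last}"
        print(f"{found} Found on {where}")
    print(f"{count} occurrences of new File() pattern in {name}")
    return count

sample = ('new File()\nnew\nFile()\nthere is a new File()\nnew\n \n \n \nFile()\n'
          'there is not a matching pattern here File() new\n'
          'new File() test new File() occurs twice on this line')
total = report(sample, "test.txt")
```

For a real file, the text would first be read in full, for example with `open(path).read()`, since a match that spans lines cannot be found by searching one line at a time.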