Matching a multiline regular expression in a file object

Question

12

How can I extract the groups of this regular expression from a file object (data.txt)?

import numpy as np
import re
import os
ifile = open("data.txt",'r')

# Regex pattern
pattern = re.compile(r"""
                ^Time:(\d{2}:\d{2}:\d{2})   # Time: 12:34:56 at beginning of line
                \r{2}                       # Two carriage return
                \D+                         # 1 or more non-digits
                storeU=(\d+\.\d+)
                \s
                uIx=(\d+)
                \s
                storeI=(-?\d+.\d+)
                \s
                iIx=(\d+)
                \s
                avgCI=(-?\d+.\d+)
                """, re.VERBOSE | re.MULTILINE)

time = [];

for line in ifile:
    match = re.search(pattern, line)
    if match:
        time.append(match.group(1))

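The problem with the last part of the code is that I iterate line by line, which obviously cannot work with a multiline regex. I tried using pattern.finditer(ifile) like this: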

for match in pattern.finditer(ifile):
    print match

I just wanted to see whether it works, but the finditer method requires a string or buffer.

I also tried this approach, but could not get it to work.

matches = [m.groups() for m in pattern.finditer(ifile)]

Any ideas?

After the comments from Mike and Tuomas, I was told to use .read(). Like this:

ifile = open("data.txt",'r').read()

That works, but is this the right way to search the file? I can't get it to work...

for i in pattern.finditer(ifile):
    match = re.search(pattern, i)
    if match:
        time.append(match.group(1))

Solution

# Open file as file object and read to string
ifile = open("data.txt",'r')

# Read file object to string
text = ifile.read()

# Close file object
ifile.close()

# Regex pattern
pattern_meas = re.compile(r"""
                ^Time:(\d{2}:\d{2}:\d{2})   # Time: 12:34:56 at beginning of line
                \n{2}                       # Two newlines
                \D+                         # 1 or more non-digits
                storeU=(\d+\.\d+)           # Decimal-number
                \s
                uIx=(\d+)                   # Fetch uIx-variable
                \s
                storeI=(-?\d+\.\d+)         # Fetch storeI-variable
                \s
                iIx=(\d+)                   # Fetch iIx-variable
                \s
                avgCI=(-?\d+\.\d+)          # Fetch avgCI-variable
                """, re.VERBOSE | re.MULTILINE)

file_times = open("output_times.txt","w")
for match in pattern_meas.finditer(text):
    output = "%s,\t%s,\t\t%s,\t%s,\t\t%s,\t%s\n" % (match.group(1), match.group(2), match.group(3), match.group(4), match.group(5), match.group(6))
    file_times.write(output)
file_times.close()

It could probably be written more compactly and in a more Pythonic way.

- williamx

2

Are you sure \r is the right character for the line break? Is your computer a pre-OS X Mac? Try \n or (\r?\n). - Tim Pietzcker
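A minimal sketch of the line-ending suggestion above, assuming hypothetical sample strings rather than the asker's real data.txt. It shows one pattern accepting both Unix (\n) and Windows (\r\n) line breaks:

```python
import re

# Hypothetical samples: the same record with Unix and Windows line endings
unix_text = "Time:12:34:56\n\nvalues"
windows_text = "Time:12:34:56\r\n\r\nvalues"

# (?:\r?\n){2} accepts two line breaks in either convention
pattern = re.compile(r"^Time:(\d{2}:\d{2}:\d{2})(?:\r?\n){2}", re.MULTILINE)

unix_times = [m.group(1) for m in pattern.finditer(unix_text)]
windows_times = [m.group(1) for m in pattern.finditer(windows_text)]
print(unix_times, windows_times)
```

The group around \r?\n is non-capturing, so the numbering of the capturing groups in the rest of the pattern stays the same.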

Thanks! \n seems to give better results. - williamx

1

@william: there is match.groups(), which you can slice to skip the first item, or you can do match.group(1, 2, 3, 4, 5, 6). - SilentGhost
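To illustrate the comment above, here is a short sketch using a shortened two-group pattern and a hypothetical input line (both are assumptions, not the asker's data). It shows that match.groups() and match.group(1, 2) return the same tuple, which can feed %-formatting directly:

```python
import re

# Shortened pattern with two capturing groups, for illustration only
pattern = re.compile(r"storeU=(\d+\.\d+)\suIx=(\d+)")
match = pattern.search("storeU=1.25 uIx=7")  # hypothetical input line

all_groups = match.groups()      # tuple of every captured group
picked = match.group(1, 2)       # the same values, selected by index
row = "%s,\t%s\n" % all_groups   # a tuple can be passed straight to %
print(all_groups, picked)
```

With six groups, the long argument list in the solution's output line could be replaced by match.groups() in the same way.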

Add the re.DOTALL flag so that the wildcard also matches newlines. - schuess
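A small sketch of what re.DOTALL changes, assuming a hypothetical sample string and a simplified pattern that uses "." where the thread's pattern uses \D+:

```python
import re

text = "Time:12:34:56\n\nheader line\nstoreU=1.25"  # hypothetical sample

# Without DOTALL, "." stops at a newline, so the match cannot span lines
no_dotall = re.search(r"Time:(\S+).*?storeU=(\d+\.\d+)", text)

# With DOTALL, "." also matches "\n" and the match spans several lines
with_dotall = re.search(r"Time:(\S+).*?storeU=(\d+\.\d+)", text, re.DOTALL)

print(no_dotall, with_dotall.groups())
```

The pattern in the question uses \D+ rather than ".", and a newline is a non-digit, so that pattern can already cross line boundaries without this flag.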

3 Answers

2

times = [match.group(1) for match in pattern.finditer(ifile.read())]

finditer yields MatchObjects. If the regex does not match anything, times will be an empty list.

You could also modify the regex to use non-capturing groups for storeU, storeI, iIx and avgCI; pattern.findall would then contain only the matched times.
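The non-capturing variant could look roughly like this. It is a sketch with a shortened pattern and a made-up two-record sample, not the asker's full regex or data:

```python
import re

# Hypothetical sample with two records
text = (
    "Time:12:34:56\n\nlabel storeU=1.25 uIx=7\n"
    "Time:12:35:10\n\nlabel storeU=2.50 uIx=8\n"
)

pattern = re.compile(r"""
    ^Time:(\d{2}:\d{2}:\d{2})   # the only capturing group
    \n{2}                       # two newlines
    \D+                         # one or more non-digits
    storeU=(?:\d+\.\d+)         # (?:...) matches but does not capture
    \s
    uIx=(?:\d+)
    """, re.VERBOSE | re.MULTILINE)

# With exactly one capturing group, findall returns a list of strings
times = pattern.findall(text)
print(times)
```

If the pattern had several capturing groups, findall would return a list of tuples instead, which is why the other groups are made non-capturing here.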

Note: the variable name time could shadow the standard library module of the same name. times is a better choice.

- SilentGhost

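By checking match.group(n) (with n from 1 to 6) I get the correct results, which means the regex works. However, the expression you provided gives me no results, just an empty list. I tried it on a text string and it works, so the problem is most likely with ifile.read(). Any hints? - williamx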

@william: you need to post a sample of your subject string, probably in a separate question. - SilentGhost
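One possible cause of the empty list, not confirmed anywhere in this thread, is that the file object had already been consumed (by an earlier read() or a for loop over its lines). A second read() on the same object then returns an empty string. The sketch below uses a temporary file with hypothetical content to show the effect:

```python
import os
import re
import tempfile

# Write a small hypothetical data file to a temporary location
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Time:12:34:56\n")
    path = tmp.name

pattern = re.compile(r"^Time:(\d{2}:\d{2}:\d{2})", re.MULTILINE)

with open(path) as ifile:
    first = [m.group(1) for m in pattern.finditer(ifile.read())]
    # The file position is now at the end, so this read() returns ""
    second = [m.group(1) for m in pattern.finditer(ifile.read())]

os.remove(path)
print(first, second)
```

Reading the file once into a variable and reusing that string avoids the problem.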

1

Why not read the whole file into a buffer?

buffer = open("data.txt").read()

And then search with that?

- Tuomas Pelkonen

1

This looks like the right approach! But I am still having some trouble with the search... - williamx

This solution seems to have a problem with closing the file... though maybe that doesn't matter :) - williamx
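Regarding the closing concern: open("data.txt").read() keeps no reference to the file object, so it cannot be closed explicitly. A common alternative is a with block, sketched here against a temporary file with hypothetical content so the example runs on its own:

```python
import os
import tempfile

# Create a small hypothetical file to read back
with tempfile.NamedTemporaryFile("w", suffix=".txt", delete=False) as tmp:
    tmp.write("Time:12:34:56\n")
    path = tmp.name

# The with block closes the file when it exits, even if read() raises
with open(path) as ifile:
    text = ifile.read()

closed_after = ifile.closed  # True once the block has exited
os.remove(path)
print(repr(text), closed_after)
```

The string in text stays usable after the file is closed, so the regex search can happen outside the with block.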

- Mike · Accepted Answer

You can use ifile.read() to read the data from the file object into a string.