What is a better way to implement this regex search than looping over the whole file?

Question

3

I have a file like this:

#     BJD     K2SC-Flux EAPFlux   Err  Flag Spline
2457217.463564 5848.004 5846.670 6.764 0 0.998291
2457217.483996 6195.018 6193.685 6.781 1 0.998291
2457217.504428 6396.612 6395.278 6.790 0 0.998292
2457217.524861 6220.890 6219.556 6.782 0 0.998292
2457217.545293 5891.856 5890.523 6.766 1 0.998292
2457217.565725 5581.000 5579.667 6.749 1 0.998292
2457217.586158 5230.566 5229.232 6.733 1 0.998292
2457217.606590 4901.128 4899.795 6.718 0 0.998293
2457217.627023 4604.127 4602.793 6.700 0 0.998293

I need to find and count the lines where Flag = 1 (the 5th column). Here is how I do it:

foundlines=[]
c=0
import re
with open('examplefile') as f:
    for index, line in enumerate(f):
        try:
            found = re.findall(r' 1 ', line)[0]
            foundlines.append(index)
            print(line)
            c+=1
        except:
            pass
print(c)

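In the shell, I would just run grep " 1 " examplefile | wc -l, which is much shorter than the Python script above. The Python script works, but I would like to know whether there is a shorter, more compact way to do this. I prefer the brevity of the shell, so I am looking for something similar in Python.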

- zabop - we're hiring

1

Since the code works, you should consider posting it on [codereview.se]. That said, you clearly don't need a regex to find a 1 between spaces; if ' 1 ' in line is enough. - Wiktor Stribiżew

1

Most things in Python can be put on one line, but that severely hurts readability. Are you sure size is the only thing you care about? - Mast

2

If you like it short, stick with the shell. - Mast

Yes, I don't mind it being verbose if the alternative is a serious loss of readability. OK, I'll reconsider the shell implementation! - zabop

3 Answers

1

Your shell implementation can be even shorter: grep's -c option returns the count directly, so there is no need for the anonymous pipe and wc:

grep -c " 1 " examplefile

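Your shell code only gives the number of lines that match the pattern " 1 ", whereas your Python code additionally keeps a list of the indexes of the matching lines.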

If you only need the count, you can use sum with a generator expression or list comprehension, and no regex is needed either; a plain substring test with in (str.__contains__) is enough.

with open('examplefile') as f:
    count = sum(1 for line in f if ' 1 ' in line)
    print(count)

If you also want to keep the indexes, you can stick with your approach and just replace the re test with a str test:

count = 0
indexes = []
with open('examplefile') as f:
    for idx, line in enumerate(f):
        if ' 1 ' in line:
            count += 1
            indexes.append(idx)

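Also, a bare except is almost always a bad idea (at the very least use except Exception, which leaves out exceptions such as SystemExit and KeyboardInterrupt); catch only the exceptions you know may be raised.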
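As a minimal sketch of that advice: in the question's loop, the only exception the try block is expected to raise is the IndexError that comes from indexing an empty re.findall result, so that is the only one caught here. The function name count_flag_lines and the in-memory sample are illustrative assumptions, not part of the original code.

```python
import re

def count_flag_lines(lines):
    """Count lines containing ' 1 ', catching only the expected IndexError."""
    count = 0
    for line in lines:
        try:
            # re.findall returns an empty list when nothing matches,
            # so [0] raises IndexError for non-matching lines.
            re.findall(r' 1 ', line)[0]
        except IndexError:
            continue  # no match on this line; any other error propagates
        count += 1
    return count

# Two rows copied from the question's sample file.
sample = [
    "2457217.463564 5848.004 5846.670 6.764 0 0.998291\n",
    "2457217.483996 6195.018 6193.685 6.781 1 0.998291\n",
]
print(count_flag_lines(sample))
```

With a narrow except clause, an unexpected problem (a typo in a variable name, Ctrl-C) surfaces immediately instead of being silently swallowed.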

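Also, when parsing structured data you should use a dedicated tool, e.g. csv.reader with a space as the delimiter (line.split(' ') would also work in this case), and checking the field at index 4 is the safest approach (see Tomalak's answer). The ' 1 ' in line test gives misleading results if any other column contains 1.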
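A minimal sketch of the column-based check, under a few assumptions: the helper name flagged_indexes is hypothetical, lines starting with # are treated as comments, and line.split() with no argument is used instead of line.split(' ') so that runs of spaces do not produce empty fields.

```python
def flagged_indexes(lines, column=4, value="1"):
    """Return indexes of lines whose whitespace-separated field `column` equals `value`."""
    indexes = []
    for idx, line in enumerate(lines):
        if line.startswith("#"):
            continue  # skip the header/comment line
        fields = line.split()  # splits on any run of whitespace
        if len(fields) > column and fields[column] == value:
            indexes.append(idx)
    return indexes

# Header and three rows copied from the question's sample file.
sample = [
    "#     BJD     K2SC-Flux EAPFlux   Err  Flag Spline\n",
    "2457217.463564 5848.004 5846.670 6.764 0 0.998291\n",
    "2457217.483996 6195.018 6193.685 6.781 1 0.998291\n",
    "2457217.504428 6396.612 6395.278 6.790 0 0.998292\n",
]
indexes = flagged_indexes(sample)
print(len(indexes), indexes)
```

The same function works on an open file object, since iterating over a file yields its lines; the count is simply the length of the returned list.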

With the above in mind, here is the shell way of matching the 5th field, using awk:

awk '$5 == "1" {count+=1}; END{print count}' examplefile

- heemayl

"if ' 1 ' in line" is unreliable. - Tomalak

@Tomalak Agreed ;) I was really just following the example literally. - heemayl

1

Then you should at least warn about the risk of false positives. - Tomalak

1

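Shortest code

Under some specific preconditions, this is a very short version:

You only want to count occurrences, as the grep command does
Each line is guaranteed to contain only one " 1 "
" 1 " can only occur in the desired column
Your file fits easily in memory

Note that if these preconditions are not met, this may cause memory problems or return wrong results.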

print(open("examplefile").read().count(" 1 "))
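To illustrate how a violated precondition changes the result, here is a small sketch that compares the substring count with a check on the fifth field. The first row is hypothetical (its K2SC-Flux value was changed to exactly 1 for the demonstration); the second is copied from the question.

```python
# Hypothetical data: the first row has Flag 0 but a K2SC-Flux value of exactly 1.
text = (
    "2457217.463564 1 5846.670 6.764 0 0.998291\n"
    "2457217.483996 6195.018 6193.685 6.781 1 0.998291\n"
)

# Counts every occurrence of " 1 ", regardless of the column it is in.
substring_count = text.count(" 1 ")

# Counts only rows whose fifth whitespace-separated field is "1".
column_count = sum(1 for line in text.splitlines() if line.split()[4] == "1")

print(substring_count)
print(column_count)
```

The substring count includes the first row even though its Flag is 0, while the column check counts only the second row.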

Easy to use and versatile, slightly longer

Of course, if you intend to actually do something with these rows later, I recommend Pandas:

import pandas

df = pandas.read_table('test.txt', delimiter=" ",
                       comment="#",
                       names=['BJD', 'K2SC-Flux', 'EAPFlux', 'Err', 'Flag', 'Spline'])

Get all rows where Flag is 1:

flaggedrows = df[df.Flag == 1]

This returns:

            BJD  K2SC-Flux   EAPFlux    Err  Flag    Spline
1  2.457217e+06   6195.018  6193.685  6.781     1  0.998291
4  2.457218e+06   5891.856  5890.523  6.766     1  0.998292
5  2.457218e+06   5581.000  5579.667  6.749     1  0.998292
6  2.457218e+06   5230.566  5229.232  6.733     1  0.998292

To count them:

print(len(flaggedrows))

This returns 4

- chthonicdaemon

Is there a danger of false positives here, as mentioned here? https://dev59.com/g67la4cB1Zd3GeqPXh53#a_PqnYgBc1ULPQZFRyKt - zabop

@heemayl I have added the caveats and a better answer. - chthonicdaemon

- Tomalak · Accepted Answer

If you have CSV data, you can use the csv module:

import csv

with open('your file', 'r', newline='', encoding='utf8') as fp:
    rows = csv.reader(fp, delimiter=' ')

    # generator expression: rows are read lazily, so it must be
    # consumed while the file is still open
    errors = (row for row in rows if row[4] == '1')

    for error in errors:
        print(error)
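If only the count is needed, the same csv.reader approach can be combined with sum. This is a sketch under stated assumptions: io.StringIO stands in for the real file so the example runs on its own, rows whose first field starts with # are treated as comments, and skipinitialspace=True is added so that the runs of spaces in the header line do not produce empty fields.

```python
import csv
import io

# In-memory stand-in for the file, using rows from the question.
data = io.StringIO(
    "#     BJD     K2SC-Flux EAPFlux   Err  Flag Spline\n"
    "2457217.463564 5848.004 5846.670 6.764 0 0.998291\n"
    "2457217.483996 6195.018 6193.685 6.781 1 0.998291\n"
    "2457217.545293 5891.856 5890.523 6.766 1 0.998292\n"
)

# skipinitialspace=True ignores extra spaces right after a delimiter.
rows = csv.reader(data, delimiter=' ', skipinitialspace=True)

# Skip empty rows and the comment line, then test the fifth field.
count = sum(
    1 for row in rows
    if row and not row[0].startswith('#') and row[4] == '1'
)
print(count)
```

Because the reader is consumed lazily inside sum, the file is never loaded into memory as a whole.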