How do I split a text file based on comment blocks in Python?

Question

5

I wasted most of the morning failing to solve this simple problem. Using Python, I want to parse data files that look like this:

# This is an example comment line, it starts with a '#' character.
# There can be a variable number of comments between each data set.
# Comments "go with" the data set that comes after them.
# The first data set starts on the next line:
0.0 1.0
1.0 2.0
2.0 3.0
3.0 4.0

# Data sets are followed by variable amounts of white space.
# The second data set starts after this comment
5.0 6.0
6.0 7.0


# One more data set.
7.0 8.0
8.0 9.0

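I need Python code that parses the example above into three "blocks" and stores them as elements of a list. Each block can itself be stored as a list of lines, with or without the comment lines. A manual way to do it is: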

#! /usr/bin/env python

# Read in the data and separate it into rows
f=open("example")
rows = f.read().split('\n')
f.close()

# Do you haz teh codez?
datasets=[]
datasets.append(rows[0:8])
datasets.append(rows[9:13])
datasets.append(rows[15:18])

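I'm looking for a more general solution that supports a variable number and length of data sets. I've tried several catastrophes built out of non-Pythonic loops. I think it's best not to clutter my question with them; this is work, not "homework".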

- Douglas B. Staple

Will the data sets always be stored as strings? - Jordan Kaye

The data is raw text, but eventually I will parse it into floats. - Douglas B. Staple

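You know... looking at it again, I think that in the example I gave, the easiest approach would be to split on the blocks of white space between the data sets. - Douglas B. Staple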
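
The whitespace-based idea in the comment above can be sketched as follows. This is only an illustration, not code from the thread: the function name split_on_blank_lines and the shortened SAMPLE text are made up here, and the approach assumes that data sets are always separated by at least one blank line, which the question's format does not strictly promise.

```python
import re

# Shortened, hypothetical stand-in for the data file in the question.
SAMPLE = """\
# This is an example comment line, it starts with a '#' character.
# The first data set starts on the next line:
0.0 1.0
1.0 2.0

# The second data set starts after this comment
5.0 6.0
6.0 7.0


# One more data set.
7.0 8.0
8.0 9.0
"""

def split_on_blank_lines(text):
    """Split text into blocks separated by one or more blank lines."""
    # A separator is a newline followed by at least one whitespace-only line.
    chunks = re.split(r"\n\s*\n", text.strip())
    # Each block becomes a list of lines; comments stay with their data.
    return [chunk.splitlines() for chunk in chunks]

blocks = split_on_blank_lines(SAMPLE)
for block in blocks:
    print(block)
```

Comment lines can be filtered out of each block afterwards if they are not needed. If a comment block ever follows a data set without a blank line in between, this sketch would merge the two data sets, so the comment-based answers below are more robust.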

4 Answers

3

datasets = [[]]
with open('/tmp/spam.txt') as f:
  for line in f:
    if line.startswith('#'):
      if datasets[-1] != []:
        # we are in a new block
        datasets.append([])
    else:
      stripped_line = line.strip()
      if stripped_line:
        datasets[-1].append(stripped_line)

- wim

This does exactly what I want. - Douglas B. Staple

1

Glad to hear it. If you have numpy, I suggest trying np.loadtxt to parse your floats more easily. - wim
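
For readers without NumPy, the float conversion mentioned above can also be sketched in plain Python. The helper name to_floats and the sample input are assumptions made for illustration; the input is shaped like the lists of stripped line strings that the answer above produces.

```python
def to_floats(datasets):
    """Convert every whitespace-separated row of every data set to floats."""
    return [
        [[float(token) for token in row.split()] for row in block]
        for block in datasets
    ]

# Hypothetical input: a list of data sets, each a list of stripped lines.
datasets = [["0.0 1.0", "1.0 2.0"], ["5.0 6.0"]]
parsed = to_floats(datasets)
print(parsed)
```

float() raises ValueError on a malformed token, so bad rows surface immediately instead of being silently skipped.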

1

import pprint

with open("test.txt") as fh:
    codes = []
    codeblock = []

    for line in fh:
        stripped_line = line.strip()

        if not stripped_line:
            continue

        if stripped_line.startswith("#"):
            if codeblock:
                codes.append(codeblock)
                codeblock = []

        else:
            codeblock.append(stripped_line.split(" "))

    if codeblock:
        codes.append(codeblock)

pprint.pprint(codes)

Output:

[[['0.0', '1.0'], ['1.0', '2.0'], ['2.0', '3.0'], ['3.0', '4.0']],
 [['5.0', '6.0'], ['6.0', '7.0']],
 [['7.0', '8.0'], ['8.0', '9.0']]]

- Vikas

This works too, although I don't think it is as elegant as the other solutions. - Douglas B. Staple

-1

datasets = []
with open('example') as f:
    for line in f:
        if line.strip() and not line.startswith('#'):
            datasets.append(line.split())

- aychedee

That should be for line in f. - Fred Foo

1

This doesn't keep the data sets separate. @larsmans The for loop was also missing a colon. - Douglas B. Staple

Oops, I rushed through it and then had to go do some work. It's fixed now. - aychedee

- Fred Foo · Accepted Answer

Use groupby.

from itertools import groupby

def contains_data(ln):
    # just an example; there are smarter ways to do this
    return ln[0] not in "#\n"

with open("example") as f:
    datasets = [[ln.split() for ln in group]
                for has_data, group in groupby(f, contains_data)
                if has_data]
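
The question notes that comments "go with" the data set that follows them. As a possible variation on the groupby approach, the sketch below keeps each comment block attached to its data. The names classify and split_with_comments and the in-memory sample are assumptions for illustration and do not come from the thread.

```python
from itertools import groupby

def classify(line):
    """Label a line as 'blank', 'comment', or 'data'."""
    stripped = line.strip()
    if not stripped:
        return "blank"
    return "comment" if stripped.startswith("#") else "data"

def split_with_comments(lines):
    """Return a list of (comments, rows) pairs, one per data set."""
    blocks = []
    pending = []  # comment lines waiting for the data set that follows them
    for kind, group in groupby(lines, key=classify):
        if kind == "comment":
            pending.extend(ln.strip() for ln in group)
        elif kind == "data":
            rows = [ln.split() for ln in group]
            blocks.append((pending, rows))
            pending = []
        # Blank runs are skipped; comments seen before them stay pending.
    return blocks

# Hypothetical in-memory sample; an open file object works the same way.
sample = [
    "# first set\n",
    "0.0 1.0\n",
    "1.0 2.0\n",
    "\n",
    "# second set\n",
    "# more notes\n",
    "5.0 6.0\n",
]
blocks = split_with_comments(sample)
for comments, rows in blocks:
    print(comments, rows)
```

Because groupby only groups consecutive lines with the same label, a blank line in the middle of a data set would start a new block with an empty comment list.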