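I assume the data is aligned in the same columns in every record. I put the header row and a typical data row in two variables; you would read them from your file instead.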
>>> a = 'Column1           Column2             Column3             Column4'
>>> b = 'apple fruits      banana fruits       orange fruits       grapes fruits'
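Here i is a list of column start indices, initially empty, and inside is a flag that tells whether we are currently inside a column name.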
>>> i = []
>>> inside = False
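We walk over the characters of the header, counting them, and check whether each one is the first character of a column name.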
>>> for n, c in enumerate(a):
...     if c == ' ':
...         inside = False
...         continue
...     if not inside:
...         inside = True
...         i.append(n)
...
>>> i
[0, 18, 38, 58]
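As a cross-check, the same start indices can also be obtained with a regular expression instead of the character loop. This is only a sketch, not part of the approach above: the helper name column_starts is hypothetical, and it assumes that column names contain no spaces. Note that \S+ also treats tabs as separators, whereas the loop above only looks for the space character.

```python
import re

def column_starts(header):
    """Return the index at which each column name starts (hypothetical helper)."""
    # Every run of non-whitespace characters is taken to be one column name.
    return [m.start() for m in re.finditer(r'\S+', header)]

# Header rebuilt so that the names start at columns 0, 18, 38 and 58.
header = 'Column1' + ' ' * 11 + 'Column2' + ' ' * 13 + 'Column3' + ' ' * 13 + 'Column4'
starts = column_starts(header)
print(starts)  # [0, 18, 38, 58]
```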
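Now we have the indices where the columns start. In slice notation the start of the next column is also the end of the current one, so the only thing missing is the end of the last column, and for that slice notation lets us use the value None.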
>>> [b[j:k].rstrip() for j, k in zip(i,i[1:]+[None])]
['apple fruits', 'banana fruits', 'orange fruits', 'grapes fruits']
Of course, you need to apply the same indexing trick to every data line in the input file.
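Applying the trick to several lines could look like the sketch below. The names split_fixed and make_row are hypothetical helpers introduced only for this illustration, and the sample rows are padded to the start indices 0, 18, 38, 58 found above.

```python
def split_fixed(line, starts):
    """Cut line at the given start indices and strip the trailing padding."""
    # The end of each field is the start of the next one; None means "to the end".
    return [line[j:k].rstrip() for j, k in zip(starts, starts[1:] + [None])]

def make_row(*cells):
    # Illustration only: pad the cells to the widths implied by 0, 18, 38, 58.
    widths = (18, 20, 20, 0)
    return ''.join(c.ljust(w) for c, w in zip(cells, widths)) + '\n'

starts = [0, 18, 38, 58]
rows = [
    make_row('apple fruits', 'banana fruits', 'orange fruits', 'grapes fruits'),
    make_row('mango fruits', 'pineapple fruits', '', 'blackberry fruits'),
]
records = [split_fixed(row, starts) for row in rows]
for record in records:
    print(record)
```

An empty column simply comes back as an empty string, because the slice contains only padding.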
Side note: you may prefer to use the itertools.zip_longest function, which pads the shorter sequence with None, as follows.
[... for j, k in itertools.zip_longest(i, i[1:])]
You may want to store the pairs in a list once, to avoid instantiating the iterator again for every data line.
cached_indices = list(itertools.zip_longest(i, i[1:]))
for line in data:
    c1, c2, c3, c4 = [... for i, j in cached_indices]
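A runnable sketch of the cached variant follows. The names starts, cached_ranges, lines and rows are chosen for this illustration, and the two sample lines are hypothetical data padded to the column starts 0, 18, 38, 58.

```python
from itertools import zip_longest

starts = [0, 18, 38, 58]

# zip_longest pads the shorter input with None (its default fillvalue),
# so it yields the same pairs as zip(starts, starts[1:] + [None]).
cached_ranges = list(zip_longest(starts, starts[1:]))
print(cached_ranges)  # [(0, 18), (18, 38), (38, 58), (58, None)]

# Hypothetical sample lines, padded to the column starts above.
lines = [
    'apple fruits'.ljust(18) + 'banana fruits'.ljust(20)
    + 'orange fruits'.ljust(20) + 'grapes fruits',
    'mango fruits'.ljust(18) + 'pineapple fruits'.ljust(20)
    + ''.ljust(20) + 'blackberry fruits',
]

rows = []
for line in lines:
    # The list of pairs is reused for every line.
    c1, c2, c3, c4 = [line[j:k].rstrip() for j, k in cached_ranges]
    rows.append((c1, c2, c3, c4))
```

Building the list matters for correctness as well as speed: zip_longest returns an iterator, which would be exhausted after the first line if it were reused directly.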
I have tried to implement the suggestions I made in the comments below; this is my best effort...
$ cat fetch.py
from itertools import count
from io import StringIO
data = '''\
Column1           Column2             Column3             Column4
----------------------------------------------------------------------------
apple fruits      banana fruits       orange fruits       grapes fruits
mango fruits      pineapple fruits                        blackberry fruits
 blueberry fruits currant fruits papaya fruits
chico fruits peach fruits pear fruits
'''
f = StringIO(data)
header = next(f).rstrip()
next(f)  # skip the line of dashes
# a column starts where a non-space character follows a space
indices = [i for i, c0, c1 in zip(count(), ' '+header, header)
           if c0==' ' and c1!=' ']
ranges = list(zip(indices, indices[1:]+[None]))
for nl, line in enumerate(f, 3):
    if line == '\n': continue
    fields = [line[i:j] for i, j in ranges]
    # slices (f[:1], f[-1:]) instead of f[0], f[-1]: no IndexError on empty fields
    if any((f[:1]==' ' and f.rstrip()) or f[-1:] not in ' \n' for f in fields):
        print('Possible misalignment in line n.%d:'%nl)
        print('\t|'+header)
        print('\t|'+line.rstrip())
    else:
        print('Data Line n.%d:'%nl)
        fields = [field.rstrip() for field in fields]
        for nf, field in enumerate(fields, 1):
            print('\tField n.%d:\t%r'%(nf, field))
$ python3 fetch.py
Data Line n.3:
        Field n.1:      'apple fruits'
        Field n.2:      'banana fruits'
        Field n.3:      'orange fruits'
        Field n.4:      'grapes fruits'
Data Line n.4:
        Field n.1:      'mango fruits'
        Field n.2:      'pineapple fruits'
        Field n.3:      ''
        Field n.4:      'blackberry fruits'
Possible misalignment in line n.5:
        |Column1           Column2             Column3             Column4
        | blueberry fruits currant fruits papaya fruits
Possible misalignment in line n.6:
        |Column1           Column2             Column3             Column4
        |chico fruits peach fruits pear fruits
$
\t)? Also, it might help to post the code you use to fetch the data. - Aimery