How do I read only specific rows from a text file?

Question

4

I am trying to process data stored in a text file, test.dat, which looks like this:

-1411.85  2.6888   -2.09945   -0.495947   0.835799   0.215353   0.695579   
-1411.72  2.82683   -0.135555   0.928033   -0.196493   -0.183131   -0.865999   
-1412.53  0.379297   -1.00048   -0.654541   -0.0906588   0.401206   0.44239   
-1409.59  -0.0794765   -2.68794   -0.84847   0.931357   -0.31156   0.552622   
-1401.63  -0.0235102   -1.05206   0.065747   -0.106863   -0.177157   -0.549252   
....
....

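The file is several gigabytes in size, so I would like to read it in small blocks of rows. I would like to use NumPy's loadtxt function, since it quickly converts all the data into a NumPy array. So far I have not managed to do this, because the function seems to offer only column selection, as shown here: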

data = np.loadtxt("test.dat", delimiter='  ', skiprows=1, usecols=range(1,7))

Any ideas on how to achieve this? If it is not possible with loadtxt, are there other options available in Python?

- user4290866

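The fname argument of loadtxt can be a generator, so to read a small block of rows you can use a file-reading generator, such as the one in nosklo's answer at https://dev59.com/KnRB5IYBdhLWcg3wxZ7Y, modified to read a small number of lines rather than a number of bytes. - user4322779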
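A minimal sketch of the idea in this comment, assuming whitespace-separated numeric rows. The helper name line_window and the small temporary file are hypothetical and only serve the illustration; they are not taken from nosklo's answer.

```python
import os
import tempfile

import numpy as np


def line_window(path, start, count):
    """Yield `count` lines of `path`, beginning at 0-based line `start`."""
    with open(path) as f:
        for i, line in enumerate(f):
            if i >= start + count:
                break  # stop reading once the window has been passed
            if i >= start:
                yield line


# Demo on a small temporary file standing in for test.dat.
sample = "".join(f"{i} {i * 0.5} {i * 2}\n" for i in range(10))
with tempfile.NamedTemporaryFile("w", suffix=".dat", delete=False) as tmp:
    tmp.write(sample)

# loadtxt consumes the generator; ndmin=2 keeps a 2-D shape even for one row.
rows = np.loadtxt(line_window(tmp.name, start=3, count=4), ndmin=2)
os.remove(tmp.name)
print(rows.shape)
```

Lines before the window are still read from disk, but they are never parsed into floats, which is usually the expensive part.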

1

See: https://dev59.com/xIbca4cB1Zd3GeqPcO1t#27962976 - Fastest way to read every n-th row with numpy's genfromtxt - hpaulj

3 Answers

1

hpaulj pointed in the right direction in his comment.

For me, the following code solved the problem perfectly:

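import itertools

import numpy as np

with open('test.dat') as f_in:
    # islice(f_in, 1, 12) skips the first line and passes the next 11 lines on
    x = np.genfromtxt(itertools.islice(f_in, 1, 12, None), dtype=float)
    print(x[0, :])  # first of the rows that were read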
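The same islice call can be repeated on one open file handle to walk through the whole file block by block. The following is only a sketch under that assumption; iter_chunks is a hypothetical helper name, and an in-memory io.StringIO stands in for the real file.

```python
import io
import itertools

import numpy as np


def iter_chunks(f, nrows):
    """Yield successive 2-D arrays of at most `nrows` rows from open file `f`."""
    while True:
        # Each call resumes where the previous one stopped in the file.
        lines = list(itertools.islice(f, nrows))
        if not lines:
            break
        # atleast_2d keeps the shape consistent when a chunk has a single row.
        yield np.atleast_2d(np.genfromtxt(lines, dtype=float))


# Demo: 7 rows of 2 columns, read 3 rows at a time.
sample = "".join(f"{i} {i * 10}\n" for i in range(7))
shapes = [chunk.shape for chunk in iter_chunks(io.StringIO(sample), 3)]
print(shapes)
```

Only nrows lines are held in memory at any time, so the block size can be tuned to the available RAM.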

Thanks a lot!

- user4290866

0

You might want to use a recipe from itertools.

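from itertools import islice, zip_longest  # izip_longest was renamed in Python 3
import numpy as np


def grouper(n, iterable, fillvalue=None):
    # Collect items into groups of n, padding the last group with fillvalue.
    args = [iter(iterable)] * n
    return zip_longest(*args, fillvalue=fillvalue)


def lazy_reader(fp, nlines, sep, skiprows, usecols):
    with open(fp) as inp:
        # Skip the leading rows once for the whole file, not once per chunk.
        for chunk in grouper(nlines, islice(inp, skiprows, None), ""):
            # Drop the "" padding that fills up the last, shorter chunk.
            lines = [line for line in chunk if line]
            yield np.loadtxt(lines, delimiter=sep, usecols=usecols)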

This function returns a generator of arrays.

lazy_data = lazy_reader(...)
next(lazy_data)  # this will give you the next chunk
# or you can iterate 
for chunk in lazy_data:
    ...
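As one illustration of what the loop body might do, the sketch below accumulates per-column means across chunks without ever holding the full data set. The function column_means is a hypothetical helper, and the two small arrays stand in for the chunks that a generator such as lazy_reader would yield.

```python
import numpy as np


def column_means(chunks):
    """Return per-column means over an iterable of 2-D arrays."""
    total = None
    nrows = 0
    for chunk in chunks:
        chunk = np.atleast_2d(chunk)  # tolerate a single-row chunk
        col_sum = chunk.sum(axis=0)
        total = col_sum if total is None else total + col_sum
        nrows += chunk.shape[0]
    if total is None:
        raise ValueError("no chunks were provided")
    return total / nrows


# Demo: two chunks holding the rows [0, 1], [2, 3], [4, 5] and [6, 7].
demo_chunks = (np.arange(6.0).reshape(3, 2), np.array([[6.0, 7.0]]))
means = column_means(demo_chunks)
print(means)
```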

- Eli Korvigo

- yangjie · Accepted Answer

If you can use pandas, it is easier:

In [2]: import pandas as pd

In [3]: df = pd.read_table('test.dat', delimiter='  ', skiprows=1, usecols=range(1,7), nrows=3, header=None)

In [4]: df.values
Out[4]:
array([[ 2.82683  , -0.135555 ,  0.928033 , -0.196493 , -0.183131 ,
        -0.865999 ],
       [ 0.379297 , -1.00048  , -0.654541 , -0.0906588,  0.401206 ,
         0.44239  ],
       [-0.0794765, -2.68794  , -0.84847  ,  0.931357 , -0.31156  ,
         0.552622 ]])

Edit

If you want to read the file k rows at a time, specify chunksize. For example:

reader = pd.read_table('test.dat', delimiter='  ', usecols=range(1,7), header=None, chunksize=2)
for chunk in reader:
    print(chunk.values)

Output:

[[ 2.6888   -2.09945  -0.495947  0.835799  0.215353  0.695579]
 [ 2.82683  -0.135555  0.928033 -0.196493 -0.183131 -0.865999]]
[[ 0.379297  -1.00048   -0.654541  -0.0906588  0.401206   0.44239  ]
 [-0.0794765 -2.68794   -0.84847    0.931357  -0.31156    0.552622 ]]
[[-0.0235102 -1.05206    0.065747  -0.106863  -0.177157  -0.549252 ]]

How you store the chunks inside the for loop is up to you. Note that in this case reader is a TextFileReader rather than a DataFrame, so you can iterate over it lazily.
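One possible way to store the chunks, shown only as a sketch: filter each chunk as it arrives and concatenate the survivors at the end. The in-memory sample, the 0.4 threshold, and the use of sep=r"\s+" (any run of whitespace) are assumptions made for the illustration and are not part of the answer.

```python
import io

import pandas as pd

# Small in-memory stand-in for test.dat: 5 rows, 3 whitespace-separated columns.
sample = "".join(f"{i}  {i * 0.5}  {-i}\n" for i in range(5))

reader = pd.read_csv(io.StringIO(sample), sep=r"\s+", header=None, chunksize=2)

kept = []
for chunk in reader:
    # Keep only the rows of interest so the full file never sits in memory.
    kept.append(chunk[chunk[1] > 0.4])

result = pd.concat(kept, ignore_index=True)
print(result.shape)
```

If every row is needed, reducing each chunk to a summary (sums, counts, histograms) inside the loop keeps memory use bounded in the same way.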

You can read this for more details.