How do I iterate over a file starting from a specific line?

Question

4

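I am iterating over the lines of a file with enumerate(), and sometimes I need to start iterating from a specific line. I tried test_file.seek() for this; for example, to restart iteration at line 10, I call test_file.seek(10):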

test_file.seek(10)

for i, line in enumerate(test_file):
    …

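However, test_file always starts iterating from the first line (line 0). What am I doing wrong? Why doesn't seek() work? If there is a better way to do this, please share it.

Thanks in advance; I will be sure to upvote/accept the answer.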

- Jo Ko

3

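Doesn't seek(10) position the file at the 10th byte? - Eric Duminil

Have you read the documentation for the seek method? - iafisher

I think it would be wise to state explicitly that efficiency is your main concern. That way you are more likely to get answers about the fastest options, such as linecache / islice. - axwr

Possible duplicate of "Python fastest access to line in file". - Mad Physicist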

5 Answers

3

One native Python approach is to use zip to consume the lines you don't need.

with open("text.txt","r") as test_file:
    for _ in zip(range(10), test_file): pass
    for i, line in enumerate(test_file,start=10):
        print(i, line)
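
A related way to discard the leading lines is the "consume" idiom from the itertools recipes, sketched below. The helper name skip_lines is hypothetical, and io.StringIO merely stands in for an open text file so the snippet runs on its own.

```python
import io
from itertools import islice

def skip_lines(lines, n):
    """Advance the iterator `lines` by n items (fewer if it runs out) and return it."""
    # islice(lines, n, n) yields nothing, but advancing it consumes up to n items.
    next(islice(lines, n, n), None)
    return lines

# Stand-in for an open text file containing 15 lines.
sample = io.StringIO("".join(f"line {k}\n" for k in range(15)))

remaining = [(i, line.rstrip("\n"))
             for i, line in enumerate(skip_lines(sample, 10), start=10)]
print(remaining[0])  # (10, 'line 10')
```

Like the zip version, this still reads the skipped lines; it only avoids writing an explicit loop for them.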

- Neil

2

Personally, I would just use an if statement. It is basic, but at least it is very easy to understand.

with open("file") as fp:
    for i, line in enumerate(fp):
        if i >= 10:
            pass  # do stuff.

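Edit: regarding islice, the comparison in "Python fastest access to line in file" does a better job than I could. Combined with the itertools documentation, https://docs.python.org/2/library/itertools.html, I don't think you need anything more.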

- axwr

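But for optimization it would be better to use seek(). That way there is no need to iterate over the unneeded lines. - Jo Ko

If efficiency is a concern, I would recommend itertools.islice. That way you don't even need to load the already-consumed lines into memory. - axwr

Would you mind showing an example that uses itertools.islice? - Jo Ko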

2

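@Jo Ko: You can't. A line is defined by certain characters, and something has to read them to know where they are, unless you build an external index for the file. - Max

Right, islice does not prevent each line from being loaded into memory; it just iterates over the lines on demand. The same is true of this solution, but islice is meant for taking slices. - juanpa.arrivillaga

@JoKo. I have a suggestion in my answer that uses the low-level "read" method. I don't think it is any more efficient than iterating, but you may like it anyway. - Mad Physicist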

2

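The seek method can only help you if all the lines in the file have the same length, you know this in advance, and your file is binary or at least plain ASCII text (i.e., no variable-width characters). In that case you really can do the following: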

test_file.seek(10 * (length_of_line + 1), os.SEEK_SET)

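This works because seek moves the internal file pointer by a fixed number of bytes, not lines. The +1 above accounts for the newline character. On Windows machines, which use \r\n line terminators, you may need to make it +2.

If your file is non-ASCII, this approach will not work, because some lines may have the same length in characters but contain different numbers of bytes, which makes the result of the call to seek undefined.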
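
The fixed-width case can be sketched as follows. This is only an illustration under stated assumptions: every line holds exactly 8 ASCII bytes plus a single "\n", the file is opened in binary mode so that arbitrary byte offsets are well defined, and the names RECORD_LEN and lines_from are made up for the example.

```python
import os
import tempfile

RECORD_TEXT_LEN = 8                # assumed: every line has exactly 8 bytes of content
RECORD_LEN = RECORD_TEXT_LEN + 1   # plus one byte for the "\n" terminator

# Build a sample file of 20 fixed-width lines ("row 0000" ... "row 0019").
# Binary mode guarantees that "\n" is written as a single byte on every platform.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    for k in range(20):
        tmp.write(f"row {k:04d}\n".encode("ascii"))
    path = tmp.name

def lines_from(path, first_line):
    """Yield (line_number, text) pairs starting at the 0-based line `first_line`."""
    with open(path, "rb") as f:
        # Jump straight to the first byte of the requested line.
        f.seek(first_line * RECORD_LEN, os.SEEK_SET)
        for lineno, raw in enumerate(f, start=first_line):
            yield lineno, raw.decode("ascii").rstrip("\n")

result = list(lines_from(path, 10))
os.remove(path)
print(result[0])  # (10, 'row 0010')
```

If even one line deviates from the assumed width, every later offset is wrong, which is why this only suits files with a strictly fixed record layout.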

There are a few legitimate ways to skip the first 10 lines:

1. Read the whole file into a list and discard the first 10 lines:
```
with open(...) as test_file:
    test_data = list(test_file)[10:]
```
Now test_data contains all the lines in your file besides the first 10.

2. Discard lines from the file as you read it in a for loop using enumerate:
```
with open(...) as test_file:
    for lineno, line in enumerate(test_file):
        if lineno < 10:
            continue
        # Do something with the line
```
This method has the advantage of not storing the unnecessary lines. This is functionally similar to using itertools.islice as some of the other answers suggest.

3. Use some really arcane low-level stuff to actually read 10 newline characters from the file before proceeding normally. You may have to specify the encoding of the file up-front for this to work correctly with text I/O, but it should work out-of-the-box for ASCII files (see here for more details):
```
newline_count = 10
with open(..., encoding='utf-8') as test_file:
    while newline_count > 0:
        next_char = test_file.read(1)
        if next_char == '\n':
            newline_count -= 1
    # You have skipped 10 lines, so process normally here.
```
This option is not particularly robust. It does not handle the case where there are fewer than 10 lines gracefully, and it re-implements the internal machinery of the built-in file iterator very crudely. The only possible advantage it offers is that it does not buffer entire lines like the iterator does.
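
The short-file case mentioned above can be handled by checking for the empty string that read() returns at end of file. The following is a minimal sketch, not a recommendation over the file iterator; the function name skip_newlines is hypothetical, and io.StringIO stands in for a file opened in text mode.

```python
import io

def skip_newlines(text_file, count):
    """Consume characters until `count` newlines have been read.

    Returns the number of lines actually skipped, which is smaller than
    `count` if the file ends first.
    """
    skipped = 0
    while skipped < count:
        ch = text_file.read(1)
        if ch == "":        # read() returns an empty string at end of file
            break
        if ch == "\n":
            skipped += 1
    return skipped

# A file with only 3 lines: the loop stops at end of file instead of spinning forever.
short = io.StringIO("a\nb\nc\n")
short_skipped = skip_newlines(short, 10)
print(short_skipped)  # 3

# A file with 12 lines: after skipping 10, the remaining two are still readable.
longer = io.StringIO("".join(f"{k}\n" for k in range(12)))
skip_newlines(longer, 10)
rest = longer.read()
print(repr(rest))  # '10\n11\n'
```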

- Mad Physicist

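Unless it is a binary file, test_file.seek(10 * (length_of_line + 1)) is undefined. From the Python documentation: "offset must either be a number returned by TextIOBase.tell(), or zero. Any other offset value produces undefined behaviour." - iafisher

@iafisher. Good catch. Fixed. - Mad Physicist

I think there is still a problem. The whence argument (the second argument) defaults to os.SEEK_SET; the issue is that the offset argument (the first argument) may only be 0 or a value returned by a call to tell. This is the same as the fseek function in C. - iafisher

@iafisher. You're right. I think the problem arises with non-ASCII text files, because even low-level functions like read(1) can return a multi-byte character as a single unit. I will add a note similar to the one I made in item #3. - Mad Physicist

@iafisher. Please let me know if you approve of the latest edit. I think it corrects the issue you noticed. - Mad Physicist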

1

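You can't use seek() to position the file at the start of a particular line unless you know the byte offset of the first character of that line.

A simple alternative is the islice() iterator from the itertools module.

For example, suppose you have a very large input file that looks like this:

Sample code: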

from __future__ import print_function
from itertools import islice

with open('test_file.txt') as test_file:
    for i, line in enumerate(islice(test_file, 9, None), 10):
        print('line #{}: {}'.format(i, line), end='')

Output:

line #10: 10
line #11: 11
line #12: 12
line #13: 13
line #14: 14
line #15: 15
line #16: 16
line #17: 17
line #18: 18
line #19: 19
line #20: 20
line #21: 21
line #22: 22
...

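Note: islice() counts from zero, which is why its first argument is 9 rather than 10. Also, this approach is not as fast as using seek(), because islice() actually reads every line until it reaches the one you want to start from.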
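
If the same file has to be re-read from arbitrary lines many times, the byte offsets required by seek() can be recorded in a single pass and reused. The sketch below assumes a UTF-8 file opened in binary mode; build_line_index and read_from_line are hypothetical helper names, and the sample file is generated only to keep the snippet self-contained.

```python
import os
import tempfile

def build_line_index(path):
    """Return a list whose entry k is the byte offset at which line k starts."""
    offsets = []
    position = 0
    with open(path, "rb") as f:
        for raw in f:
            offsets.append(position)
            position += len(raw)   # len() of a bytes object counts bytes
    return offsets

def read_from_line(path, offsets, first_line):
    """Yield (line_number, text) pairs starting at the 0-based line `first_line`."""
    with open(path, "rb") as f:
        f.seek(offsets[first_line])   # raises IndexError if the line does not exist
        for lineno, raw in enumerate(f, start=first_line):
            yield lineno, raw.decode("utf-8").rstrip("\r\n")

# Sample file with variable-length, non-ASCII lines.
lines = [f"ligne {k} " + "é" * k for k in range(15)]
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(("\n".join(lines) + "\n").encode("utf-8"))
    path = tmp.name

offsets = build_line_index(path)          # one full pass over the file
indexed = list(read_from_line(path, offsets, 10))
os.remove(path)
print(indexed[0][0], len(indexed))  # 10 5
```

Building the index still reads the whole file once, so it only pays off when the file is revisited; the index also becomes stale as soon as the file changes.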

- martineau


- alexis · Accepted Answer

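Ordinary files are sequences of characters, both at the file system level and as far as Python is concerned; there is no low-level way to jump to a particular line. The seek() command counts its offset in bytes, not lines. (In principle, an explicit seek offset should only be used if the file was opened in binary mode. Seeking in a text file is "undefined behavior", since logical characters can take more than one byte.)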
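
A small illustration of the byte-based offset, using io.BytesIO as a stand-in for a file opened in binary mode (the sample data is made up for the example):

```python
import io

# Four short lines; "alpha\n" occupies bytes 0-5 and "bravo\n" bytes 6-11.
data = b"alpha\nbravo\ncharlie\ndelta\n"
f = io.BytesIO(data)

f.seek(10)                    # moves to byte 10, not to line 10
rest_of_line = f.readline()   # the tail of the second line
next_line = f.readline()      # iteration then continues with the third line

print(rest_of_line)  # b'o\n'
print(next_line)     # b'charlie\n'
```

Here seek(10) lands in the middle of the second line, and reading simply resumes from that byte.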

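If you want to skip a number of lines, your only option is to read and discard them. Since iterating over a file object fetches one line at a time, a concise way to make your code work is to use itertools.islice():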

from itertools import islice

skipped = islice(test_file, 10, None)  # Skip 10 lines, i.e. start at index 10
for i, line in enumerate(skipped, 11):
    print(i, line, end="")
    ...