How to get the line count of a large file cheaply in Python

Question

1289

How do I get the line count of a large file in the most memory- and time-efficient way?

def file_len(filename):
    with open(filename) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1

- SilentGhost

16

Do you need an exact line count, or will an approximation do? - pico

61

Since this code does not handle empty files, I suggest adding i = -1 before the for loop. - Maciek Sawicki

14

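@Legend: I bet pico is thinking of getting the file size (with seek(0,2) or an equivalent) and dividing it by an approximate line length. You could read the first few lines to guess the average line length. - Anne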

41

Use enumerate(f, 1) and drop the i + 1? - Ian Mackinnon

6

That works for empty files, but i must be initialized to 0 before the for loop. - scai

45 Answers

- Jean-Francois T. · Answer 1

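There are already many answers with timing comparisons, but I believe they measure performance against the number of lines only (e.g., the excellent graph from Nico Schlömer: https://dev59.com/X3RA5IYBdhLWcg3wvQlh#68385697).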

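To measure performance accurately, we should consider the following factors:

the number of lines
the average line length
... the total size of the file (which may affect memory)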

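First of all, the OP's function (with a for loop) and the sum(1 for line in f) function do not perform that well...

Good contenders are the mmap-based and buffer-based approaches.

To summarize, based on my analysis (Python 3.9 on Windows, with an SSD):

For big files with relatively short lines (within 100 characters): use the buffered function buf_count_newlines_gen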

def buf_count_newlines_gen(fname: str) -> int:
    """Count the number of lines in a file"""
    def _make_gen(reader):
        b = reader(1024 * 1024)
        while b:
            yield b
            b = reader(1024 * 1024)

    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count

For files with potentially longer lines (up to 2000 characters), regardless of the number of lines: use the mmap-based function count_nb_lines_mmap.

def count_nb_lines_mmap(file: Path) -> int:
    """Count the number of lines in a file"""
    with open(file, mode="rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        nb_lines = 0
        while mm.readline():
            nb_lines += 1
        mm.close()
        return nb_lines

For short code with very good performance (especially for medium-sized files):

def itercount(filename: str) -> int:
    """Count the number of lines in a file"""
    # Plain 'rb': the 'U' flag was deprecated and is rejected by Python 3.11+
    with open(filename, 'rb') as f:
        return sum(1 for _ in f)

Here is a summary of the different metrics (average time measured with timeit over 7 runs of 10 loops each):

Function	Small file, short lines	Small file, long lines	Big file, short lines	Big file, long lines	Bigger file, short lines
... size ...	0.04 MB	1.16 MB	17 MB	318 MB	328 MB
... nb lines ...	915 lines < 100 chars	915 lines < 2000 chars	389,000 lines < 100 chars	389,000 lines < 2000 chars	9.8 million lines < 100 chars
`count_nb_lines_blocks`	0.183 ms	1.718 ms	36.799 ms	415.393 ms	517.920 ms
`count_nb_lines_mmap`	0.185 ms	0.582 ms	44.801 ms	185.461 ms	691.637 ms
`buf_count_newlines_gen`	0.665 ms	1.032 ms	15.620 ms	213.458 ms	318.939 ms
`itercount`	0.135 ms	0.817 ms	31.292 ms	223.120 ms	628.760 ms

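Note: I also compared count_nb_lines_mmap and buf_count_newlines_gen on an 8 GB file containing 9.7 million lines of more than 800 characters. buf_count_newlines_gen averaged 5.39 s versus 4.2 s for count_nb_lines_mmap, so the latter does seem better for files with longer lines.

Here is the code I used: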

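import mmap
from pathlib import Path
from statistics import mean  # used to average the timings below
from timeit import Timer     # used by the benchmark loop below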

def count_nb_lines_blocks(file: Path) -> int:
    """Count the number of lines in a file"""

    def blocks(files, size=65536):
        while True:
            b = files.read(size)
            if not b:
                break
            yield b

    with open(file, encoding="utf-8", errors="ignore") as f:
        return sum(bl.count("\n") for bl in blocks(f))


def count_nb_lines_mmap(file: Path) -> int:
    """Count the number of lines in a file"""
    with open(file, mode="rb") as f:
        mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        nb_lines = 0
        while mm.readline():
            nb_lines += 1
        mm.close()
        return nb_lines


def count_nb_lines_sum(file: Path) -> int:
    """Count the number of lines in a file"""
    with open(file, "r", encoding="utf-8", errors="ignore") as f:
        return sum(1 for line in f)


def count_nb_lines_for(file: Path) -> int:
    """Count the number of lines in a file"""
    i = 0
    with open(file) as f:
        for i, _ in enumerate(f, start=1):
            pass
    return i


def buf_count_newlines_gen(fname: str) -> int:
    """Count the number of lines in a file"""
    def _make_gen(reader):
        b = reader(1024 * 1024)
        while b:
            yield b
            b = reader(1024 * 1024)

    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count


def itercount(filename: str) -> int:
    """Count the number of lines in a file"""
    # Plain 'rb': the 'U' flag was deprecated and is rejected by Python 3.11+
    with open(filename, 'rb') as f:
        return sum(1 for _ in f)


files = [small_file, big_file, small_file_shorter, big_file_shorter, small_file_shorter_sim_size, big_file_shorter_sim_size]
for file in files:
    print(f"File: {file.name} (size: {file.stat().st_size / 1024 ** 2:.2f} MB)")
    for func in [
        count_nb_lines_blocks,
        count_nb_lines_mmap,
        count_nb_lines_sum,
        count_nb_lines_for,
        buf_count_newlines_gen,
        itercount,
    ]:
        result = func(file)
        time = Timer(lambda: func(file)).repeat(7, 10)
        print(f" * {func.__name__}: {result} lines in {mean(time) / 10 * 1000:.3f} ms")
    print()

File: small_file.ndjson (size: 1.16 MB)
 * count_nb_lines_blocks: 915 lines in 1.718 ms
 * count_nb_lines_mmap: 915 lines in 0.582 ms
 * count_nb_lines_sum: 915 lines in 1.993 ms
 * count_nb_lines_for: 915 lines in 3.876 ms
 * buf_count_newlines_gen: 915 lines in 1.032 ms
 * itercount: 915 lines in 0.817 ms

File: big_file.ndjson (size: 317.99 MB)
 * count_nb_lines_blocks: 389000 lines in 415.393 ms
 * count_nb_lines_mmap: 389000 lines in 185.461 ms
 * count_nb_lines_sum: 389000 lines in 485.370 ms
 * count_nb_lines_for: 389000 lines in 967.075 ms
 * buf_count_newlines_gen: 389000 lines in 213.458 ms
 * itercount: 389000 lines in 223.120 ms

File: small_file__shorter.ndjson (size: 0.04 MB)
 * count_nb_lines_blocks: 915 lines in 0.183 ms
 * count_nb_lines_mmap: 915 lines in 0.185 ms
 * count_nb_lines_sum: 915 lines in 0.251 ms
 * count_nb_lines_for: 915 lines in 0.244 ms
 * buf_count_newlines_gen: 915 lines in 0.665 ms
 * itercount: 915 lines in 0.135 ms

File: big_file__shorter.ndjson (size: 17.42 MB)
 * count_nb_lines_blocks: 389000 lines in 36.799 ms
 * count_nb_lines_mmap: 389000 lines in 44.801 ms
 * count_nb_lines_sum: 389000 lines in 59.068 ms
 * count_nb_lines_for: 389000 lines in 81.387 ms
 * buf_count_newlines_gen: 389000 lines in 15.620 ms
 * itercount: 389000 lines in 31.292 ms

File: small_file__shorter_sim_size.ndjson (size: 1.21 MB)
 * count_nb_lines_blocks: 36457 lines in 1.920 ms
 * count_nb_lines_mmap: 36457 lines in 2.615 ms
 * count_nb_lines_sum: 36457 lines in 3.993 ms
 * count_nb_lines_for: 36457 lines in 6.011 ms
 * buf_count_newlines_gen: 36457 lines in 1.363 ms
 * itercount: 36457 lines in 2.147 ms

File: big_file__shorter_sim_size.ndjson (size: 328.19 MB)
 * count_nb_lines_blocks: 9834248 lines in 517.920 ms
 * count_nb_lines_mmap: 9834248 lines in 691.637 ms
 * count_nb_lines_sum: 9834248 lines in 1109.669 ms
 * count_nb_lines_for: 9834248 lines in 1683.859 ms
 * buf_count_newlines_gen: 9834248 lines in 318.939 ms
 * itercount: 9834248 lines in 628.760 ms
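One caveat when comparing the functions above: the ones that count b"\n" bytes and the ones that iterate over lines agree only when the file ends with a newline. The following is a minimal sketch of that difference, not part of the benchmark; the names count_newlines and count_lines and the demo file are hypothetical and chosen for illustration.

```python
import os
import tempfile


def count_newlines(path: str, chunk_size: int = 1024 * 1024) -> int:
    """Count newline bytes, reading the file in fixed-size chunks."""
    total = 0
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += chunk.count(b"\n")
    return total


def count_lines(path: str, chunk_size: int = 1024 * 1024) -> int:
    """Count lines, including a last line that has no trailing newline."""
    total = 0
    last_byte = b""
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            total += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A non-empty file that does not end in a newline has one more line.
    if last_byte and last_byte != b"\n":
        total += 1
    return total


# Hypothetical demo file: three lines, the last one without a newline.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"a\nb\nc")
    demo_path = tmp.name

newline_count = count_newlines(demo_path)
line_count = count_lines(demo_path)
os.remove(demo_path)
print(newline_count, line_count)  # prints: 2 3
```

For files that always end with a newline (as most generated data files do), the two counts are identical, so the extra check only matters when that cannot be guaranteed.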

- 0x90 · Answer 2

Create an executable script file named count.py:

#!/usr/bin/python

import sys
count = 0
for line in sys.stdin:
    count+=1
print(count)  # report the result; without this the script prints nothing

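Then pipe the file's contents into the Python script: cat huge.txt | ./count.py. Pipes also work in PowerShell, so you will end up with the line count.

For me, on Linux, it was 30% faster than the naive solution:

count = 0
with open('huge.txt') as f:
    for line in f:  # iterate over the lines; each one increments the counter
        count += 1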

- DesiKeki · Answer 3

I would use the simplest and shortest approach:

with open("my_file.txt", "r") as f:  # the with block closes the file afterwards
    print(len(f.readlines()))

- Captain Peter · Answer 4

I found that you can simply do this:

f = open("data.txt")
linecount = len(f.readlines())

That will give you the answer.

- odwl · Answer 5

How about this?

import itertools

def file_len(fname):
    counts = itertools.count()
    with open(fname) as f:
        for _ in f:
            next(counts)  # Python 3: next(counts) replaces counts.next()
    return next(counts)

- Victor · Answer 6

You can use the subprocess module to run wc -l as follows:

import subprocess
# Pass the arguments as a list so the path is not interpreted by a shell.
Number_lines = int(subprocess.check_output(["wc", "-l", Filename]).split()[0])

where Filename is the absolute path of the file.

- Georg Schölly · Answer 7

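Why not read the first 100 and the last 100 lines, estimate the average line length, and then divide the total file size by that number? If you don't need an exact value, this could work.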
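A rough sketch of this estimation idea follows. It is a variation rather than the exact procedure described: it samples a fixed number of bytes from the start and the end of the file instead of exactly 100 lines, which avoids having to locate line boundaries from the end. The function name estimate_line_count and the sample_bytes parameter are assumptions made for illustration.

```python
import os
import tempfile


def estimate_line_count(path: str, sample_bytes: int = 64 * 1024) -> int:
    """Estimate the line count from samples at the start and end of a file."""
    size = os.path.getsize(path)
    if size == 0:
        return 0
    with open(path, "rb") as f:
        head = f.read(sample_bytes)
        if size > 2 * sample_bytes:
            f.seek(-sample_bytes, os.SEEK_END)
            tail = f.read(sample_bytes)
        else:
            # Small file: read the rest, so the samples cover every byte.
            tail = f.read()
    sampled = len(head) + len(tail)
    newlines = head.count(b"\n") + tail.count(b"\n")
    if newlines == 0:
        return 1  # no newline in the samples; assume a single long line
    # size / (sampled / newlines), i.e. file size over average line length
    return round(size * newlines / sampled)


# Hypothetical demo file: 1000 lines of 10 bytes each.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"xxxxxxxxx\n" * 1000)
    demo_path = tmp.name

estimate = estimate_line_count(demo_path, sample_bytes=1024)
os.remove(demo_path)
print(estimate)  # close to 1000, not necessarily exact
```

The estimate is only as good as the assumption that the sampled regions are representative; files whose line lengths vary a lot between the middle and the ends will be off accordingly.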

- Karthik · Answer 8

If the file fits in memory, then

with open(fname, 'rb') as f:  # binary mode, so that splitting on b'\n' works
    count = len(f.read().split(b'\n')) - 1

- J.J. · Answer 9

Another possibility:

import subprocess

def num_lines_in_file(fpath):
    # Pass the arguments as a list to avoid shell-quoting problems with the path.
    return int(subprocess.check_output(['wc', '-l', fpath]).split()[0])

- Jet Blue · Answer 10

If all lines in your file have the same length (and contain only ASCII characters)*, you can do the following very cheaply:

fileSize     = os.path.getsize( pathToFile )  # file size in bytes
bytesPerLine = someInteger                    # don't forget to account for the newline character
numLines     = fileSize // bytesPerLine

*If Unicode characters such as é are used, more effort is needed to determine the number of bytes in a line.
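To illustrate the footnote, here is a minimal sketch that measures the byte length of the first line instead of assuming one byte per character. It assumes UTF-8 encoding and that every line has the same length in bytes (not merely in characters); the sample records and variable names are hypothetical.

```python
import os
import tempfile

# Hypothetical fixed-width records: 5 characters each, plus a newline.
records = ["café1", "café2", "café3"]

with tempfile.NamedTemporaryFile(
    "w", encoding="utf-8", newline="\n", delete=False
) as tmp:
    for record in records:
        tmp.write(record + "\n")
    path = tmp.name

# Measure the first line in bytes: "é" takes 2 bytes in UTF-8.
with open(path, "rb") as f:
    bytes_per_line = len(f.readline())

file_size = os.path.getsize(path)
num_lines = file_size // bytes_per_line
os.remove(path)

chars_per_line = len(records[0]) + 1
print(chars_per_line, bytes_per_line, num_lines)  # prints: 6 7 3
```

Dividing by the character count (6) instead of the byte count (7) would give the wrong result here, which is why the line length has to be taken in bytes.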