How to get the line count of a large file cheaply in Python

Question


1289

How do I get the line count of a large file in the most memory- and time-efficient way?

def file_len(filename):
    with open(filename) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1

- SilentGhost

16

Do you need an exact line count, or will an approximation do? - pico

61

Since this code fails on empty files, I suggest adding i = -1 before the for loop. - Maciek Sawicki

14

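@Legend: I bet pico is thinking of getting the file size (with seek(0,2) or an equivalent) and dividing it by an approximate line length. You could read a few lines at the beginning to guess the average line length. - Anne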
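The estimate described in the comment above could be sketched roughly as follows. This is only an illustration, assuming Python 3 and assuming the first lines are representative of the whole file; the name estimate_line_count and the sample_lines parameter are made up for the example.

```python
import os


def estimate_line_count(filename, sample_lines=100):
    """Estimate the line count from the file size and a sample of leading lines."""
    file_size = os.path.getsize(filename)  # the same number seek(0, 2) would report
    sampled_bytes = 0
    sampled = 0
    with open(filename, "rb") as f:
        for line in f:
            sampled_bytes += len(line)
            sampled += 1
            if sampled >= sample_lines:
                break
    if sampled == 0:
        return 0  # empty file
    average_length = sampled_bytes / sampled
    return int(round(file_size / average_length))
```

The result is exact only when every line has the same length; for files with uneven line lengths it is a rough guess.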

41

Use enumerate(f, 1) instead and drop the i + 1? - Ian Mackinnon

6

That works for empty files, but you have to initialize i to 0 before the for loop. - scai
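Putting the two suggestions above together (enumerate starting at 1, plus initializing the counter before the loop) might look like the sketch below. It is one possible reading of the comments, not the asker's final code.

```python
def file_len(filename):
    count = 0  # stays 0 when the file is empty and the loop body never runs
    with open(filename) as f:
        for count, _ in enumerate(f, 1):
            pass
    return count
```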


45 Answers

428

You can't get any better than that.

After all, any solution has to read the entire file, count how many \n characters it contains, and return that result.

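Do you have a better way of doing that without reading the entire file? Not sure... The best solution will always be I/O-bound; the best you can do is make sure you don't use unnecessary memory, and it looks like you already have that covered.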

[Edit May 2023]

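As commented in other answers, there are better alternatives in Python 3. The for loop is not the most efficient option; for example, using mmap or buffers is more efficient.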
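One detail about "counting the \n characters" may be worth spelling out: a file whose last line has no trailing newline contains one more line than it has newline characters. The sketch below is not part of the original answer, and the names count_lines and buf_size are illustrative; it counts newlines in binary chunks and adds one when the final byte is not a newline.

```python
def count_lines(filename, buf_size=1024 * 1024):
    newlines = 0
    last_byte = b"\n"  # treat an empty file as if it ended with a newline
    with open(filename, "rb") as f:
        while True:
            buf = f.read(buf_size)
            if not buf:
                break
            newlines += buf.count(b"\n")
            last_byte = buf[-1:]
    # A final line without a terminating "\n" still counts as a line.
    return newlines if last_byte == b"\n" else newlines + 1
```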

- Yuval Adam

8

Exactly, even wc reads through the file, but it is written in C and is probably well optimized. - Ólafur Waage

7

As far as I understand, Python file I/O is done through C as well. http://docs.python.org/library/stdtypes.html#file-objects - Tomalak

11

That's a red herring. While Python and wc might issue the same system calls, Python has opcode dispatch overhead that wc doesn't have. - bobpoekert

4

You can approximate the line count by sampling. That can be thousands of times faster. See: http://www.documentroot.com/2011/02/approximate-line-count-for-very-large.html - Erik Aronesty

6

Other answers seem to indicate that this categorical answer is wrong, so it should be deleted rather than kept as the accepted answer. - Skippy le Grand Gourou


229

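I believe that a memory-mapped file will be the fastest solution. I tried four functions: the function posted by the OP (opcount); a simple iteration over the lines in the file (simplecount); readline on a memory-mapped file (mmap) (mapcount); and the buffered-read solution offered by Mykola Kharechko (bufcount).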

I ran each function five times and calculated the average run time for a 1.2-million-line text file. Windows XP, Python 2.5, 2 GB RAM, 2 GHz AMD processor.

Here are my results (average seconds per run):

mapcount : 0.465599966049
simplecount : 0.756399965286
bufcount : 0.546800041199
opcount : 0.718600034714

Numbers for Python 2.6:

mapcount : 0.471799945831
simplecount : 0.634400033951
bufcount : 0.468800067902
opcount : 0.602999973297

So the buffered-read strategy seems to be the fastest on Windows/Python 2.6.

Here is the code:

from __future__ import with_statement
import time
import mmap
import random
from collections import defaultdict

def mapcount(filename):
    with open(filename, "r+") as f:
        buf = mmap.mmap(f.fileno(), 0)
        lines = 0
        readline = buf.readline
        while readline():
            lines += 1
        return lines

def simplecount(filename):
    lines = 0
    with open(filename) as f:  # close the file when done
        for line in f:
            lines += 1
    return lines

def bufcount(filename):
    lines = 0
    buf_size = 1024 * 1024
    with open(filename) as f:  # close the file when done
        read_f = f.read # loop optimization

        buf = read_f(buf_size)
        while buf:
            lines += buf.count('\n')
            buf = read_f(buf_size)

    return lines

def opcount(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1


counts = defaultdict(list)

for i in range(5):
    for func in [mapcount, simplecount, bufcount, opcount]:
        start_time = time.time()
        assert func("big_file.txt") == 1209138
        counts[func].append(time.time() - start_time)

for key, vals in counts.items():
    # Formatted so that it prints the same text on Python 2 and Python 3
    print("%s : %s" % (key.__name__, sum(vals) / float(len(vals))))

- Ryan Ginstrom

34

It seems that wccount() is the fastest: https://gist.github.com/0ac760859e614cd03652 - jfs

Buffered reading is the fastest solution, not mmap or wccount. See https://dev59.com/X3RA5IYBdhLWcg3wvQlh#68385697. - Nico Schlömer

@NicoSchlömer It depends on the characteristics of your file. See https://dev59.com/X3RA5IYBdhLWcg3wvQlh#76197308 for a comparison of both on different files. - Jean-Francois T.

218

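All of these solutions ignore one way to make this run considerably faster, namely by using the unbuffered (raw) interface, using bytearrays, and doing your own buffering. (This only applies in Python 3. In Python 2, the raw interface may or may not be used by default, but in Python 3 you will default to Unicode.)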

Using a modified version of the timing tool, I believe the following code is faster (and marginally more Pythonic) than any of the solutions offered:

def rawcount(filename):
    lines = 0
    buf_size = 1024 * 1024
    with open(filename, 'rb') as f:  # close the file when done
        read_f = f.raw.read  # bypass the buffered layer

        buf = read_f(buf_size)
        while buf:
            lines += buf.count(b'\n')
            buf = read_f(buf_size)

    return lines

Using a separate generator function, this runs a smidge faster:

def _make_gen(reader):
    b = reader(1024 * 1024)
    while b:
        yield b
        b = reader(1024*1024)

def rawgencount(filename):
    with open(filename, 'rb') as f:  # close the file when done
        f_gen = _make_gen(f.raw.read)
        return sum(buf.count(b'\n') for buf in f_gen)

This can be done entirely with in-line generator expressions and itertools, but it gets pretty weird looking:

from itertools import (takewhile, repeat)

def rawincount(filename):
    with open(filename, 'rb') as f:  # close the file when done
        bufgen = takewhile(lambda x: x, (f.raw.read(1024*1024) for _ in repeat(None)))
        return sum(buf.count(b'\n') for buf in bufgen)

Here are my timings:

function      average, s  min, s   ratio
rawincount        0.0043  0.0041   1.00
rawgencount       0.0044  0.0042   1.01
rawcount          0.0048  0.0045   1.09
bufcount          0.008   0.0068   1.64
wccount           0.01    0.0097   2.35
itercount         0.014   0.014    3.41
opcount           0.02    0.02     4.83
kylecount         0.021   0.021    5.05
simplecount       0.022   0.022    5.25
mapcount          0.037   0.031    7.46

- Michael Bacon

45

I'm working with 100 GB+ files, and your rawgencount is the only feasible solution I have seen so far. Thanks! - soungalo

2

Is wccount in this table the one that calls the wc shell tool in a subprocess? - Anentropic

7

Thanks @michael-bacon, it's a really nice solution. You can make the rawincount solution less weird looking by using bufgen = iter(partial(f.raw.read, 1024*1024), b'') instead of combining takewhile and repeat. - Peter H.
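Written out in full, the variant suggested in the comment above could look like this. It is a sketch for Python 3; the function name rawincount_partial and the buf_size parameter are invented for the example.

```python
from functools import partial


def rawincount_partial(filename, buf_size=1024 * 1024):
    with open(filename, "rb") as f:
        # iter(callable, sentinel) keeps calling f.raw.read(buf_size)
        # until it returns b"" (end of file).
        bufgen = iter(partial(f.raw.read, buf_size), b"")
        return sum(buf.count(b"\n") for buf in bufgen)
```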

2

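Oh, the partial function, yeah, that's a nice little tweak. Also, I assumed that 1024*1024 would be merged by the interpreter and treated as a constant, but that was a hunch, not something from the documentation. - Michael Bacon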

3

Is opening the file with buffering=0 and then calling read() faster than opening it in "rb" mode and calling raw.read()? Or will the two be optimized to the same thing? - Avraham
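For reference, the buffering=0 form asked about above can be written as in the sketch below (Python 3; unbuffered_count is a made-up name). In binary mode, open() with buffering=0 returns the raw io.FileIO object itself, which is the same type that f.raw exposes on a buffered file, so both spellings end up calling FileIO.read. Whether one is measurably faster is not something this sketch tries to answer.

```python
import io


def unbuffered_count(filename, buf_size=1024 * 1024):
    # buffering=0 is only allowed in binary mode; open() then returns the
    # raw io.FileIO object directly instead of wrapping it in a BufferedReader.
    with open(filename, "rb", buffering=0) as f:
        assert isinstance(f, io.FileIO)
        lines = 0
        buf = f.read(buf_size)
        while buf:
            lines += buf.count(b"\n")
            buf = f.read(buf_size)
        return lines
```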


105

You could execute a subprocess and run wc -l filename:

import subprocess

def file_len(fname):
    p = subprocess.Popen(['wc', '-l', fname], stdout=subprocess.PIPE, 
                                              stderr=subprocess.PIPE)
    result, err = p.communicate()
    if p.returncode != 0:
        raise IOError(err)
    return int(result.strip().split()[0])

- Ólafur Waage

8

What would the Windows version of this be? - SilentGhost

2

You can refer to this Stack Overflow question regarding that: https://dev59.com/t3VC5IYBdhLWcg3wliKe - Ólafur Waage

7

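Indeed, on my machine (Mac OS X) this took 0.13 s, versus 0.5 s for counting the lines that "for x in file(...)" produces, versus 1.0 s for counting repeated calls to str.find or mmap.find. (The file I used to test this has 1.3 million lines.) - bendin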

1

On the command line (without the overhead of creating another shell), this is just as fast as the clearer and more portable Python-only solution. See also: https://dev59.com/jnRA5IYBdhLWcg3wuAcp - Davide

3

Not cross-platform. - e-info128
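One way to soften the portability problem raised above is to call wc only when it is actually available and fall back to a pure-Python count otherwise. The sketch below assumes Python 3.3+ (for shutil.which); the helper names line_count and count_newlines_python are hypothetical.

```python
import shutil
import subprocess


def count_newlines_python(filename, buf_size=2 ** 16):
    """Pure-Python fallback: count b"\\n" in fixed-size binary chunks."""
    lines = 0
    with open(filename, "rb") as f:
        buf = f.read(buf_size)
        while buf:
            lines += buf.count(b"\n")
            buf = f.read(buf_size)
    return lines


def line_count(filename):
    wc = shutil.which("wc")  # None when wc is not on PATH (e.g. plain Windows)
    if wc is None:
        return count_newlines_python(filename)
    # Passing a list (no shell=True) avoids shell quoting issues.
    output = subprocess.check_output([wc, "-l", filename])
    return int(output.split()[0])
```

Both branches count newline characters, so they should agree on the same file.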


66

After a perfplot analysis, one has to recommend the buffered-read solution:

def buf_count_newlines_gen(fname):
    def _make_gen(reader):
        while True:
            b = reader(2 ** 16)
            if not b: break
            yield b

    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count

It's fast and memory-efficient. Most other solutions are about 20 times slower.

Code to reproduce the plot:

import mmap
import subprocess
from functools import partial

import perfplot


def setup(n):
    fname = "t.txt"
    with open(fname, "w") as f:
        for i in range(n):
            f.write(str(i) + "\n")
    return fname


def for_enumerate(fname):
    i = 0
    with open(fname) as f:
        for i, _ in enumerate(f):
            pass
    return i + 1


def sum1(fname):
    return sum(1 for _ in open(fname))


def mmap_count(fname):
    with open(fname, "r+") as f:
        buf = mmap.mmap(f.fileno(), 0)

    lines = 0
    while buf.readline():
        lines += 1
    return lines


def for_open(fname):
    lines = 0
    for _ in open(fname):
        lines += 1
    return lines


def buf_count_newlines(fname):
    lines = 0
    buf_size = 2 ** 16
    with open(fname) as f:
        buf = f.read(buf_size)
        while buf:
            lines += buf.count("\n")
            buf = f.read(buf_size)
    return lines


def buf_count_newlines_gen(fname):
    def _make_gen(reader):
        b = reader(2 ** 16)
        while b:
            yield b
            b = reader(2 ** 16)

    with open(fname, "rb") as f:
        count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
    return count


def wc_l(fname):
    return int(subprocess.check_output(["wc", "-l", fname]).split()[0])


def sum_partial(fname):
    with open(fname) as f:
        count = sum(x.count("\n") for x in iter(partial(f.read, 2 ** 16), ""))
    return count


def read_count(fname):
    return open(fname).read().count("\n")


b = perfplot.bench(
    setup=setup,
    kernels=[
        for_enumerate,
        sum1,
        mmap_count,
        for_open,
        wc_l,
        buf_count_newlines,
        buf_count_newlines_gen,
        sum_partial,
        read_count,
    ],
    n_range=[2 ** k for k in range(27)],
    xlabel="num lines",
)
b.save("out.png")
b.show()

- Nico Schlömer

1

My files have very long lines; I think the buffer should be allocated only once, using readinto. - fuzzyTew
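The readinto idea from the comment above could be sketched like this. It assumes Python 3, and readinto_count is an illustrative name; the buffer is allocated once and refilled on every read instead of creating a new bytes object per chunk.

```python
def readinto_count(filename, buf_size=1024 * 1024):
    buf = bytearray(buf_size)  # allocated once and reused for every read
    lines = 0
    with open(filename, "rb", buffering=0) as f:
        while True:
            n = f.readinto(buf)  # number of bytes actually written into buf
            if not n:
                break
            # Only the first n bytes are valid; the rest is stale data
            # left over from an earlier read.
            lines += buf.count(b"\n", 0, n)
    return lines
```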

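Great graph, and thanks for providing the code. But actually, this ignores the case where lines are longer than 10 characters. For long lines, mmap tends to be more efficient than buf_count_newlines_gen: see the answer https://dev59.com/X3RA5IYBdhLWcg3wvQlh#76197308. - Jean-Francois T.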

49

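Here is a Python program that uses the multiprocessing library to distribute the line counting across machines/cores. My test improved counting a 20-million-line file from 26 seconds to 7 seconds using an 8-core 64-bit Windows server. Note: not using memory mapping makes things much slower.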

import multiprocessing, sys, time, os, mmap
import logging, logging.handlers

def init_logger(pid):
    console_format = 'P{0} %(levelname)s %(message)s'.format(pid)
    logger = logging.getLogger()  # New logger at root level
    logger.setLevel(logging.INFO)
    logger.handlers.append(logging.StreamHandler())
    logger.handlers[0].setFormatter(logging.Formatter(console_format, '%d/%m/%y %H:%M:%S'))

def getFileLineCount(queues, pid, processes, file1):
    init_logger(pid)
    logging.info('start')

    physical_file = open(file1, "r")
    #  mmap.mmap(fileno, length[, tagname[, access[, offset]]]

    m1 = mmap.mmap(physical_file.fileno(), 0, access=mmap.ACCESS_READ)

    # Work out file size to divide up line counting

    fSize = os.stat(file1).st_size
    chunk = (fSize / processes) + 1

    lines = 0

    # Get where I start and stop
    _seedStart = chunk * (pid)
    _seekEnd = chunk * (pid+1)
    seekStart = int(_seedStart)
    seekEnd = int(_seekEnd)

    if seekEnd < int(_seekEnd + 1):
        seekEnd += 1

    if _seedStart < int(seekStart + 1):
        seekStart += 1

    if seekEnd > fSize:
        seekEnd = fSize

    # Find where to start
    if pid > 0:
        m1.seek(seekStart)
        # Read next line
        l1 = m1.readline()  # Need to use readline with memory mapped files
        seekStart = m1.tell()

    # Tell previous rank my seek start to make their seek end

    if pid > 0:
        queues[pid-1].put(seekStart)
    if pid < processes-1:
        seekEnd = queues[pid].get()

    m1.seek(seekStart)
    l1 = m1.readline()

    while len(l1) > 0:
        lines += 1
        l1 = m1.readline()
        if m1.tell() > seekEnd or len(l1) == 0:
            break

    logging.info('done')
    # Add up the results
    if pid == 0:
        for p in range(1, processes):
            lines += queues[0].get()
        queues[0].put(lines) # The total lines counted
    else:
        queues[0].put(lines)

    m1.close()
    physical_file.close()

if __name__ == '__main__':
    init_logger('main')
    if len(sys.argv) > 1:
        file_name = sys.argv[1]
    else:
        logging.fatal('parameters required: file-name [processes]')
        exit()

    t = time.time()
    processes = multiprocessing.cpu_count()
    if len(sys.argv) > 2:
        processes = int(sys.argv[2])
    queues = [] # A queue for each process
    for pid in range(processes):
        queues.append(multiprocessing.Queue())
    jobs = []
    prev_pipe = 0
    for pid in range(processes):
        p = multiprocessing.Process(target = getFileLineCount, args=(queues, pid, processes, file_name,))
        p.start()
        jobs.append(p)

    jobs[0].join() # Wait for counting to finish
    lines = queues[0].get()

    logging.info('finished {} Lines:{}'.format( time.time() - t, lines))

- Martlark

How does this work with files much bigger than main memory? For instance, a 20 GB file on a system with 4 GB RAM and 2 cores. - Brian Minton

Hard to test now, but I presume it would page the file in and out. - Martlark

6

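This is pretty neat code. I was surprised to find that it is faster to use multiple processors; I figured that the I/O would be the bottleneck. In older Python versions, line 21 needs int(), like chunk = int((fSize / processes)) + 1. - Karl Henselin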

Does it load the whole file into memory? What about a file that is bigger than the RAM on the computer? - pelos

1

Do you mind if I format the answer with Black? https://black.vercel.app/ - Martin Thoma


47

A one-line Bash solution similar to this answer, using the modern subprocess.check_output function:

import subprocess

def line_count(filename):
    # wc prints "<count> <filename>"; the first field is the line count
    return int(subprocess.check_output(['wc', '-l', filename]).split()[0])

- 1''

4

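This answer deserves more upvotes from Linux/Unix users. Although most people prefer a cross-platform solution, this is a superb approach on Linux/Unix. For a 184-million-line CSV file that I need to sample data from, it gave the best runtime. Other pure-Python solutions took 100+ seconds on average, whereas the subprocess call to wc -l took about 5 seconds. - Shan Dou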

shell=True is bad for security; it is better to avoid it. - Alexey Vazhnov

18

I would use Python's file object method readlines, as follows:

with open(input_file) as foo:
    lines = len(foo.readlines())

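This opens the file, creates a list containing every line of the file, takes the length of that list, saves it to a variable, and then closes the file again.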

- Daniel Lee

9

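While this is one of the first approaches that comes to mind, it probably isn't very memory-efficient, especially when counting lines in files of up to 10 GB (as I do), which is a noteworthy disadvantage. - Steen Schütt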

@TimeSheep Is this an issue for files with many (say, billions of) small lines, or for files with extremely long lines (say, gigabytes per line)? - robert

The reason I ask is that it would seem the compiler should be able to optimize this by not creating an intermediate list. - robert

According to the Python docs, xreadlines has been deprecated since version 2.3, as it just returns an iterator. for line in file is the stated replacement. See: https://docs.python.org/2/library/stdtypes.html#file.xreadlines - Kumba

13

This is the fastest thing I have found using pure Python.

You can use whatever amount of memory you want by setting buffer, though 2**16 appears to be a sweet spot on my computer.

from functools import partial

buffer=2**16
with open(myfile) as f:
    # iter(callable, sentinel) calls f.read(buffer) until it returns ''
    print(sum(x.count('\n') for x in iter(partial(f.read, buffer), '')))

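I found the answer in the question "Why is reading lines from stdin much slower in C++ than Python?" and tweaked it just a little. It is a very good read for understanding how to count lines quickly, though wc -l is still about 75% faster than anything else.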

- jeffpkamp


- Kyle · Accepted Answer

A one-liner, faster than the OP's for loop (although not the fastest) and very concise:

num_lines = sum(1 for _ in open('myfile.txt'))

You can also boost the speed (and robustness) by using rbU mode and wrapping it in a with block so that the file gets closed:

with open("myfile.txt", "rbU") as f:
    num_lines = sum(1 for _ in f)

Note: the U in rbU mode has been deprecated since Python 3.3, so we should use rb instead of rbU (the U mode was removed in Python 3.11).
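Following that note, the Python 3 form with plain rb mode would be something like the sketch below; the wrapper function count_lines_rb is only there to make the snippet self-contained.

```python
def count_lines_rb(filename):
    # Binary mode skips text decoding; iterating the file still yields
    # one chunk per b"\n"-terminated line (plus a final unterminated one).
    with open(filename, "rb") as f:
        return sum(1 for _ in f)
```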