There are already many answers with timing comparisons, but I believe they only measure performance against the number of lines (e.g., the excellent chart by Nico Schlömer: https://dev59.com/X3RA5IYBdhLWcg3wvQlh#68385697).
To measure performance accurately, we should take into account:
- the number of lines
- the average line length
- the total size of the file (which may affect memory)
First of all, the OP's function (with a for loop) and sum(1 for line in f) do not perform well. Good options are based on mmap or on a read buffer.
To summarize, based on my analysis (Python 3.9 on Windows, with an SSD):

- For big files with relatively short lines (no more than 100 characters): use the buffered function buf_count_newlines_gen:
      def buf_count_newlines_gen(fname: str) -> int:
          """Count the number of lines in a file"""
          def _make_gen(reader):
              # Read the file in 1 MiB chunks and yield each chunk
              b = reader(1024 * 1024)
              while b:
                  yield b
                  b = reader(1024 * 1024)

          with open(fname, "rb") as f:
              # f.raw.read bypasses Python's buffering layer
              count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
          return count
- For files that may have longer lines (up to 2000 characters), regardless of the number of lines: use the mmap-based function count_nb_lines_mmap:
      def count_nb_lines_mmap(file: Path) -> int:
          """Count the number of lines in a file"""
          with open(file, mode="rb") as f:
              mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
              nb_lines = 0
              while mm.readline():
                  nb_lines += 1
              mm.close()
          return nb_lines
- For a short function with very good performance (especially well suited to medium-sized files):

      def itercount(filename: str) -> int:
          """Count the number of lines in a file"""
          # Note: the original used mode 'rbU'; the 'U' flag is deprecated
          # and was removed in Python 3.11, so plain 'rb' is used here.
          with open(filename, 'rb') as f:
              return sum(1 for _ in f)
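One caveat worth noting (my addition, not part of the original benchmark): functions that count b"\n" occurrences report one line fewer than line-iteration approaches when the file does not end with a trailing newline. A minimal sketch illustrating the difference, using simplified stand-ins for the two counting styles:

```python
import os
import tempfile


def count_newline_bytes(fname: str) -> int:
    """Count b'\n' occurrences, like buf_count_newlines_gen does."""
    with open(fname, "rb") as f:
        return sum(buf.count(b"\n") for buf in iter(lambda: f.read(1024 * 1024), b""))


def count_line_iter(fname: str) -> int:
    """Count lines by iterating over the file, like itercount does."""
    with open(fname, "rb") as f:
        return sum(1 for _ in f)


# A file whose last line has no trailing newline
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"a\nb\nc")  # 3 lines, but only 2 newline bytes
    path = tmp.name

try:
    print(count_newline_bytes(path))  # 2
    print(count_line_iter(path))      # 3
finally:
    os.remove(path)
```

Both conventions are defensible; just be aware of which one a given function implements before comparing results.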
Here is a summary of the different metrics (mean time measured with timeit, 7 runs of 10 loops each):
| Function | Small file, short lines | Small file, long lines | Big file, short lines | Big file, long lines | Bigger file, short lines |
|---|---|---|---|---|---|
| ... size ... | 0.04 MB | 1.16 MB | 17 MB | 318 MB | 328 MB |
| ... lines ... | 915 lines < 100 chars | 915 lines < 2000 chars | 389,000 lines < 100 chars | 389,000 lines < 2000 chars | 9.8 million lines < 100 chars |
| count_nb_lines_blocks | 0.183 ms | 1.718 ms | 36.799 ms | 415.393 ms | 517.920 ms |
| count_nb_lines_mmap | 0.185 ms | 0.582 ms | 44.801 ms | 185.461 ms | 691.637 ms |
| buf_count_newlines_gen | 0.665 ms | 1.032 ms | 15.620 ms | 213.458 ms | 318.939 ms |
| itercount | 0.135 ms | 0.817 ms | 31.292 ms | 223.120 ms | 628.760 ms |
Note: I also compared count_nb_lines_mmap and buf_count_newlines_gen on an 8 GB file with 9.7 million lines of more than 800 characters. buf_count_newlines_gen took 5.39 s on average, while count_nb_lines_mmap took 4.2 s, so the latter is indeed better for files with longer lines.
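As a side note (my addition, not benchmarked in the original answer), mmap also lets you count newline bytes directly with repeated find() calls instead of readline(), which avoids materializing each line; the same trailing-newline caveat as for the buffer-based functions applies. A sketch, with a hypothetical function name:

```python
import mmap
import os


def count_nb_lines_mmap_find(file) -> int:
    """Count newline bytes in a memory-mapped file using find()."""
    if os.path.getsize(file) == 0:
        return 0  # mmap cannot map an empty file
    with open(file, mode="rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            count = 0
            pos = mm.find(b"\n")
            while pos != -1:
                count += 1
                pos = mm.find(b"\n", pos + 1)
            return count
```

Whether this beats the readline() loop would need measuring on your own data; it counts newlines rather than iterated lines, so a file without a trailing newline yields one less than count_nb_lines_mmap.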
Here is the code I used:
    import mmap
    from pathlib import Path
    from statistics import mean
    from timeit import Timer


    def count_nb_lines_blocks(file: Path) -> int:
        """Count the number of lines in a file"""
        def blocks(files, size=65536):
            while True:
                b = files.read(size)
                if not b:
                    break
                yield b

        with open(file, encoding="utf-8", errors="ignore") as f:
            return sum(bl.count("\n") for bl in blocks(f))


    def count_nb_lines_mmap(file: Path) -> int:
        """Count the number of lines in a file"""
        with open(file, mode="rb") as f:
            mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
            nb_lines = 0
            while mm.readline():
                nb_lines += 1
            mm.close()
        return nb_lines


    def count_nb_lines_sum(file: Path) -> int:
        """Count the number of lines in a file"""
        with open(file, "r", encoding="utf-8", errors="ignore") as f:
            return sum(1 for line in f)


    def count_nb_lines_for(file: Path) -> int:
        """Count the number of lines in a file"""
        i = 0
        with open(file) as f:
            for i, _ in enumerate(f, start=1):
                pass
        return i


    def buf_count_newlines_gen(fname: str) -> int:
        """Count the number of lines in a file"""
        def _make_gen(reader):
            b = reader(1024 * 1024)
            while b:
                yield b
                b = reader(1024 * 1024)

        with open(fname, "rb") as f:
            count = sum(buf.count(b"\n") for buf in _make_gen(f.raw.read))
        return count


    def itercount(filename: str) -> int:
        """Count the number of lines in a file"""
        # 'rb' instead of the original 'rbU': the 'U' flag was removed in Python 3.11
        with open(filename, 'rb') as f:
            return sum(1 for _ in f)


    # small_file, big_file, etc. are Path objects pointing to the test files
    files = [small_file, big_file, small_file_shorter, big_file_shorter,
             small_file_shorter_sim_size, big_file_shorter_sim_size]
    for file in files:
        print(f"File: {file.name} (size: {file.stat().st_size / 1024 ** 2:.2f} MB)")
        for func in [
            count_nb_lines_blocks,
            count_nb_lines_mmap,
            count_nb_lines_sum,
            count_nb_lines_for,
            buf_count_newlines_gen,
            itercount,
        ]:
            result = func(file)
            time = Timer(lambda: func(file)).repeat(7, 10)
            print(f"  * {func.__name__}: {result} lines in {mean(time) / 10 * 1000:.3f} ms")
        print()
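The small_file, big_file, etc. variables are not defined in the snippet above. A hypothetical sketch of how comparable .ndjson test files could be generated (the names, sizes, and record shape are my assumptions, not the author's exact data):

```python
import json
import random
import string
from pathlib import Path


def make_ndjson(path: Path, nb_lines: int, line_len: int) -> Path:
    """Write an .ndjson file with nb_lines records of roughly line_len characters each."""
    rng = random.Random(0)  # fixed seed so the files are reproducible
    with open(path, "w", encoding="utf-8") as f:
        for i in range(nb_lines):
            payload = "".join(rng.choices(string.ascii_letters, k=line_len))
            f.write(json.dumps({"id": i, "text": payload}) + "\n")
    return path


# Roughly matching the "small file" shapes from the benchmark above
small_file = make_ndjson(Path("small_file.ndjson"), nb_lines=915, line_len=1200)
small_file_shorter = make_ndjson(Path("small_file__shorter.ndjson"), nb_lines=915, line_len=20)
```

The big files would be generated the same way with larger nb_lines; exact sizes depend on the JSON overhead per record.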
File: small_file.ndjson (size: 1.16 MB)
* count_nb_lines_blocks: 915 lines in 1.718 ms
* count_nb_lines_mmap: 915 lines in 0.582 ms
* count_nb_lines_sum: 915 lines in 1.993 ms
* count_nb_lines_for: 915 lines in 3.876 ms
* buf_count_newlines_gen: 915 lines in 1.032 ms
* itercount: 915 lines in 0.817 ms
File: big_file.ndjson (size: 317.99 MB)
* count_nb_lines_blocks: 389000 lines in 415.393 ms
* count_nb_lines_mmap: 389000 lines in 185.461 ms
* count_nb_lines_sum: 389000 lines in 485.370 ms
* count_nb_lines_for: 389000 lines in 967.075 ms
* buf_count_newlines_gen: 389000 lines in 213.458 ms
* itercount: 389000 lines in 223.120 ms
File: small_file__shorter.ndjson (size: 0.04 MB)
* count_nb_lines_blocks: 915 lines in 0.183 ms
* count_nb_lines_mmap: 915 lines in 0.185 ms
* count_nb_lines_sum: 915 lines in 0.251 ms
* count_nb_lines_for: 915 lines in 0.244 ms
* buf_count_newlines_gen: 915 lines in 0.665 ms
* itercount: 915 lines in 0.135 ms
File: big_file__shorter.ndjson (size: 17.42 MB)
* count_nb_lines_blocks: 389000 lines in 36.799 ms
* count_nb_lines_mmap: 389000 lines in 44.801 ms
* count_nb_lines_sum: 389000 lines in 59.068 ms
* count_nb_lines_for: 389000 lines in 81.387 ms
* buf_count_newlines_gen: 389000 lines in 15.620 ms
* itercount: 389000 lines in 31.292 ms
File: small_file__shorter_sim_size.ndjson (size: 1.21 MB)
* count_nb_lines_blocks: 36457 lines in 1.920 ms
* count_nb_lines_mmap: 36457 lines in 2.615 ms
* count_nb_lines_sum: 36457 lines in 3.993 ms
* count_nb_lines_for: 36457 lines in 6.011 ms
* buf_count_newlines_gen: 36457 lines in 1.363 ms
* itercount: 36457 lines in 2.147 ms
File: big_file__shorter_sim_size.ndjson (size: 328.19 MB)
* count_nb_lines_blocks: 9834248 lines in 517.920 ms
* count_nb_lines_mmap: 9834248 lines in 691.637 ms
* count_nb_lines_sum: 9834248 lines in 1109.669 ms
* count_nb_lines_for: 9834248 lines in 1683.859 ms
* buf_count_newlines_gen: 9834248 lines in 318.939 ms
* itercount: 9834248 lines in 628.760 ms