I have a very large 4 GB file, and when I try to read it my computer hangs. So I want to read it chunk by chunk, process each chunk, and store the processed chunk in another file before reading the next one.
Is there a way to yield these chunks?
I would love to have a lazy method for this.
To write a lazy function, just use yield:

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data


with open('really_big_file.dat') as f:
    for piece in read_in_chunks(f):
        process_data(piece)
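To cover the second half of the question (storing each processed chunk in another file before reading the next), a minimal sketch might look like the following; process_data here is a hypothetical placeholder for whatever processing is needed:

# Hypothetical sketch: read lazily, process each chunk, and append it to an
# output file before the next chunk is read.
def process_data(chunk):
    return chunk.upper()  # placeholder transformation

with open('really_big_file.dat') as f_in, open('processed_file.dat', 'w') as f_out:
    for piece in read_in_chunks(f_in):
        f_out.write(process_data(piece))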
Another option would be to use iter and a helper function:

f = open('really_big_file.dat')

def read1k():
    return f.read(1024)

for piece in iter(read1k, ''):
    process_data(piece)
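One caveat (my addition, not part of the original answer): if the file is opened in binary mode, read() returns bytes, so the sentinel passed to iter() has to be b'' instead of '':

# Hedged variant: the same two-argument iter() pattern in binary mode,
# where end-of-file is signalled by an empty bytes object b''.
with open('really_big_file.dat', 'rb') as f:
    for piece in iter(lambda: f.read(1024), b''):
        process_data(piece)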
If the file is line-based, the file object is already a lazy generator of lines:

for line in open('really_big_file.dat'):
    process_data(line)

The file.readlines() method accepts an optional size argument that roughly limits how much is read at once, so it approximates the number of lines returned.
bigfile = open('bigfilename', 'r')
tmp_lines = bigfile.readlines(BUF_SIZE)
while tmp_lines:
    process([line for line in tmp_lines])
    tmp_lines = bigfile.readlines(BUF_SIZE)
... it would be better to use .read() instead of .readlines(); if the file is binary it won't have line breaks. - Myers Carpenter
This reads the file in BUF_SIZE chunks and then needlessly splits those BUF_SIZE chunks into a list. Why not just use file.readline(BUF_SIZE)? (Admittedly that is also ugly, just not as ugly...) - JamesTheAwesomeDude

There are already many good answers here, but if your entire file is on a single line and you still want to process "rows" (as opposed to fixed-size chunks), those answers won't help you.
99% of the time it is possible to process a file line by line. Then, as suggested in this answer, you can use the file object itself as a lazy generator:

with open('big.csv') as f:
    for line in f:
        process(line)
However, you may run into very large files whose row separator is not '\n' (a common case is '|').
Converting '|' to '\n' before processing may not be an option either, because it could mess up fields that legitimately contain '\n' (e.g. free-text user input). For this case I created the following snippet [updated in May 2021 for Python 3.8+]:
def rows(f, chunksize=1024, sep='|'):
    """
    Read a file where the row separator is '|' lazily.

    Usage:

    >>> with open('big.csv') as f:
    >>>     for r in rows(f):
    >>>         process(r)
    """
    row = ''
    while (chunk := f.read(chunksize)) != '':    # stop at end of file
        while (i := chunk.find(sep)) != -1:      # emit rows while the chunk still contains a separator
            yield row + chunk[:i]
            chunk = chunk[i+1:]
            row = ''
        row += chunk
    yield row
[For older versions of Python]:
def rows(f, chunksize=1024, sep='|'):
    """
    Read a file where the row separator is '|' lazily.

    Usage:

    >>> with open('big.csv') as f:
    >>>     for r in rows(f):
    >>>         process(r)
    """
    curr_row = ''
    while True:
        chunk = f.read(chunksize)
        if chunk == '':  # End of file
            yield curr_row
            break
        while True:
            i = chunk.find(sep)
            if i == -1:
                break
            yield curr_row + chunk[:i]
            curr_row = ''
            chunk = chunk[i+1:]
        curr_row += chunk
I have used it successfully to solve a variety of problems. It has been extensively tested with different chunk sizes. Here is the test suite I used, for anyone who needs to convince themselves:
import os

test_file = 'test_file'


def cleanup(func):
    def wrapper(*args, **kwargs):
        func(*args, **kwargs)
        os.unlink(test_file)
    return wrapper


@cleanup
def test_empty(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1


@cleanup
def test_1_char_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2


@cleanup
def test_1_char(chunksize=1024):
    with open(test_file, 'w') as f:
        f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1


@cleanup
def test_1025_chars_1_row(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1025):
            f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1


@cleanup
def test_1024_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1023):
            f.write('a')
        f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2


@cleanup
def test_1025_chars_1026_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1025):
            f.write('|')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 1026


@cleanup
def test_2048_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1022):
            f.write('a')
        f.write('|')
        f.write('a')
        # -- end of 1st chunk --
        for i in range(1024):
            f.write('a')
        # -- end of 2nd chunk
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2


@cleanup
def test_2049_chars_2_rows(chunksize=1024):
    with open(test_file, 'w') as f:
        for i in range(1022):
            f.write('a')
        f.write('|')
        f.write('a')
        # -- end of 1st chunk --
        for i in range(1024):
            f.write('a')
        # -- end of 2nd chunk
        f.write('a')
    with open(test_file) as f:
        assert len(list(rows(f, chunksize=chunksize))) == 2


if __name__ == '__main__':
    for chunksize in [1, 2, 4, 8, 16, 32, 64, 128, 256, 512, 1024]:
        test_empty(chunksize)
        test_1_char_2_rows(chunksize)
        test_1_char(chunksize)
        test_1025_chars_1_row(chunksize)
        test_1024_chars_2_rows(chunksize)
        test_1025_chars_1026_rows(chunksize)
        test_2048_chars_2_rows(chunksize)
        test_2049_chars_2_rows(chunksize)
Another approach is mmap, which maps the file into memory and lets you access it through both file-like methods and slice notation (shown here in Python 3, where a mapped file yields bytes):

import mmap

with open("hello.txt", "r+b") as f:
    # memory-map the file; size 0 means the whole file
    mm = mmap.mmap(f.fileno(), 0)
    # read content via standard file methods
    print(mm.readline())  # prints b"Hello Python!"
    # read content via slice notation
    print(mm[:5])         # prints b"Hello"
    # update content using slice notation;
    # note that the new content must have the same size
    mm[6:] = b" world!\n"
    # ... and read again using standard file methods
    mm.seek(0)
    print(mm.readline())  # prints b"Hello world!"
    # close the map
    mm.close()
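Tying this back to the question, here is a hedged sketch (my addition, not part of the answer) of walking a large memory-mapped file in fixed-size slices; process_data is assumed from the question:

import mmap

# Hypothetical sketch: iterate over a memory-mapped file in fixed-size slices
# without reading the whole file into memory at once.
def process_data(piece):
    pass  # placeholder for whatever processing the question has in mind

with open('really_big_file.dat', 'rb') as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    chunk_size = 1024 * 1024
    for offset in range(0, len(mm), chunk_size):
        process_data(mm[offset:offset + chunk_size])
    mm.close()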
while data := f.readlines() is very different from the code in the answer you are commenting on; it uses a different function. readlines() reads the whole file, so calling it repeatedly in a while loop is almost certainly a mistake. - user3064538

f = ...  # file-like object, i.e. supporting a read(size) function and
         # returning an empty string '' when there is nothing left to read
def chunked(file, chunk_size):
    return iter(lambda: file.read(chunk_size), '')

for data in chunked(f, 65536):
    ...  # process the data
See the official Python documentation: https://docs.python.org/3/library/functions.html#iter.
Perhaps this approach is more Pythonic:

"""A file object returned by open() is an iterator; its read method
lets you specify the block size for each read.
"""
from functools import partial

with open('mydata.db', 'r') as f_in:
    block_read = partial(f_in.read, 1024 * 1024)
    block_iterator = iter(block_read, '')

    for index, block in enumerate(block_iterator, start=1):
        block = process_block(block)  # process your block data

        with open(f'{index}.txt', 'w') as f_out:
            f_out.write(block)
for pkt in iter(partial(vid.read, PACKET_SIZE), b""): - Leroy Scandal

def read_file(path, block_size=1024):
    with open(path, 'rb') as f:
        while True:
            piece = f.read(block_size)
            if piece:
                yield piece
            else:
                return

for piece in read_file(path):
    process_piece(piece)
My low reputation doesn't allow me to comment, but SilentGhost's solution would be much easier using file.readlines([sizehint]).

Edit: SilentGhost is right, but this should still be better than:

s = ""
for i in xrange(100):
    s += file.next()
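For completeness, a minimal sketch (my own illustration) of the readlines(sizehint) approach this answer refers to; process is a hypothetical placeholder:

# Hypothetical sketch: readlines() with a sizehint returns roughly that many
# bytes' worth of whole lines per call, so the loop stays lazy.
def process(lines):
    pass  # placeholder

with open('really_big_file.dat') as bigfile:
    while True:
        lines = bigfile.readlines(100 * 1024)  # about 100 KB of lines per batch
        if not lines:
            break
        process(lines)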
I'm in a somewhat similar situation. It's not clear whether you know the chunk size in bytes; I usually don't, but the number of records (lines) required is known:

def get_line():
    with open('4gb_file') as file:
        for i in file:
            yield i

lines_required = 100
gen = get_line()
chunk = [i for i, j in zip(gen, range(lines_required))]

Update: Thanks nosklo. Here's what I meant. It almost works, except that it loses a line "between" chunks.

chunk = [next(gen) for i in range(lines_required)]

This does the trick without losing any lines, but it doesn't look very nice.
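A tidier alternative (my suggestion, not part of the original answer) is itertools.islice, which pulls the next lines_required lines from the generator without dropping anything between batches:

from itertools import islice

# Hedged sketch: consume the generator in batches of lines_required lines
# until it is exhausted.
gen = get_line()
while True:
    chunk = list(islice(gen, lines_required))
    if not chunk:
        break
    process(chunk)  # process is a hypothetical placeholder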
... the 'rb' parameter is missing; so is a file.close() statement (the same effect can be had with with open('really_big_file.dat', 'rb') as f:); click here for another concise implementation. - cod3monk3y
'rb' is not missing. - jfs
... if the 'b' parameter is left out, the data will very likely end up corrupted. Per the official docs: "On Windows, Python distinguishes between text and binary files; [...] it'll corrupt binary data like that in JPEG or EXE files. Be very careful to use binary mode when reading and writing such files." - cod3monk3y
buf_iter = (x for x in iter(lambda: buf.read(1024), '')). Then loop over the chunks with for chunk in buf_iter:. - berto