I have a very large 4GB file, and when I try to read it all at once my computer hangs. So I want to read it piece by piece: process each piece, store the processed piece in another file, and only then read the next piece.

Is there a way to use yield to read the file in chunks?

I would like a lazy approach.
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data
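A minimal usage sketch of the generator above, matching the question's goal of writing each processed chunk to another file before reading the next. The file names and the upper-casing "processing" step are placeholders, not from the original:

```python
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

# create a small demo input so the example is self-contained
with open('big_file.txt', 'w') as f:
    f.write('hello world\n' * 500)

# each chunk is processed and written out before the next one is read,
# so memory use stays bounded by chunk_size
with open('big_file.txt') as src, open('processed.txt', 'w') as dst:
    for piece in read_in_chunks(src):
        dst.write(piece.upper())  # placeholder processing step
```

Because only one chunk is in memory at a time, this works the same for a 4GB input as for the small demo file.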
-- Added in an answer --
When I read a file in chunks (writing the result to a text file named split.txt), the problem I ran into was a use case where the data has to be processed line by line. Because I was reading the file in fixed-size chunks, a chunk sometimes ends with an incomplete line, which broke my code (it expects complete lines).

After reading around a bit, I realized I can solve this by keeping track of the last part of each chunk: if the chunk ends with a '\n', it ends on a complete line; otherwise I store the incomplete last line in a variable so I can prepend it to the first (incomplete) line of the next chunk. Doing this solved the problem for me.

Sample code:
# in this function I am reading the file in chunks
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

# file where I am writing my final output
write_file = open('split.txt', 'w')
# variable I am using to store the last partial line from a chunk
placeholder = ''
file_count = 1
try:
    with open('/Users/rahulkumarmandal/Desktop/combined.txt') as f:
        for piece in read_in_chunks(f):
            line_by_line = piece.split('\n')
            for one_line in line_by_line:
                # a non-empty placeholder means the previous chunk ended in a
                # partial line that we need to prepend to the current one
                if placeholder:
                    # concatenating the previous partial line with the current one
                    one_line = placeholder + one_line
                    # then reset the placeholder so the next partial line
                    # can be stored there and concatenated in turn
                    placeholder = ''
                # further logic that revolves around my specific use case
                segregated_data = one_line.split('~')
                if len(segregated_data) < 18:
                    # fewer fields than expected: this line is incomplete,
                    # so keep it for the next chunk
                    placeholder = one_line
                    continue
                else:
                    placeholder = ''
                if segregated_data[2] == '2020' and segregated_data[3] == '2021':
                    # write this
                    data = "~".join(segregated_data)
                    write_file.write(data)
                    write_file.write('\n')
                    print(write_file.tell())
                elif segregated_data[2] == '2021' and segregated_data[3] == '2022':
                    # write this
                    data = "-".join(segregated_data)
                    write_file.write(data)
                    write_file.write('\n')
                    print(write_file.tell())
except Exception as e:
    print('error is', e)
finally:
    write_file.close()
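The carry-the-tail idea described above can also be written without the field-count heuristic: keep only the text after the last '\n' of each chunk and prepend it to the next chunk. A minimal sketch (the function name and the in-memory demo data are placeholders, not from the original answer):

```python
from io import StringIO

def iter_complete_lines(file_object, chunk_size=1024):
    """Yield only complete lines from a file read in fixed-size chunks,
    carrying any partial trailing line over into the next chunk."""
    tail = ''  # incomplete line left over from the previous chunk
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            break
        # everything before the last '\n' is complete; the rest is carried
        complete, sep, tail = (tail + chunk).rpartition('\n')
        if sep:
            yield from complete.split('\n')
    if tail:
        yield tail  # last line of the file had no trailing newline

# small in-memory demo; a real file object works the same way
demo = StringIO('a~b~2020~2021\npartial~line~2021~2022\nno~newline~at~end')
lines = list(iter_complete_lines(demo, chunk_size=8))
```

Each yielded string is guaranteed to be a whole line, so the per-line processing logic never sees a line that was cut in half by a chunk boundary.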
You can use the following code.

    import os

    file_obj = open('big_file')

open() returns a file object; then use os.stat to get the file size in bytes:

    file_size = os.stat('big_file').st_size

    # integer-divide, and read one extra chunk to pick up any
    # remainder smaller than 1024 bytes
    for i in range(file_size // 1024 + 1):
        print(file_obj.read(1024))

    file_obj.close()