Lazy method for reading a big file in Python?

Question

python file-io generator

378

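I have a very large 4 GB file, and my computer hangs when I try to read it. So I want to read it piece by piece and, after processing each piece, store the processed piece in another file before reading the next one.

Is there a way to yield these pieces?

I would love to have a lazy method.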

- david makcenzie

12 Answers

- officialrahulmandal · Answer 1

Update: if you want the results to come back as complete lines, you can also use file_object.readlines. That way no unfinished line appears in the result.

For example:

def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.readlines(chunk_size)
        if not data:
            break
        yield data
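
As a usage sketch of the readlines-based approach (the sample text, the name read_in_line_batches and the hint of 8 are made up for illustration): the argument to readlines is a size hint rather than a hard limit, so each batch holds only whole lines and may run slightly past the hint.

```python
import io

def read_in_line_batches(file_object, size_hint=1024):
    """Yield lists of complete lines; size_hint is only approximate."""
    while True:
        lines = file_object.readlines(size_hint)
        if not lines:
            break
        yield lines

# io.StringIO stands in for a real file here (illustrative data only).
sample = io.StringIO("alpha\nbeta\ngamma\ndelta\n")
batches = list(read_in_line_batches(sample, size_hint=8))

# Flatten the batches back into a single list of lines.
all_lines = [line for batch in batches for line in batch]
for line in all_lines:
    print(repr(line))
```

Every item is a full line including its newline, so the consumer never has to stitch fragments together.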

-- Adding on to the answer --

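When I was reading a text file in chunks (say, a file named split.txt), I ran into a problem. My use case processes the data line by line, and because the file is read in chunks, a chunk sometimes ends with a partial line, which broke my code (it expects complete lines).

After reading around, I learned that I can overcome this by keeping track of the last bit of each chunk. If the chunk ends with a \n, it contains only complete lines. Otherwise I store the partial last line in a variable and concatenate it with the rest of that line, which arrives at the start of the next chunk. With this I was able to get past the issue.

Sample code: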

# in this function I am reading the file in chunks
def read_in_chunks(file_object, chunk_size=1024):
    """Lazy function (generator) to read a file piece by piece.
    Default chunk size: 1k."""
    while True:
        data = file_object.read(chunk_size)
        if not data:
            break
        yield data

# file where I am writing my final output
write_file=open('split.txt','w')

# variable I am using to store the last partial line from the chunk
placeholder= ''
file_count=1

try:
    with open('/Users/rahulkumarmandal/Desktop/combined.txt') as f:
        for piece in read_in_chunks(f):
            #print('---->>>',piece,'<<<--')
            line_by_line = piece.split('\n')

            for one_line in line_by_line:
                # if a placeholder exists, the previous line was partial and must be joined with the current one
                if placeholder:
                    # print('----->',placeholder)
                    # concatenating the previous partial line with the current one
                    one_line=placeholder+one_line
                    # then reset the placeholder so that a later partial line can be stored in it
                    placeholder=''

                # further logic that revolves around my specific use case
                segregated_data= one_line.split('~')
                #print(len(segregated_data),type(segregated_data), one_line)
                if len(segregated_data) < 18:
                    placeholder=one_line
                    continue
                else:
                    placeholder=''
                #print('--------',segregated_data)
                if segregated_data[2]=='2020' and segregated_data[3]=='2021':
                    #write this
                    data=str("~".join(segregated_data))
                    #print('data',data)
                    #f.write(data)
                    write_file.write(data)
                    write_file.write('\n')
                    print(write_file.tell())
                elif segregated_data[2]=='2021' and segregated_data[3]=='2022':
                    #write this
                    data=str("-".join(segregated_data))
                    write_file.write(data)
                    write_file.write('\n')
                    print(write_file.tell())
except Exception as e:
    print('error is', e)
finally:
    # close the output file so that buffered data is flushed to disk
    write_file.close()
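
The carry-over idea can also be sketched independently of the 18-field format used above. This is a minimal illustration, not the answerer's code: the name iter_lines_from_chunks is hypothetical, and it assumes a text-mode file whose lines end in \n. The last piece of each chunk is held back because it may be incomplete, then prepended to the next chunk.

```python
import io

def iter_lines_from_chunks(file_object, chunk_size=1024):
    """Yield complete lines (without the newline) from fixed-size chunks."""
    leftover = ''  # partial line carried over from the previous chunk
    while True:
        chunk = file_object.read(chunk_size)
        if not chunk:
            break
        pieces = (leftover + chunk).split('\n')
        # The last piece has no newline after it yet, so it may be
        # incomplete; hold it back until the next chunk arrives.
        leftover = pieces.pop()
        for line in pieces:
            yield line
    if leftover:
        # The file did not end with a newline: emit the final line.
        yield leftover

# Illustrative data; a tiny chunk_size forces lines to straddle chunks.
sample = io.StringIO("one~1\ntwo~2\nthree~3")
lines = list(iter_lines_from_chunks(sample, chunk_size=4))
print(lines)  # ['one~1', 'two~2', 'three~3']
```

Because the generator only ever yields whole lines, the per-line logic no longer needs to guess from the field count whether a line was cut off.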

- Shrikant · Answer 2

You can use the following code.

file_obj = open('big_file')

open() returns a file object.

Then use os.stat to get the file size:

import os
file_size = os.stat('big_file').st_size

# Python 3 syntax; round up so that the final partial chunk is read too
for i in range((file_size + 1023) // 1024):
    print(file_obj.read(1024))
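
An alternative sketch that does not need the file size up front: read() returns an empty value at end of file, so the loop can simply run until that happens. The helper name copy_in_chunks, the process callback and the temporary file paths below are assumptions made for illustration; the write to a second file follows what the question asks for.

```python
import os
import tempfile

def copy_in_chunks(src_path, dst_path, process, chunk_size=1024):
    """Read src_path chunk by chunk, apply process() and write to dst_path."""
    with open(src_path, 'rb') as src, open(dst_path, 'wb') as dst:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:  # empty bytes object means end of file
                break
            dst.write(process(chunk))

# Demo on a small temporary file (paths and data are placeholders).
workdir = tempfile.mkdtemp()
src_path = os.path.join(workdir, 'big_file')
dst_path = os.path.join(workdir, 'processed_file')
with open(src_path, 'wb') as f:
    f.write(b'abc' * 1000)  # 3000 bytes, not a multiple of 1024

copy_in_chunks(src_path, dst_path, process=bytes.upper)
with open(dst_path, 'rb') as f:
    result = f.read()
print(len(result))
```

Only one chunk is held in memory at a time, and both files are closed by the with statement even if process() raises.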