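I need to extract the last line from a number of very large (several hundred megabyte) text files to get certain data. Currently, I am using Python to cycle through all the lines until the file is empty and then I process the last line returned, but I am certain there is a more efficient way to do this.
What is the best way to retrieve just the last line of a text file using Python?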
This is not a direct approach, but it is probably much faster than a naive Python implementation:

    import subprocess

    line = subprocess.check_output(['tail', '-1', filename])
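A slightly fuller sketch of the same idea, assuming a Unix-like system where a tail executable is on the PATH. The function name last_line_via_tail is made up for illustration, and decoding the output as UTF-8 is an assumption about the file contents.

```python
import subprocess

def last_line_via_tail(filename, encoding="utf-8"):
    """Return the last line of filename using the external tail command."""
    # "-n 1" is the POSIX spelling of the older "-1" shorthand.
    raw = subprocess.check_output(["tail", "-n", "1", filename])
    # check_output returns bytes, including the trailing newline if there is one.
    return raw.decode(encoding).rstrip("\r\n")
```

check_output raises subprocess.CalledProcessError if tail exits with a nonzero status, for example when the file does not exist, so callers may want to catch that.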
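[...] tail? - Hojat Modaresi

    with open('output.txt', 'r') as f:
        lines = f.read().splitlines()
        last_line = lines[-1]
        print(last_line)

[...] can be done with "if lines:". - Katu

Use the file's seek method with a negative offset and whence=os.SEEK_END to read a block from the end of the file. Search that block for the last line-end character and take everything after it. If the block contains no line end, back up farther and repeat the process.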
    import os

    def last_line(in_file, block_size=1024, ignore_ending_newline=False):
        # in_file must be opened in binary mode ('rb'); the result is bytes.
        suffix = b""
        in_file.seek(0, os.SEEK_END)
        in_file_length = in_file.tell()
        seek_offset = 0

        while -seek_offset < in_file_length:
            # Read from end.
            seek_offset -= block_size
            if -seek_offset > in_file_length:
                # Limit if we ran out of file (can't seek backward from start).
                block_size -= -seek_offset - in_file_length
                if block_size == 0:
                    break
                seek_offset = -in_file_length
            in_file.seek(seek_offset, os.SEEK_END)
            buf = in_file.read(block_size)

            # Search for line end.
            if ignore_ending_newline and seek_offset == -block_size and buf[-1:] == b'\n':
                buf = buf[:-1]
            pos = buf.rfind(b'\n')
            if pos != -1:
                # Found line end.
                return buf[pos+1:] + suffix

            suffix = buf + suffix

        # One-line file.
        return suffix
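[...] buf (be sure to compare buf[-1:] == b'\n'). If you are sure the file is UTF-8 encoded, you can return a string with suffix.decode('utf-8'). - Multihunter

    def getLastLine(fname, maxLineLength=80):
        # Binary mode is required for a nonzero seek relative to the end of the file.
        with open(fname, "rb") as fp:
            fp.seek(-maxLineLength-1, 2) # 2 means "from the end of the file"
            return fp.readlines()[-1]

This works on my Windows machine, but I don't know what happens on other platforms if you open a text file in binary mode. Binary mode is needed if you want to use seek().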
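To make the binary-mode requirement concrete: in Python 3, a file opened in text mode refuses nonzero seeks relative to the end of the file, while a file opened in binary mode accepts them. The following is a small sketch of that behaviour. The helper name can_seek_from_end and the temporary file are made up for the demonstration.

```python
import io
import os
import tempfile

def can_seek_from_end(path, mode):
    """Report whether a file opened with the given mode allows seek(-1, SEEK_END)."""
    with open(path, mode) as f:
        try:
            f.seek(-1, os.SEEK_END)
            return True
        except io.UnsupportedOperation:
            # Python 3 text mode rejects nonzero end-relative seeks.
            return False

# Demonstration on a throwaway file.
with tempfile.NamedTemporaryFile("wb", suffix=".txt", delete=False) as tmp:
    tmp.write(b"first\nsecond\n")

print(can_seek_from_end(tmp.name, "r"))   # text mode
print(can_seek_from_end(tmp.name, "rb"))  # binary mode
os.remove(tmp.name)
```

Under Python 3 this prints False for text mode and True for binary mode, which is why the seek-based answers here open the file with "rb" and work with bytes.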
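If you can pick a reasonable maximum line length, you can seek to nearly the end of the file before you start reading.

    # myfile must be opened in binary mode for this seek to work.
    myfile.seek(-max_line_length, os.SEEK_END)
    line = myfile.readlines()[-1]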
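One caveat with this approach is that seeking back past the start of the file raises an error, so short files need a guard. A minimal sketch, assuming the file can be opened in binary mode and that no line, terminator included, is longer than max_line_length bytes. The function name last_line_bounded is hypothetical.

```python
import os

def last_line_bounded(path, max_line_length=1024):
    """Return the last line of path as bytes, without its line terminator.

    Assumes no line, terminator included, exceeds max_line_length bytes.
    """
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        size = f.tell()
        # One extra byte so the newline that ends the previous line is included;
        # max(0, ...) keeps us from seeking before the start of a short file.
        f.seek(max(0, size - max_line_length - 1))
        lines = f.read().splitlines()
    return lines[-1] if lines else b""
```

If the assumption about the maximum line length is wrong, the result is silently truncated to the tail of the last line, so this only suits files whose format guarantees a bound.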
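Seek to the end of the file minus about 100 bytes. Do a read and search for a newline. If there is no newline, seek back another 100 bytes or so. Repeat until you find a newline; the last line begins immediately after it.
In the best case you only read 100 bytes.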
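The steps described above could be sketched as follows. This is only an illustration: the function name last_line_stepwise and the step parameter are invented here, the file is read in binary mode with Unix line endings assumed, and a single newline at the very end of the file is ignored. The function also reports how many bytes it read, to make the best-case claim visible.

```python
import os

def last_line_stepwise(path, step=100):
    """Return (last_line, bytes_read), scanning backward in step-sized reads."""
    with open(path, "rb") as f:
        f.seek(0, os.SEEK_END)
        pos = f.tell()          # start of the region already examined
        tail = b""              # pieces of the last line collected so far
        bytes_read = 0
        first = True
        while pos > 0:
            read_size = min(step, pos)  # never seek before the start of the file
            pos -= read_size
            f.seek(pos)
            chunk = f.read(read_size)
            bytes_read += read_size
            if first and chunk.endswith(b"\n"):
                # Skip the newline that terminates the last line itself.
                chunk = chunk[:-1]
            first = False
            newline_at = chunk.rfind(b"\n")
            if newline_at != -1:
                return chunk[newline_at + 1:] + tail, bytes_read
            tail = chunk + tail
        # No newline found: the whole file is a single line.
        return tail, bytes_read
```

For a file whose last line is shorter than step bytes, the loop finishes after one read, so bytes_read equals step regardless of how large the file is.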
    from os import SEEK_END

    def get_last_line(file):
        # file must be opened in binary mode ('rb'); the result is bytes.
        CHUNK_SIZE = 1024 # Would be good to make this the chunk size of the filesystem
        last_line = b""
        file.seek(0, SEEK_END)
        remaining = file.tell() # Number of bytes before the part already read

        while remaining > 0:
            # We grab chunks from the end of the file towards the beginning until we
            # get a new line. Clamp the chunk so we never seek before the start of the file.
            size = min(CHUNK_SIZE, remaining)
            remaining -= size
            file.seek(remaining)
            chunk = file.read(size)

            if not last_line and chunk.endswith(b'\n'):
                # Ignore the trailing newline at the end of the file (but include it
                # in the output).
                last_line = b'\n'
                chunk = chunk[:-1]

            nl_pos = chunk.rfind(b'\n')
            # What's being searched for will have to be modified if you are searching
            # files with non-unix line endings.
            last_line = chunk[nl_pos + 1:] + last_line

            if nl_pos != -1:
                return last_line

        # The whole file is one big line
        return last_line
If n is larger than the file size, file.seek(-n, os.SEEK_END) raises IOError: [Errno 22] Invalid argument. - Mike DeSimone

    # Get the last line of a text file using the seek method. Works with a non-constant block size.
    # IDK if that speeds things up, but it's good enough for us,
    # especially with constant line lengths in the file (provided by len_guess),
    # in which case the block size doubling is rarely performed, if at all. Currently,
    # we're using this on a text file format with constant line lengths.
    # Requires that the file is opened in binary mode. No nonzero end-relative seeks in text mode.
    # Returns bytes; decode the result if a string is needed.
    REL_FILE_END = 2

    def lastTextFileLine(file, len_guess=1):
        file.seek(0, REL_FILE_END)
        j = file.tell() # j = number of bytes not yet searched, counted from the start of the file
        if j == 0:
            return b'' # empty file: seeking back from the end would fail
        file.seek(-1, REL_FILE_END) # -1 => 1 char back from end of file
        if file.read(1) == b'\n': # if newline is the last character, we want the text right before it
            j -= 1
        blocks = [] # For storing successive search blocks, so that we don't search what was already searched
        block_sz = len_guess
        while j > 0:
            block_sz = min(block_sz, j) # in case our block doubling takes us past the start of the file
            j -= block_sz
            file.seek(j)
            text = file.read(block_sz)
            i = text.rfind(b'\n')
            if i != -1:
                return text[i+1:] + b''.join(reversed(blocks))
            blocks.append(text)
            block_sz <<= 1 # double block size (converge with an open-ended binary-search-like strategy)
        return b''.join(reversed(blocks)) # if newline was never found, return everything read
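    lines = fileHandle.readlines()
    fileHandle.close()
    last_line = lines[-1]

[...] lines[len(lines) -1]. That is an O(n) operation. lines[-1] gets the last element. Besides, this approach is no better than the one he is already using. - g.d.d.c

lines[len(lines)-1] is not O(n) (unless lines is a user-defined type whose __len__ method has an O(n) implementation, but that is not the case here). Although it is poor style, lines[len(lines)-1] has almost the same runtime cost as lines[-1]; the only difference is whether the index computation is done explicitly by the script or implicitly by the runtime. - Adam Rosenfield

[...] reads a potentially large file into memory before the O(1) operation. - gustafbstrom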
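If scanning the whole file is acceptable but holding every line in memory is not, one possible middle ground is to let collections.deque keep only the most recent line while iterating. This still takes time proportional to the file size, so it addresses only the memory concern raised above, not the speed. The function name last_line_streaming is made up for this sketch.

```python
from collections import deque

def last_line_streaming(path):
    """Return the last line of path as bytes, or None for an empty file."""
    with open(path, "rb") as f:
        # A deque with maxlen=1 discards each line as soon as the next one arrives,
        # so only one line is held in memory at a time.
        last = deque(f, maxlen=1)
    return last[0] if last else None
```

The returned line keeps its trailing newline, if the file has one, exactly as readlines() would return it.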