Efficiently finding the last line in a text file

Question

python text

52

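I need to extract the last line from a number of very large (several hundred megabyte) text files to get certain data. Currently, I use Python to loop through all the lines until the file is exhausted and then process the last line returned, but I am sure there is a more efficient way to do this.

What is the best way to retrieve just the last line of a text file using Python?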

- TimothyAWiseman

You might take a look at this link: https://dev59.com/jXVC5IYBdhLWcg3w9GHM It is very close to what you need. - Martin

Is this a Python question, or would an answer using awk or sed be just as good? - Eric Wilson

3

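You need to supply one crucial piece of information (which many answers ignore completely): the encoding of your files. - John Machin

The provided algorithms would only break with a multi-byte encoding such as UTF-16 or UTF-32. - Mike DeSimone

Thanks, that got me very close, and with a little tweaking I got the result I needed. - TimothyAWiseman

@Eric, this is at my office, which is a Windows environment, so Python is best, though PowerShell would work too. - TimothyAWiseman

11 Answers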

46

with open('output.txt', 'r') as f:
    # Reads the whole file into memory, then splits it into lines.
    lines = f.read().splitlines()
    last_line = lines[-1]
    print(last_line)

- mick barry

1

Best solution, and quick to implement. - salah

47

This does not work well when you are dealing with multi-GB text files and only need to check the last line. - john stamos

7

I don't think this approach is efficient enough for very large text files. - Minh Hoàng

1

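IndexError: list index out of range. Is there a way to store more data? - Mr Coder

Before accessing index -1, you should first check whether the file is empty, which you can do by testing if lines:. - Katu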
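
A minimal sketch of the guard Katu describes, wrapped in a hypothetical helper named last_line_or_none (the name and the None return value are illustrative assumptions, not part of the answer). It still reads the entire file into memory, so it only addresses the IndexError, not the efficiency concern raised above.

```python
def last_line_or_none(path):
    """Return the last line of a text file, or None if the file is empty."""
    with open(path, 'r') as f:
        lines = f.read().splitlines()
    # An empty file yields an empty list, so lines[-1] would raise IndexError.
    if lines:
        return lines[-1]
    return None
```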

14

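Use the file's seek method with a negative offset and whence=os.SEEK_END to read a block from the end of the file. Search that block for the last line-end character and grab everything after it. If the block contains no line end, back up farther and repeat the process.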

import os

# Written for Python 2, where a file's contents can be searched as str
# and end-relative seeks are allowed on ordinary file objects.
def last_line(in_file, block_size=1024, ignore_ending_newline=False):
    suffix = ""
    in_file.seek(0, os.SEEK_END)
    in_file_length = in_file.tell()
    seek_offset = 0

    while -seek_offset < in_file_length:
        # Read from end.
        seek_offset -= block_size
        if -seek_offset > in_file_length:
            # Limit if we ran out of file (can't seek backward from start).
            block_size -= -seek_offset - in_file_length
            if block_size == 0:
                break
            seek_offset = -in_file_length
        in_file.seek(seek_offset, os.SEEK_END)
        buf = in_file.read(block_size)

        # Search for line end.
        if ignore_ending_newline and seek_offset == -block_size and buf[-1] == '\n':
            buf = buf[:-1]
        pos = buf.rfind('\n')
        if pos != -1:
            # Found line end.
            return buf[pos+1:] + suffix

        suffix = buf + suffix

    # One-line file.
    return suffix

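Note that this will not work on things that don't support seek, such as stdin or sockets. In those cases, you are stuck reading the whole thing, as the tail command does.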

- Mike DeSimone

1

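I think this answer only works properly in Python 2. At least, it did not work for me in Python 3, because in Python 3 you cannot seek relative to the end of a file opened in text mode (it raises an io exception). To update it for Python 3: open the file in binary mode, then handle buf as bytes rather than as a string (making sure to compare buf[-1:] == b'\n'). If you are sure the file is UTF-8 encoded, you can return a string with suffix.decode('utf-8'). - Multihunter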
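
A sketch of what such a Python 3 port might look like, under the assumptions stated in the comment above (binary mode, bytes throughout). The name last_line_py3 is hypothetical, and this version uses absolute offsets computed from the file size rather than end-relative seeks, which is a simplification swapped in here rather than the answerer's own code.

```python
import os

def last_line_py3(in_file, block_size=1024, ignore_ending_newline=False):
    """Return the last line of a file opened in binary mode ('rb'), as bytes."""
    suffix = b""
    in_file.seek(0, os.SEEK_END)
    remaining = in_file.tell()  # bytes not yet examined, counted from the start
    first_block = True

    while remaining > 0:
        # Read the next block, walking backward from the end of the file.
        size = min(block_size, remaining)
        remaining -= size
        in_file.seek(remaining)
        buf = in_file.read(size)

        # Optionally drop a single newline at the very end of the file.
        if first_block and ignore_ending_newline and buf[-1:] == b'\n':
            buf = buf[:-1]
        first_block = False

        pos = buf.rfind(b'\n')
        if pos != -1:
            # Found a line end: everything after it belongs to the last line.
            return buf[pos + 1:] + suffix
        suffix = buf + suffix

    # No newline found: the whole file is one line.
    return suffix
```

Called on a file opened with open(path, 'rb'), the result can be turned into a string with .decode('utf-8') if the file is known to be UTF-8.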

8

If you know the maximum length of a line, you can do the following:

def getLastLine(fname, maxLineLength=80):
    # Binary mode is required for seeking relative to the end of the file.
    with open(fname, "rb") as fp:
        fp.seek(-maxLineLength-1, 2) # 2 means "from the end of the file"
        return fp.readlines()[-1]

This works on my Windows machine, but I do not know what happens on other platforms if you open a text file in binary mode. Binary mode is needed if you want to use seek().

- rocksportrocker

2

What if you don't know the maximum line length? - Adam Rosenfield

1

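This and Mike's answer are both "the right way", but both have problems with anything other than a simple (single-byte, e.g. ASCII) text encoding. Unicode can have multi-byte characters, so in that case (1) you do not know the relative byte offset for a given maximum length in characters, and (2) you may seek into the "middle" of a character. - andrew cooke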

1

@andrew, even if you start in the middle of a character, the byte code for a line ending is still unique in UTF-8. That is one of the beauties of UTF-8. - Mark Ransom

1

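@andrew: UTF-8 can be synchronized mid-stream because the bytes in the representation of code points >= U+80 all have the high bit set. So if the high bit is clear, the byte is a low-ASCII character. This makes us parser writers happy. On the other hand, there are formats like Shift-JIS that encode non-low-ASCII characters as two bytes but only guarantee that the first byte has the high bit set. Fortunately, they did not use control characters for the second byte. - Mike DeSimone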
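
The property described in the two comments above can be checked directly. A small sketch, using an arbitrary sample string chosen purely for illustration: every byte of a multi-byte UTF-8 sequence has the high bit set, so a 0x0A byte is always a real newline, and the bytes after the last one decode cleanly. The final check shows why UTF-16 does not share this guarantee.

```python
text = "naïve café\n日本語 résumé"
data = text.encode('utf-8')

# Every byte belonging to a non-ASCII character has its high bit set (>= 0x80),
# so none of them can be mistaken for b'\n' (0x0A).
for ch in text:
    encoded = ch.encode('utf-8')
    if len(encoded) > 1:
        assert all(b >= 0x80 for b in encoded)

# Splitting the raw bytes at the last newline therefore never cuts a character in half.
last_line = data[data.rfind(b'\n') + 1:].decode('utf-8')

# In UTF-16 the byte 0x0A can occur inside another character, e.g. U+010A.
newline_byte_in_utf16_char = b'\n' in "\u010a".encode('utf-16-le')
```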

1

file() is not supported in Python 3; use open() instead. - Ludo Schmidt

7

If you can pick a reasonable maximum line length, you can seek to nearly the end of the file before you start reading.

myfile.seek(-max_line_length, os.SEEK_END)
line = myfile.readlines()[-1]

- Mark Ransom

I think you need to seek one byte further back, because readlines() includes the line terminator. - rocksportrocker

5

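Seek to the end of the file minus 100 bytes or so. Do a read and search for a newline. If there is no newline, seek back another 100 bytes or so. Repeat until you find a newline; the last line begins immediately after it.

In the best case you only read 100 bytes.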

- Bryan Oakley

2

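The inefficiency here is not really due to Python, but to the nature of how files are read. The only way to find the last line is to read the file and look for the line endings. However, the seek operation can jump to any byte offset in the file. You can therefore start very close to the end of the file and grab larger and larger chunks as needed until you find the last line ending: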

from os import SEEK_END

def get_last_line(file):
  CHUNK_SIZE = 1024 # Would be good to make this the chunk size of the filesystem

  last_line = ""

  while True:
    # We grab chunks from the end of the file towards the beginning until we 
    # get a new line
    # Note: this seek raises an error if the offset reaches past the start of the file.
    file.seek(-len(last_line) - CHUNK_SIZE, SEEK_END)
    chunk = file.read(CHUNK_SIZE)

    if not chunk:
      # The whole file is one big line
      return last_line

    if not last_line and chunk.endswith('\n'):
      # Ignore the trailing newline at the end of the file (but include it 
      # in the output).
      last_line = '\n'
      chunk = chunk[:-1]

    nl_pos = chunk.rfind('\n')
    # What's being searched for will have to be modified if you are searching
    # files with non-unix line endings.

    last_line = chunk[nl_pos + 1:] + last_line

    if nl_pos == -1:
      # The whole chunk is part of the last line.
      continue

    return last_line

- Zack Bloom

1

If n is greater than the file size, file.seek(-n, os.SEEK_END) raises IOError: [Errno 22] Invalid argument. - Mike DeSimone
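
One way to avoid that error is to clamp the offset to the file size before seeking. A minimal sketch, assuming a file opened in binary mode; seek_back_from_end is a hypothetical helper name introduced for illustration.

```python
import os

def seek_back_from_end(f, n):
    """Position f at most n bytes before its end; return how many bytes lie ahead."""
    f.seek(0, os.SEEK_END)
    size = f.tell()
    n = min(n, size)   # never ask for more bytes than the file holds
    f.seek(size - n)   # absolute seek to the same spot as seek(-n, os.SEEK_END)
    return n
```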

1

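Here is a slightly different solution. Rather than handling multiple lines, I focus on just the last line, and I use a dynamic (doubling) block size rather than a fixed one. See the comments for more information.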

# Get the last line of a text file using seek(), with a non-constant block size.
# IDK if that speeds things up, but it's good enough for us,
# especially with constant line lengths in the file (provided by len_guess),
# in which case the block size doubling is not performed much, if at all. Currently,
# we're using this on a text file format with constant line lengths.
# Requires that the file is opened in binary mode: no nonzero end-relative seeks in text mode.
# Returns bytes; decode with the file's encoding if you need a str.
REL_FILE_END = 2
def lastTextFileLine(file, len_guess=1):
    file.seek(0, REL_FILE_END)
    if file.tell() == 0:     # empty file: nothing to read
        return b''
    file.seek(-1, REL_FILE_END)      # -1 => 1 byte back from end of file
    text = file.read(1)
    tot_sz = 1              # total bytes consumed from the end, so we know where to seek next
    if text != b'\n':        # if newline is the last character, we want the text right before it
        file.seek(0, REL_FILE_END)    # else, consider the text all the way at the end (after last newline)
        tot_sz = 0
    blocks = []           # successive search blocks, so we never re-search what was already searched
    j = file.tell() - tot_sz     # j = number of bytes not yet examined
    not_done = True
    block_sz = len_guess
    while not_done:
        if j <= block_sz:   # block doubling would take us past the start of the file
            block_sz = j
            not_done = False
        tot_sz += block_sz
        file.seek(-tot_sz, REL_FILE_END)         # negative offsets seek backward from file end
        text = file.read(block_sz)
        i = text.rfind(b'\n')
        if i != -1:
            return text[i+1:] + b''.join(reversed(blocks))
        blocks.append(text)
        j -= block_sz         # account for the block just read
        block_sz <<= 1    # double block size (open-ended binary-search-like strategy)
    return b''.join(reversed(blocks))      # if newline was never found, return everything read

Ideally, you would wrap this in a class named LastTextFileLine and keep track of a moving average of line lengths. That would give you a good len_guess.
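
A rough sketch of that idea, not the answerer's code. The class name LastLineReader, the initial_guess and alpha parameters, and the choice of an exponential moving average are all assumptions made for illustration; it reads from a file opened in binary mode and returns bytes.

```python
class LastLineReader:
    """Fetch last lines while keeping a running estimate of the line length."""

    def __init__(self, initial_guess=80, alpha=0.2):
        self.len_guess = initial_guess   # current estimate, in bytes
        self.alpha = alpha               # weight given to the newest observation

    def last_line(self, f):
        """Return the last line of binary file f as bytes and update the estimate."""
        f.seek(0, 2)
        end = f.tell()
        # Ignore a single trailing newline.
        if end > 0:
            f.seek(end - 1)
            if f.read(1) == b'\n':
                end -= 1
        block = max(1, int(self.len_guess) + 1)   # +1 to also catch the preceding newline
        start = end
        line = b''
        while start > 0:
            size = min(block, start)
            start -= size
            f.seek(start)
            chunk = f.read(size)
            pos = chunk.rfind(b'\n')
            if pos != -1:
                line = chunk[pos + 1:] + line
                break
            line = chunk + line
            block *= 2                   # double the block size after each miss
        # Exponential moving average of the observed line lengths.
        self.len_guess += self.alpha * (len(line) - self.len_guess)
        return line
```

Reusing one reader across many files with similar line lengths lets the first read of each file usually cover the whole last line.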

- user1277936

0

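Could you load the file into an mmap, then use mmap.rfind(string[, start[, end]]) to find the second-to-last EOL character in the file? Seeking to just past that point should put you at the start of the last line.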
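
A minimal sketch of the mmap idea, assuming Unix-style '\n' line endings (with '\r\n' endings the result would keep a trailing '\r'). The function name last_line_mmap is hypothetical, and the empty-file check is there because an empty file cannot be memory-mapped.

```python
import mmap
import os

def last_line_mmap(path):
    """Return the last line of the file at path as bytes, without its trailing newline."""
    with open(path, 'rb') as f:
        if os.fstat(f.fileno()).st_size == 0:
            return b''   # an empty file cannot be mapped
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            end = len(mm)
            # Skip one trailing newline so the search finds the newline before the last line.
            if mm[end - 1:end] == b'\n':
                end -= 1
            # rfind returns -1 when there is no newline, which makes start 0.
            start = mm.rfind(b'\n', 0, end) + 1
            return mm[start:end]
```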

- ChrisC

-3

lines = fileHandle.readlines()
fileHandle.close()
last_line = lines[-1]

- Jon Martin

2

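Aah! Never use lines[len(lines) - 1]. That is an O(n) operation. lines[-1] gets the last element. Besides, this approach is no better than what he is already doing. - g.d.d.c

Oops, my mistake! This method actually is more efficient, though. - Jon Martin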

11

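lines[len(lines)-1] is not O(n) (unless lines is a user-defined type whose __len__ method has an O(n) implementation, which is not the case here). Although it is poor style, lines[len(lines)-1] has almost exactly the same runtime cost as lines[-1]; the only difference is whether the index computation is done explicitly by the script or implicitly by the runtime. - Adam Rosenfield

However, this sounds very wasteful of memory, since you have to read a potentially huge file into memory before performing said O(1) operation. - gustafbstrom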

- sth · Accepted Answer

55

This is not the most direct way, but it will probably be much faster than a simple Python implementation:

line = subprocess.check_output(['tail', '-1', filename])

- sth

2

You need to add [0:-1] at the end, because it seems to leave a '\n' at the end... - Carter Tazio Schonwald
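
A small sketch combining the answer with the comment above, assuming a tail program is available on the PATH. The helper name last_line_tail is hypothetical. It strips the line terminator with rstrip instead of the [0:-1] slice, which is a substitution made here, and it passes -n 1, the POSIX spelling of the same option.

```python
import subprocess

def last_line_tail(filename):
    """Return the last line of filename as bytes, using the external tail program."""
    # check_output returns bytes; tail keeps the line's trailing newline.
    out = subprocess.check_output(['tail', '-n', '1', filename])
    return out.rstrip(b'\r\n')
```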

4

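This is not a very pythonic solution. - Maxime de Pachtere

I really liked this function, but when I used it in shared code I ran into a problem: there is no tail command on Windows. So my preference (Python 3.7, unformatted) is... with open(filename, 'r') as f: line = f.readlines()[-1] - John 9631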

2

@John9631, your solution is very slow, because readlines() reads every line into RAM; if the file is gigabytes in size, you will get a memory error! - Anu

2

Does Windows support tail? - Hojat Modaresi

@HojatModaresi, you can certainly install a tail program on your Windows machine, but it does not come with one. - Mark Ransom