Merging and sorting log files with Python

Question

12

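I am completely new to Python and I have a serious problem that I cannot solve.

I have a few log files with identical structure:

[timestamp] [level] [source] message

For example:

[Wed Oct 11 14:32:52 2000] [error] [client 127.0.0.1] error message

I need to write a program in pure Python that merges these log files into one file and then sorts the merged file by timestamp. After that, I want to print the result (the contents of the merged file) to STDOUT (the console).

I don't really understand how to do this and would appreciate some help. Is this possible?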

- BadUX

5 Answers

8

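First off, you will want to use the fileinput module to get the data from multiple files, like this:

import fileinput

# With no arguments, FileInput reads the files named on the command line
data = fileinput.FileInput()
for line in data:  # FileInput has no readlines(); iterate over it directly
    print(line, end="")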

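That prints all the lines together. You also want to sort, which you can do with the built-in sorted function.

Assuming each of your lines starts with something like [2011-07-20 19:20:12], you're in luck: that format needs nothing beyond plain alphanumeric sorting, so you can simply do:

data = fileinput.FileInput()
for line in sorted(data):
    print(line, end="")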

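However, if you need to do something more complex:

from functools import cmp_to_key

def compareDates(line1, line2):
    # parse the date here into datetime objects
    # (parse_date is a placeholder to be written for your log format)
    parseddate1, parseddate2 = parse_date(line1), parse_date(line2)
    # Then use those for the sorting; Python 3 has no cmp(), so compare by hand
    return (parseddate1 > parseddate2) - (parseddate1 < parseddate2)

data = fileinput.FileInput()
for line in sorted(data, key=cmp_to_key(compareDates)):
    print(line, end="")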

For bonus points, you can even do

data = fileinput.FileInput(openhook=fileinput.hook_compressed)

which will let you read gzipped log files as well.

Usage would be:

$ python yourscript.py access.log.1 access.log.*.gz

or similar.
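
Whether a plain sorted() is enough depends on the timestamp format. As a minimal sketch (the sample lines and the parse_apache helper are made up for illustration), ISO-style stamps sort correctly as strings, while the format from the question does not, because each line then starts with the weekday name:

```python
from datetime import datetime

# Hypothetical sample lines, invented for this illustration
iso_lines = [
    "[2011-07-20 19:20:12] [error] second\n",
    "[2010-01-05 08:00:00] [error] first\n",
]
apache_lines = [
    "[Wed Oct 11 14:32:52 2000] [error] first\n",
    "[Mon Oct 16 09:00:00 2000] [error] second\n",
]

def parse_apache(line):
    # Take the text between the leading "[" and the first "]" and parse it
    stamp = line[1:line.index("]")]
    return datetime.strptime(stamp, "%a %b %d %H:%M:%S %Y")

# ISO-style stamps: string order and time order agree
iso_sorted = sorted(iso_lines)

# Question's format: string order compares "Mon" with "Wed" first,
# so the later entry ends up in front
by_string = sorted(apache_lines)

# Sorting on the parsed timestamp restores chronological order
by_time = sorted(apache_lines, key=parse_apache)

print("".join(by_string), end="")
print("".join(by_time), end="")
```

Note that %a and %b depend on the current locale, so this assumes English day and month names.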

- MatthewWilkes

1

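That's fine as long as you can load all the log files into RAM. - FogleBird

Thanks for your answer. Very helpful. - BadUX

If you can be sure that your data files are sorted internally and you pass them in sorted order, you can drop the cmp. FileInput will iterate without loading everything into memory. - MatthewWilkes

But FileInput doesn't have a readlines() method? Has it been deprecated? - jodles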

2

As for the key function used for sorting:

def sort_key(line):
    return datetime.strptime(line.split(']')[0], '[%a %b %d %H:%M:%S %Y')

This should be passed as the key argument to sort or sorted, not as cmp. It is faster that way.

Oh, and you should have

from datetime import datetime

in your code to make this work.
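
A rough way to see why key tends to be faster: a key function runs once per line, while a comparison function runs once per comparison and parses two timestamps each time. The sketch below counts the parses for both approaches; the counters, the compare function and the sample lines are assumptions added for illustration.

```python
from datetime import datetime
from functools import cmp_to_key

calls = {"key": 0, "cmp": 0}  # number of timestamps parsed by each approach

def parse(line):
    return datetime.strptime(line.split(']')[0], '[%a %b %d %H:%M:%S %Y')

def sort_key(line):
    calls["key"] += 1
    return parse(line)

def compare(line1, line2):
    calls["cmp"] += 2  # two timestamps are parsed per comparison
    t1, t2 = parse(line1), parse(line2)
    return (t1 > t2) - (t1 < t2)

# Hypothetical sample data: eight lines in reverse chronological order
lines = [
    "[Wed Oct 11 14:32:%02d 2000] [error] [client 127.0.0.1] msg %d\n" % (s, s)
    for s in range(59, 51, -1)
]

by_key = sorted(lines, key=sort_key)
by_cmp = sorted(lines, key=cmp_to_key(compare))

print("parses with key:", calls["key"])
print("parses with cmp:", calls["cmp"])
```

Both calls produce the same order; only the amount of parsing differs.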

- Jasmijn

0

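Read the lines of both files into a list (they are now merged), provide a user-defined compare function that converts the timestamp to seconds since the epoch, call sort with the user-defined compare, and write the lines to the merged file...

import time
from functools import cmp_to_key

def to_epoch(line):
    # "[Wed Oct 11 14:32:52 2000] ..." -> seconds since the epoch
    stamp = line[1:line.index("]")]
    return time.mktime(time.strptime(stamp, "%a %b %d %H:%M:%S %Y"))

def compare_func(line1, line2):
    # comparison code: negative, zero or positive, like the old cmp()
    return to_epoch(line1) - to_epoch(line2)


lst = []

with open("file_1.log", "r") as f:
    lst.extend(f)

with open("file_2.log", "r") as f:
    lst.extend(f)

# Python 3 dropped the cmp argument, so wrap compare_func with cmp_to_key
lst.sort(key=cmp_to_key(compare_func))  # this could be a lambda if it is simple enough

That should do it.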

- bjarneh

0

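All the other answers here read in all the logs before the first line is printed, which can be very slow and can even cause errors if the logs are too big.

This solution uses a regex and a strptime format, like the solutions above, but it "merges" the logs as it goes.

This means you can pipe its output to "head" or "less" and expect it to respond quickly.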

import re
import time
import typing
from dataclasses import dataclass


t_fmt = "%Y%m%d.%H%M%S.%f"      # format of time stamps
t_pat = re.compile(r"([^ ]+)")  # pattern to extract timestamp

def get_time(line, prev_t):
    # uses the prev time if the time isn't found
    res = t_pat.search(line)
    if not res:
        return prev_t
    try:
        cur = time.strptime(res.group(1), t_fmt)
    except ValueError:
        return prev_t
    return cur

def print_sorted(files):
    @dataclass
    class FInfo:
        path: str
        fh: typing.TextIO
        cur_l: str = ""
        cur_t: typing.Optional[time.struct_time] = None

        def __read(self):
            line = self.fh.readline()
            self.cur_l += line
            if not line:
                # eof found, set a time so read() stops looping
                self.cur_t = time.localtime(time.time() + 86400)
            else:
                # parse only the new line; lines without a time keep the previous one
                self.cur_t = get_time(line, self.cur_t)

        def read(self):
            # clear out the current line, and read
            self.cur_l = ""
            self.__read()
            while self.cur_t is None:
                self.__read()

    finfos = []
    for f in files:
        try:
            fh = open(f, "r")
        except FileNotFoundError:
            continue
        fi = FInfo(f, fh)
        fi.read()
        finfos.append(fi)

    while True:
        # keep only the files that still have a pending entry
        live = [fi for fi in finfos if fi.cur_l]
        if not live:
            break
        # get file with first log entry
        fi = min(live, key=lambda x: x.cur_t)
        print(fi.cur_l, end="")
        fi.read()
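
The same streaming behaviour can probably be had with less code from the standard library's heapq.merge, which is a different technique from the hand-written loop above. This is only a sketch under two assumptions: each input file is already in timestamp order, and every line starts with a bracketed timestamp in the question's format. The names merge_logs and timestamp are hypothetical.

```python
import heapq
import re
import sys
from contextlib import ExitStack
from time import strptime

T_FMT = "%a %b %d %H:%M:%S %Y"    # timestamp format from the question
T_PAT = re.compile(r"\[(.+?)\]")  # first bracketed field of a line

def timestamp(line):
    # Assumes every line starts with a parseable "[timestamp]" field
    return strptime(T_PAT.search(line).group(1), T_FMT)

def merge_logs(paths, out=sys.stdout):
    # Assumes each file is already sorted by time; heapq.merge then
    # holds only one pending line per file instead of the whole input
    with ExitStack() as stack:
        files = [stack.enter_context(open(p)) for p in paths]
        for line in heapq.merge(*files, key=timestamp):
            out.write(line)
```

Calling merge_logs(sys.argv[1:]) would then print the merged log. If the individual files are not sorted, the output will not be sorted either, and a full sort is needed instead.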

- Erik Aronesty

- mhyfritz · Accepted Answer

You can do this:

import fileinput
import re
from time import strptime

f_names = ['1.log', '2.log'] # names of log files
lines = list(fileinput.input(f_names))
t_fmt = '%a %b %d %H:%M:%S %Y' # format of time stamps
t_pat = re.compile(r'\[(.+?)\]') # pattern to extract timestamp
for l in sorted(lines, key=lambda l: strptime(t_pat.search(l).group(1), t_fmt)):
    print(l, end='')  # lines already end with a newline