Parsing a text file using Python


I am a Python newbie and would like to use it to parse a text file. The file has the following format and contains 250-300 lines:

---- Mark Grey (mark.grey@gmail.com) changed status from Busy to Available @ 14/07/2010 16:32:36 ----
----  Silvia Pablo (spablo@gmail.com) became Available @ 14/07/2010 16:32:39 ----

I need to store the following information from every entry in this file into another file (Excel or text):
UserName/ID  Previous Status New Status Date Time

So, for the entries above, my result file should look like this:
Mark Grey/mark.grey@gmail.com  Busy Available 14/07/2010 16:32:36
Silvia Pablo/spablo@gmail.com  NaN  Available 14/07/2010 16:32:39

Thanks in advance; any help is much appreciated.


Edit note: Marcelo and Tim have given you very good answers that will do what you need. Here is the documentation for the regular expression library included with Python, which may help you extend the code further: http://docs.python.org/library/re.html - Andrei Sosnin
Okay, that's not a number :) - Tim Pietzcker
6 Answers


To get you started:

import re

result = []
regex = re.compile(
    r"""^-*\s+
    (?P<name>.*?)\s+
    \((?P<email>.*?)\)\s+
    (?:changed\s+status\s+from\s+(?P<previous>.*?)\s+to|became)\s+
    (?P<new>.*?)\s+@\s+
    (?P<date>\S+)\s+
    (?P<time>\S+)\s+
    -*$""", re.VERBOSE)
with open("inputfile") as f:
    for line in f:
        match = regex.match(line)
        if match:
            result.append([
                match.group("name"),
                match.group("email"),
                match.group("previous")
                # etc.
            ])
        else:
            pass  # Match attempt failed; handle or log the offending line here

This will give you a list of the matched parts. I'd then suggest you use the csv module to store the results in a standard format.
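For example, a minimal Python 3 sketch of that last step (the file name output.csv and the header row are my own choices, and it assumes the result rows above were completed with all of the captured groups):

import csv

# Write one CSV row per parsed status change collected in result.
with open("output.csv", "w", newline="") as out:
    writer = csv.writer(out)
    writer.writerow(["Name", "Email", "Previous Status", "New Status", "Date", "Time"])
    writer.writerows(result)

csv.writer takes care of quoting fields that contain spaces, so the output stays easy to open in Excel or load back into Python.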


import re

pat = re.compile(r"----\s+(.*?) \((.*?)\) (?:changed status from (\w+) to|became) (\w+) @ (.*?) ----\s*")
with open("data.txt") as f:
    for line in f:
        (name, email, prev, curr, date) = pat.match(line).groups()
        print "{0}/{1}  {2} {3} {4}".format(name, email, prev or "NaN", curr, date)

This makes assumptions about the whitespace and assumes that every line fits the pattern. If you want to handle dirty data gracefully, you may want to add error checking (for example, checking whether pat.match() returns None).
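For instance, a sketch of that check as a Python 3 rewrite of the same loop (the wording of the warning message is my own):

import re
import sys

pat = re.compile(r"----\s+(.*?) \((.*?)\) (?:changed status from (\w+) to|became) (\w+) @ (.*?) ----\s*")

with open("data.txt") as f:
    for line in f:
        m = pat.match(line)
        if m is None:
            # Report and skip lines that do not fit the expected pattern.
            print("Skipping unrecognised line: %r" % line, file=sys.stderr)
            continue
        name, email, prev, curr, date = m.groups()
        print("{0}/{1}  {2} {3} {4}".format(name, email, prev or "NaN", curr, date))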



The two RE patterns of interest appear to be...:

p1 = r'^---- ([^(]+) \(([^)]+)\) changed status from (\w+) to (\w+) (\S+) (\S+) ----$'
p2 = r'^---- ([^(]+) \(([^)]+)\) became (\w+) (\S+) (\S+) ----$'

So I would do something like:
import csv, re, sys

# assign p1, p2 as above (or enhance them, etc etc)

r1 = re.compile(p1)
r2 = re.compile(p2)
data = []

with open('somefile.txt') as f:
    for line in f:
        m = r1.match(line)
        if m:
            data.append(m.groups())
            continue
        m = r2.match(line)
        if not m:
            print>>sys.stderr, "No match for line: %r" % line
            continue
        listofgroups = list(m.groups())
        listofgroups.insert(2, 'NaN')
        data.append(listofgroups)

with open('result.csv', 'w') as f:
    w = csv.writer(f)
    w.writerow('UserName/ID Previous Status New Status Date Time'.split())
    w.writerows(data)

If the two patterns I've described are not general enough, they will of course need tweaking, but I think this general approach will serve you well. While many Python users on Stack Overflow strongly dislike REs, I find them very useful for this kind of pragmatic ad hoc text processing.
Perhaps that dislike comes from people wanting to use REs for absurd purposes, such as ad hoc parsing of CSV, HTML, XML and other structured text formats for which perfectly good parsers already exist! There are also tasks well beyond the RE "comfort zone" that call for a solid general parser system such as pyparsing. And at the other extreme there are super-simple tasks handled perfectly well with plain strings (e.g., I remember a recent SO question that used if re.search('something', s): instead of if 'something' in s:!-).
But for the reasonably broad range of tasks in between (excluding the very simplest ones at one end and the parsing of structured or somewhat complicated grammars at the other), REs are appropriate, there is nothing wrong with using them, and I recommend that every programmer learn at least the basics of REs.


Alex mentioned pyparsing, so here is a pyparsing solution to the same problem:

from pyparsing import Word, Suppress, Regex, oneOf, SkipTo
import datetime

DASHES = Word('-').suppress()
LPAR,RPAR,AT = map(Suppress,"()@")
date = Regex(r'\d{2}/\d{2}/\d{4}')
time = Regex(r'\d{2}:\d{2}:\d{2}')
status = oneOf("Busy Available Idle Offline Unavailable")

statechange1 = 'changed status from' + status('fromstate') + 'to' + status('tostate')
statechange2 = 'became' + status('tostate')
linefmt = (DASHES + SkipTo('(')('name') + LPAR + SkipTo(RPAR)('email') + RPAR + 
            (statechange1 | statechange2) +
            AT + date('date') + time('time') + DASHES)

def convertFields(tokens):
    if 'fromstate' not in tokens:
        tokens['fromstate'] = 'NULL'
    tokens['name'] = tokens.name.strip()
    tokens['email'] = tokens.email.strip()
    d,mon,yr = map(int, tokens.date.split('/'))
    h,m,s = map(int, tokens.time.split(':'))
    tokens['datetime'] = datetime.datetime(yr, mon, d, h, m, s)
linefmt.setParseAction(convertFields)

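# Note: text is assumed to be a string holding the sample log lines from the question.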
for line in text.splitlines():
    fields = linefmt.parseString(line)
    print "%(name)s/%(email)s  %(fromstate)-10.10s %(tostate)-10.10s %(datetime)s" % fields

Prints:

Mark Grey/mark.grey@gmail.com  Busy       Available  2010-07-14 16:32:36
Silvia Pablo/spablo@gmail.com  NULL       Available  2010-07-14 16:32:39

pyparsing lets you attach names to the result fields (just like the named groups in Tim Pietzcker's RE-based answer), and it supports parse-time actions that act on or manipulate the parsed tokens. Note how the separate date and time fields are converted into a true datetime object at parse time, already converted and ready for post-processing with no extra fuss.

Here is a modified loop that just dumps out the parsed tokens and named fields for each line:

for line in text.splitlines():
    fields = linefmt.parseString(line)
    print fields.dump()

Prints:

['Mark Grey ', 'mark.grey@gmail.com', 'changed status from', 'Busy', 'to', 'Available', '14/07/2010', '16:32:36']
- date: 14/07/2010
- datetime: 2010-07-14 16:32:36
- email: mark.grey@gmail.com
- fromstate: Busy
- name: Mark Grey
- time: 16:32:36
- tostate: Available
['Silvia Pablo ', 'spablo@gmail.com', 'became', 'Available', '14/07/2010', '16:32:39']
- date: 14/07/2010
- datetime: 2010-07-14 16:32:39
- email: spablo@gmail.com
- fromstate: NULL
- name: Silvia Pablo
- time: 16:32:39
- tostate: Available

I expect that as you continue to work on this problem, you will find other variations in the format of the input text that specify how a user's status changed. In that case, you would just add another definition like statechange1 or statechange2 and insert it into linefmt with the others. I think pyparsing's way of structuring the parser definition helps developers come back to a parser after things have changed and easily extend their parsing program.
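For instance, a sketch of such an extension, reusing the definitions from the answer above (the "went" wording and the name statechange3 are hypothetical and do not appear in the actual log format):

statechange3 = 'went' + status('tostate')   # hypothetical wording, e.g. "... went Offline @ ..."

linefmt = (DASHES + SkipTo('(')('name') + LPAR + SkipTo(RPAR)('email') + RPAR +
            (statechange1 | statechange2 | statechange3) +
            AT + date('date') + time('time') + DASHES)
linefmt.setParseAction(convertFields)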


Thank you all very much for your comments; they were very useful. I wrote my code to work over a directory of data files. What it does is read the files and create, for each user, an output file containing all of their status updates. The code is pasted below.
#Script to extract info from individual data files and print out a data file combining info from these files

import os
import commands

dataFileDir="data/";

#Dictionary linking names to email ids
#For the time being, assume no 2 people have the same name
usrName2Id={};

#User id  to user name mapping to check for duplicate names
usrId2Name={};

#Store info: key: user ids and values a dictionary with time stamp keys and status messages values
infoDict={};

#Given an array of space tokenized inputs, extract user name
def getUserName(info,mailInd):

    userName="";
    for i in range(mailInd-1,0,-1):

        if info[i].endswith("-") or info[i].endswith("+"):
            break;

        userName=info[i]+" "+userName;

    userName=userName.strip();
    userName=userName.replace("  "," ");
    userName=userName.replace(" ","_");

    return userName;

#Given an array of space tokenized inputs, extract time stamp
def getTimeStamp(info,timeStartInd):
    timeStamp="";
    for i in range(timeStartInd+1,len(info)):
        timeStamp=timeStamp+" "+info[i];

    timeStamp=timeStamp.replace("-","");
    timeStamp=timeStamp.strip();
    return timeStamp;

#Given an array of space tokenized inputs, extract status message
def getStatusMsg(info,startInd,endInd):
    msg="";
    for i in range(startInd,endInd):
        msg=msg+" "+info[i];
    msg=msg.strip();
    msg=msg.replace(" ","_");
    return msg;

#Extract and store info from each line in the datafile
def extractLineInfo(line):

    print line;
    info=line.split(" ");

    mailInd=-1;userId="-NONE-";
    timeStartInd=-1;timeStamp="-NONE-";
    becameInd=-1;
    statusMsg="-NONE-";

    #Find indices of email id and "@" char indicating start of timestamp
    for i in range(0,len(info)):
        #print (str(i)+" "+info[i]);
        if(info[i].startswith("(") and info[i].endswith("@in.ibm.com)")):
            mailInd=i;
        if(info[i]=="@"):
            timeStartInd=i;

        if(info[i]=="became"):
            becameInd=i;

    #Debug print of mail and time stamp start inds
    """print "\n";
    print "Index of mail id: "+str(mailInd);
    print "Index of time start index: "+str(timeStartInd);
    print "\n";"""

    #Extract IBM user id and name for lines with ibm id
    if(mailInd>=0):
        userId=info[mailInd].replace("(","");
        userId=userId.replace(")","");
        userName=getUserName(info,mailInd);
    #Lines with no ibm id are of the form "Suraj Godar Mr became idle @ 15/07/2010 16:30:18"
    elif(becameInd>0):
        userName=getUserName(info,becameInd);

    #Time stamp info
    if(timeStartInd>=0):
        timeStamp=getTimeStamp(info,timeStartInd);
        if(mailInd>=0):
            statusMsg=getStatusMsg(info,mailInd+1,timeStartInd);
        elif(becameInd>0):
            statusMsg=getStatusMsg(info,becameInd,timeStartInd);

    print userId;
    print userName;
    print timeStamp
    print statusMsg+"\n";

    if not(userName in usrName2Id) and not(userName=="-NONE-") and not(userId=="-NONE-"):
        usrName2Id[userName]=userId;

    #Store status messages keyed by user email ids
    timeDict={};

    #Retrieve user id corresponding to user name
    if userName in usrName2Id:
        userId=usrName2Id[userName];

    #For valid user ids, store status message in the dict within dict data str arrangement
    if not(userId=="-NONE-"):

        if not(userId in infoDict.keys()):
            infoDict[userId]={};

        timeDict=infoDict[userId];
        if not(timeStamp in timeDict.keys()):
            timeDict[timeStamp]=statusMsg;
        else:
            timeDict[timeStamp]=timeDict[timeStamp]+" "+statusMsg;


#Print for each user a file containing status
def printStatusFiles(dataFileDir):


    volNum=0;

    for userName in usrName2Id:
        volNum=volNum+1;

        filename=dataFileDir+"/"+"status-"+str(volNum)+".txt";
        file = open(filename,"w");

        print "Printing output file name: "+filename;
        print volNum,userName,usrName2Id[userName]+"\n";
        file.write(userName+" "+usrName2Id[userName]+"\n");

        timeDict=infoDict[usrName2Id[userName]];
        for time in sorted(timeDict.keys()):
            file.write(time+" "+timeDict[time]+"\n");


#Read and store data from individual data files
def readDataFiles(dataFileDir):

    #Process each datafile
    files=os.listdir(dataFileDir)
    files.sort();
    for i in range(0,len(files)):
    #for i in range(0,1):

        file=files[i];

        #Do not process other non-data files lying around in that dir
        if not file.endswith(".txt"):
            continue

        print "Processing data file: "+file
        dataFile=dataFileDir+str(file);
        inpFile=open(dataFile,"r");
        lines=inpFile.readlines();

        #Process lines
        for line in lines:

            #Clean lines
            line=line.strip();
            line=line.replace("/India/Contr/IBM","");
            line=line.strip();

            #Skip header line of the file and L's sign in sign out times
            if(line.startswith("System log for account") or line.find("signed")>-1):
                continue;


            extractLineInfo(line);


print "\n";
readDataFiles(dataFileDir);
print "\n";
printStatusFiles("out/");

@yhw42, what did you edit in this ancient post? I'm curious. By the way, the poster hasn't been seen since August 2010. - eyquem
@eyquem: It (http://stackoverflow.com/suggested-edits/32961) was yelling at me, so I fixed the formatting. `:)` - yhw42
@yhw42 That was indeed bad. Thanks for the explanation. - eyquem

If I were solving this problem, I would probably start by splitting each entry into its own string. It looks line-based, so inputfile.split('\n') would probably suffice. From there I would write a regular expression to match each possible status change, wrapping each important field in a subgroup.
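A minimal Python 3 sketch of that idea (the file name inputfile.txt and the two patterns below are just my own illustrative rewrites of the formats shown in the question, not a tested solution):

import re

# One pattern per wording of a status change, with each interesting field in a group.
changed = re.compile(r'^-+\s+(.+?)\s*\((\S+?)\)\s+changed status from (\w+) to (\w+)\s+@\s+(\S+)\s+(\S+)\s+-+\s*$')
became = re.compile(r'^-+\s+(.+?)\s*\((\S+?)\)\s+became (\w+)\s+@\s+(\S+)\s+(\S+)\s+-+\s*$')

with open("inputfile.txt") as f:
    entries = f.read().split('\n')

for entry in entries:
    m = changed.match(entry)
    if m:
        name, email, prev, new, date, time = m.groups()
    else:
        m = became.match(entry)
        if not m:
            continue
        name, email, new, date, time = m.groups()
        prev = "NaN"
    print("%s/%s  %s %s %s %s" % (name, email, prev, new, date, time))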
