Python: How do I parse the "From", "To", and "Body" of a raw email source with Python?

4
A raw email usually looks like this:
From root@a1.local.tld Thu Jul 25 19:28:59 2013
Received: from a1.local.tld (localhost [127.0.0.1])
    by a1.local.tld (8.14.4/8.14.4) with ESMTP id r6Q2SxeQ003866
    for <ooo@a1.local.tld>; Thu, 25 Jul 2013 19:28:59 -0700
Received: (from root@localhost)
    by a1.local.tld (8.14.4/8.14.4/Submit) id r6Q2Sxbh003865;
    Thu, 25 Jul 2013 19:28:59 -0700
From: root@a1.local.tld
Subject: ooooooooooooooooooooooo
To: ooo@a1.local.tld
Cc: 
X-Originating-IP: 192.168.15.127
X-Mailer: Webmin 1.420
Message-Id: <1374805739.3861@a1>
Date: Thu, 25 Jul 2013 19:28:59 -0700 (PDT)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="bound1374805739"

This is a multi-part message in MIME format.

--bound1374805739
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

ooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooo

--bound1374805739--

So if I want to write a Python script to get
From
To
Subject
Body

Is this the code I need, or is there a better way?

a='<title>aaa</title><title>aaa2</title><title>aaa3</title>'

import re
a1 = re.findall(r'<(title)>(.*?)<(/title)>', a)

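Have you heard of PLY or, more specifically, PyParsing? If you are going to process a lot of emails, and those emails may contain characters that would break a hand-written parser, both of these Python packages are good tools for parsing files. You may want to try PyParsing first; it is the easiest to use. - kirbyfan64sos
5 Answers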

19

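I don't really see how your final code snippet relates to anything. You hadn't mentioned HTML at all before that point, so I don't understand why you suddenly give an example of parsing HTML (which, in any case, you should not do with regular expressions).

Anyway, to answer your original question: if you want to get the headers from an email message, the Python standard library already includes code for that: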

import email
msg = email.message_from_string(email_string)
msg['from']  # 'root@a1.local.tld'
msg['to']    # 'ooo@a1.local.tld'
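Header values come back as plain strings, so a header such as `To` may still hold several addresses or display names. A minimal sketch of splitting them with `email.utils` follows; the display names and the second recipient are made-up sample data, not part of the question's message.

```python
import email
from email.utils import parseaddr, getaddresses

# Hypothetical sample message: display names and second recipient are invented.
RAW = (
    "From: Root User <root@a1.local.tld>\n"
    "To: ooo@a1.local.tld, Second Person <second@a1.local.tld>\n"
    "Subject: ooooooooooooooooooooooo\n"
    "\n"
    "body text\n"
)

msg = email.message_from_string(RAW)

# Header lookup is case-insensitive; a header that is absent yields None.
sender_name, sender_addr = parseaddr(msg["from"])

# get_all returns every occurrence of a header; getaddresses splits them
# into (display name, address) pairs.
recipients = getaddresses(msg.get_all("to", []))

print(sender_name, sender_addr)
print(recipients)
print(msg["subject"])
print(msg["x-not-present"])
```

`parseaddr` handles a single address and `getaddresses` a list of header values, which avoids splitting on commas by hand.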

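I chose this answer because its approach is direct rather than indirect (no need to import a parser, etc.), which makes it friendlier. - Sumer Kolcak - user2621078
1
How do I get the body? - Abdur-Rahmaan Janhangeer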

14

Fortunately, Python makes this simpler: http://docs.python.org/2.7/library/email.parser.html#email.parser.Parser

from email.parser import Parser
parser = Parser()

email_text = """PUT THE RAW TEXT OF YOUR EMAIL HERE"""
# Named msg so that it does not shadow the email package
msg = parser.parsestr(email_text)

# print() with a single argument works on both Python 2 and Python 3
print(msg.get('From'))
print(msg.get('To'))
print(msg.get('Subject'))

The body is trickier. Call msg.is_multipart(). If it returns False, msg.get_payload() gives you the body. If it returns True, msg.get_payload() returns a list of message parts, so you need to call get_payload() on each one:

if msg.is_multipart():
    for part in msg.get_payload():
        print(part.get_payload())
else:
    print(msg.get_payload())
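Looping over `get_payload()` only reaches the first level of parts and returns the text still in its transfer encoding. A hedged sketch of a more general version uses `walk()` to visit nested parts and `get_payload(decode=True)` to undo base64 or quoted-printable. The helper name `text_parts` and the base64 sample message are assumptions made for illustration.

```python
from email.parser import Parser

# Hypothetical sample: same shape as the question's message, but the text
# part is base64-encoded to show why decoding matters.
RAW = """\
From: root@a1.local.tld
To: ooo@a1.local.tld
Subject: test
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="bound1374805739"

This is a multi-part message in MIME format.

--bound1374805739
Content-Type: text/plain; charset="utf-8"
Content-Transfer-Encoding: base64

aGVsbG8gd29ybGQ=

--bound1374805739--
"""


def text_parts(message):
    """Yield the decoded text of every text/plain leaf part."""
    for part in message.walk():
        if part.is_multipart():
            continue  # containers carry no text of their own
        if part.get_content_type() != "text/plain":
            continue  # skip attachments, HTML, and so on
        raw_bytes = part.get_payload(decode=True)  # undoes base64 / quoted-printable
        charset = part.get_content_charset() or "ascii"
        yield raw_bytes.decode(charset, errors="replace")


msg = Parser().parsestr(RAW)
bodies = list(text_parts(msg))
print(bodies)
```

`walk()` also yields the top-level message itself, so the same function covers messages that are not multipart.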

2

Your sample email is missing the "body" part.

You can use the email module:

import email
msg = email.message_from_string(email_message_as_text)

Then use:

print(msg['To'])
print(msg['From'])

... ... etc


I've been trying to build something similar but ran into a lot of problems in Python 3. How should this be done now? With this solution I'm getting None back. - Zach Oakes
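On Python 3, one way this can be done is to pass `policy=policy.default`, which gives an `EmailMessage` with `get_body()` and `get_content()`. The sketch below is an assumption about what the commenter needs, not a diagnosis of their code. It also shows one way to end up with `None`: a string that begins with a blank line is parsed as having no headers at all. The sample message is a shortened copy of the one in the question.

```python
from email import message_from_string, policy

# Shortened copy of the question's message. The backslash after the opening
# quotes keeps the string from starting with a blank line.
RAW = """\
From: root@a1.local.tld
Subject: ooooooooooooooooooooooo
To: ooo@a1.local.tld
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="bound1374805739"

This is a multi-part message in MIME format.

--bound1374805739
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

ooooooooooooooooooooooooooooooo

--bound1374805739--
"""

msg = message_from_string(RAW, policy=policy.default)

# get_body picks the best candidate for the main body, preferring plain text here.
body_part = msg.get_body(preferencelist=("plain", "html"))
body_text = body_part.get_content() if body_part is not None else ""

print(msg["From"], msg["To"], msg["Subject"])
print(body_text)

# A leading blank line ends the header section immediately,
# so every header lookup returns None.
broken = message_from_string("\n" + RAW, policy=policy.default)
print(broken["From"])
```

If the message comes from a file or a socket as bytes, `email.message_from_bytes` accepts the same `policy` argument.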

1
You should probably use email.parser
s = """\
From root@a1.local.tld Thu Jul 25 19:28:59 2013
Received: from a1.local.tld (localhost [127.0.0.1])
    by a1.local.tld (8.14.4/8.14.4) with ESMTP id r6Q2SxeQ003866
    for <ooo@a1.local.tld>; Thu, 25 Jul 2013 19:28:59 -0700
Received: (from root@localhost)
    by a1.local.tld (8.14.4/8.14.4/Submit) id r6Q2Sxbh003865;
    Thu, 25 Jul 2013 19:28:59 -0700
From: root@a1.local.tld
Subject: ooooooooooooooooooooooo
To: ooo@a1.local.tld
Cc: 
X-Originating-IP: 192.168.15.127
X-Mailer: Webmin 1.420
Message-Id: <1374805739.3861@a1>
Date: Thu, 25 Jul 2013 19:28:59 -0700 (PDT)
MIME-Version: 1.0
Content-Type: multipart/mixed; boundary="bound1374805739"

This is a multi-part message in MIME format.

--bound1374805739
Content-Type: text/plain
Content-Transfer-Encoding: 7bit

ooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooo
ooooooooooooooooooooooooooooooo

--bound1374805739--
"""

import email.parser

msg = email.parser.Parser().parsestr(s)
help(msg)

0

You can write the raw content to a file

and then read the file like this:

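# Read the raw message that was saved to in.txt
with open('in.txt', 'r') as f:
    raw = f.readlines()

get_list = ['From:', 'To:', 'Subject:']
info_list = []

for line in raw:
    # startswith accepts a tuple of prefixes
    if line.startswith(tuple(get_list)):
        # readlines() keeps the trailing newline, so strip it
        info_list.append(line.rstrip('\n'))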

Now info_list will be:

['From: root@a1.local.tld', 'Subject: ooooooooooooooooooooooo', 'To: ooo@a1.local.tld']

I don't see a "Body:" line in your raw content.
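Scanning lines by prefix can go wrong when a header is folded across several lines, or when a body line happens to start with `To:`. As a hedged alternative, the standard library's `HeaderParser` reads only the header section. In this sketch `io.StringIO` stands in for `open('in.txt')`, and the folded subject and the body line are made-up examples.

```python
import io
from email import policy
from email.parser import HeaderParser

# Stand-in for open('in.txt'); the folded Subject and the body line are invented.
raw_file = io.StringIO(
    "From: root@a1.local.tld\n"
    "Subject: a long subject that the sender\n"
    " folded onto a second line\n"
    "To: ooo@a1.local.tld\n"
    "\n"
    "To: this line is body text, not a header\n"
)

# HeaderParser stops at the first blank line and leaves the rest as the payload.
headers = HeaderParser(policy=policy.default).parse(raw_file)

info = {name: headers[name] for name in ("From", "To", "Subject")}
print(info)
```

With `policy.default` the folded subject comes back as a single line, and the `To:` line in the body is ignored.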
