Extracting the body of an email from an mbox file and decoding it to plain text, regardless of charset and Content-Transfer-Encoding

16

I am trying to use Python 3 to extract the body of email messages from a Thunderbird mbox file. It is an IMAP account.

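I would like to have the text part of the email body available for processing as a Unicode string. It should "look like" the email does in Thunderbird, without escape sequences such as \r\n, =20, etc.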

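I think the part I do not know how to handle is decoding or removing the Content-Transfer-Encoding. I receive emails with a variety of Content-Types and Content-Transfer-Encodings. This is my current attempt: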

import mailbox
import quopri,base64

def myconvert(encoded,ContentTransferEncoding):
    if ContentTransferEncoding == 'quoted-printable':
        result = quopri.decodestring(encoded)
    elif ContentTransferEncoding == 'base64':
        result = base64.b64decode(encoded)

mboxfile = 'C:/Users/Username/Documents/Thunderbird/Data/profile/ImapMail/server.name/INBOX'

for msg in mailbox.mbox(mboxfile):
    if msg.is_multipart():    #Walk through the parts of the email to find the text body.
        for part in msg.walk():
            if part.is_multipart(): # If part is multipart, walk through the subparts.
                for subpart in part.walk():
                    if subpart.get_content_type() == 'text/plain':
                        body = subpart.get_payload() # Get the subpart payload (i.e the message body)
                    for k,v in subpart.items():
                            if k == 'Content-Transfer-Encoding':
                                cte = v             # Keep the Content Transfer Encoding
            elif subpart.get_content_type() == 'text/plain':
                body = part.get_payload()           # part isn't multipart Get the payload
                for k,v in part.items():
                    if k == 'Content-Transfer-Encoding':
                        cte = v                      # Keep the Content Transfer Encoding

print(body)
print('Body is of type:',type(body))
body = myconvert(body,cte)
print(body)

But this gives the following error:
Body is of type: <class 'str'>
Traceback (most recent call last):
File "C:/Users/David/Documents/Python/test2.py", line 31, in <module>
  body = myconvert(body,cte)
File "C:/Users/David/Documents/Python/test2.py", line 6, in myconvert
  result = quopri.decodestring(encoded)
File "C:\Python32\lib\quopri.py", line 164, in decodestring
  return a2b_qp(s, header=header)
TypeError: 'str' does not support the buffer interface

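That's odd. get_payload() should return bytes, but under Python 3 it returns str unless you pass decode=True, which you don't. - Lennart Regebro
I have just tried it with decode=True, which does return bytes, so there is no error. It looks like the decoding is done, and all I need to do now is convert the bytes to a string. I have not yet tested this on emails with a variety of content encodings, though. - DCB
Huh, that looks like a bug; it should be the other way around: decode=True should return str and decode=False should return bytes. :-) - Lennart Regebro
That was helpful, thank you. I realise I have spent a lot of time on this because I did not spend enough time understanding some of the basics. It now looks like I need to get the charset and then use body.decode(charset). This works for most emails, but on some I get an AttributeError: I think this is due to characters in the email from another charset. - DCB
I found this: "Most non-multipart type messages are parsed as a single message object with a string payload. These objects will return False for is_multipart(). Their get_payload() method will return a string object. All multipart type messages will be parsed as a container message object with a list of sub-message objects for their payload. The outer container message will return True for is_multipart() and their get_payload() method will return the list of Message subparts." From the Python docs. - DCB
That's most likely a documentation bug. - Lennart Regebro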
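
For illustration, the behaviour discussed in the comments above can be sketched as follows. The message bytes are a made-up example, not taken from the question, and the "ascii" fallback is an assumption. In current Python 3 releases, get_payload(decode=True) is documented to return bytes, so a separate charset decode is still needed to get a str.

```python
import email

# A hypothetical single-part message: quoted-printable body in ISO-8859-1.
raw = (
    b"Content-Type: text/plain; charset=iso-8859-1\r\n"
    b"Content-Transfer-Encoding: quoted-printable\r\n"
    b"\r\n"
    b"caf=E9 au lait=20\r\n"
)
msg = email.message_from_bytes(raw)

# Without decode=True the payload is a str that still contains =E9 and =20.
encoded = msg.get_payload()

# With decode=True the Content-Transfer-Encoding is undone, yielding bytes.
data = msg.get_payload(decode=True)

# The charset comes from the Content-Type header; "ascii" is an assumed fallback.
charset = msg.get_content_charset() or "ascii"
text = data.decode(charset, errors="replace")

print(type(encoded), type(data), type(text))
```

This is only a sketch of the two steps (transfer decoding, then charset decoding); real mail may declare a charset that does not match its bytes, which is why errors="replace" is used here.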
2 Answers

23

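Here is some code that does the job. For messages where decoding would fail, it prints an error instead of crashing. I hope it is useful. Note that if this is a bug in Python 3 and it gets fixed, then .get_payload(decode=True) may return a str object instead of a bytes object. I ran this code today on Python 2.7.2 and on Python 3.2.1.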

import mailbox

def getcharsets(msg):
    # Collect every charset declared anywhere in the message, skipping parts with none.
    charsets = set()
    for c in msg.get_charsets():
        if c is not None:
            charsets.add(c)
    return charsets

def handleerror(errmsg, emailmsg,cs):
    print()
    print(errmsg)
    print("This error occurred while decoding with ",cs," charset.")
    print("These charsets were found in the one email.",getcharsets(emailmsg))
    print("This is the subject:",emailmsg['subject'])
    print("This is the sender:",emailmsg['From'])

def getbodyfromemail(msg):
    body = None
    #Walk through the parts of the email to find the text body.    
    if msg.is_multipart():    
        for part in msg.walk():

            # If part is multipart, walk through the subparts.            
            if part.is_multipart(): 

                for subpart in part.walk():
                    if subpart.get_content_type() == 'text/plain':
                        # Get the subpart payload (i.e the message body)
                        body = subpart.get_payload(decode=True) 
                        #charset = subpart.get_charset()

            # Part isn't multipart so get the email body
            elif part.get_content_type() == 'text/plain':
                body = part.get_payload(decode=True)
                #charset = part.get_charset()

    # If this isn't a multi-part message then get the payload (i.e the message body)
    elif msg.get_content_type() == 'text/plain':
        body = msg.get_payload(decode=True) 

    # No checking done to match the charset with the correct part.
    for charset in getcharsets(msg):
        try:
            body = body.decode(charset)
        except UnicodeDecodeError:
            handleerror("UnicodeDecodeError: encountered.",msg,charset)
        except AttributeError:
            handleerror("AttributeError: encountered" ,msg,charset)
    return body


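# Path to the Thunderbird mbox file; adjust it to your own profile.
mboxfile = 'C:/Users/Username/Documents/Thunderbird/Data/profile/ImapMail/server.name/INBOX'
print(mboxfile)
for thisemail in mailbox.mbox(mboxfile):
    body = getbodyfromemail(thisemail)
    if body is not None:  # None means no text/plain part was found
        print(body[0:1000])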
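
The code above notes that it does not match each charset to the part it belongs to. One possible way to do that is sketched below. It is not part of the answer: the function name get_plain_text, the UTF-8 fallback, and the demo message are assumptions made for illustration. It relies on walk() already visiting nested parts recursively and on get_content_charset() reporting the charset of the individual part.

```python
from email import message_from_bytes
from email.mime.multipart import MIMEMultipart
from email.mime.text import MIMEText

def get_plain_text(msg):
    """Return the decoded text of every text/plain part, joined by newlines."""
    chunks = []
    for part in msg.walk():  # walk() yields the message and all nested parts
        if part.get_content_type() != 'text/plain':
            continue
        data = part.get_payload(decode=True)  # bytes, transfer encoding removed
        if data is None:
            continue
        # Use the charset declared on this part; UTF-8 is an assumed fallback.
        charset = part.get_content_charset() or 'utf-8'
        try:
            chunks.append(data.decode(charset, errors='replace'))
        except LookupError:
            # The header named a charset Python does not know.
            chunks.append(data.decode('utf-8', errors='replace'))
    return '\n'.join(chunks)

# Hypothetical demo message with a plain and an HTML alternative.
demo = MIMEMultipart('alternative')
demo.attach(MIMEText('h\u00e9llo', 'plain', 'iso-8859-1'))
demo.attach(MIMEText('<p>h\u00e9llo</p>', 'html', 'utf-8'))
parsed = message_from_bytes(demo.as_bytes())
print(get_plain_text(parsed))
```

With an mbox file, the same function could be called on each message yielded by mailbox.mbox(mboxfile). Using errors='replace' trades exact fidelity for never raising UnicodeDecodeError, which may or may not suit your processing.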

4

This script seems to return all messages correctly:

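def getcharsets(msg):
    # Collect every charset declared in the message, skipping parts with none.
    charsets = set()
    for c in msg.get_charsets():
        if c is not None:
            charsets.add(c)
    return charsets

def getBody(msg):
    # Descend into the first subpart until a non-multipart part is reached.
    while msg.is_multipart():
        msg = msg.get_payload()[0]
    # decode=True undoes the Content-Transfer-Encoding and returns bytes.
    t = msg.get_payload(decode=True)
    # A leaf part declares at most one charset; t stays bytes if none is declared.
    for charset in getcharsets(msg):
        t = t.decode(charset)
    return t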

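The previous answer from acd often returns only some footer of the real message (at least in the GMANE email messages I am opening for this toolbox: https://pypi.python.org/pypi/gmane).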
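
As a possible alternative to picking the first subpart by hand, newer Python versions (3.6 and later, which is an assumption about your environment and not what this thread was written against) offer email.policy.default, under which get_body() selects a preferred body part and get_content() returns it as a str with both the transfer encoding and the charset handled. The helper name get_body_text and the demo message below are made up for illustration.

```python
import email
from email import policy
from email.message import EmailMessage

def get_body_text(raw_bytes):
    """Parse raw message bytes and return the text/plain body as str, or None."""
    msg = email.message_from_bytes(raw_bytes, policy=policy.default)
    # get_body() searches the MIME tree for the best candidate body part.
    part = msg.get_body(preferencelist=('plain',))
    if part is None:
        return None  # the message has no text/plain body
    # get_content() decodes the transfer encoding and the charset.
    return part.get_content()

# Hypothetical demo message with plain and HTML alternatives.
demo = EmailMessage()
demo['Subject'] = 'demo'
demo.set_content('plain h\u00e9llo')
demo.add_alternative('<p>html h\u00e9llo</p>', subtype='html')
print(get_body_text(demo.as_bytes()))
```

For an mbox file, each message from mailbox.mbox(...) could be passed in as thisemail.as_bytes(). Messages with an unknown charset name or badly malformed headers may still raise, so wrapping the call in a try/except is probably sensible for a large archive.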

Cheers

