Decoding an email subject in C++

4
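I downloaded mail using Poco/Net/POP3ClientSession, and I need to convert the message subject into a human-readable form, so I tried the solution that neagoegab posted here: https://dev59.com/wF3Ua4cB1Zd3GeqP9ym0#8104496. Unfortunately, it did not work.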
#include <Poco/Net/POP3ClientSession.h>
#include <Poco/Net/MailMessage.h>
#include <iostream>
#include <string>
using namespace std;
using namespace Poco::Net;


#include <cstring>   // strlen
#include <iconv.h>

const size_t BUF_SIZE=1024;


class IConv {
    iconv_t ic_;
public:
    IConv(const char* to, const char* from)
        : ic_(iconv_open(to,from))    { }
    ~IConv() { iconv_close(ic_); }

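     bool convert(char* input, char* output, size_t& out_size) {
        size_t inbufsize = strlen(input)+1;
        // iconv returns (size_t)-1 on failure, so compare explicitly instead of converting to bool
        return iconv(ic_, &input, &inbufsize, &output, &out_size) != static_cast<size_t>(-1);
     }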
};


int main()
{
    POP3ClientSession session("poczta.o2.pl");
    session.login("my mail", "my password");

    POP3ClientSession::MessageInfoVec messages;
    session.listMessages(messages);
    cout << "id: " << messages[0].id << " size: " << messages[0].size << endl;

    MailMessage message;
    session.retrieveMessage(messages[0].id, message);
    const string subject = message.getSubject();


    cout << "Original subject: " << subject << endl;

    IConv iconv_("UTF8","ISO-8859-2");


    char from[BUF_SIZE] = {};// "=?ISO-8859-2?Q?Re: M=F3j sen o JP II?=";
    subject.copy(from, sizeof(from) - 1);// string::copy does not append '\0', so keep the last byte zero
    char to[BUF_SIZE] = "bye";
    size_t outsize = BUF_SIZE;//you will need it

    iconv_.convert(from, to, outsize);
    cout << "converted: " << to << endl;
}

The output is:
id: 1 size: 2792
Original subject: =?ISO-8859-2?Q?Re: M=F3j sen o JP II?=
converted: =?ISO-8859-2?Q?Re: M=F3j sen o JP II?=

Interestingly, when I try to convert the subject with POCO, it fails:
cout << "Encoded with POCO: " << MailMessage::encodeWord("Re: Mój sen o JP II", "ISO-8859-2") << endl; // output: Encoded with POCO: =?ISO-8859-2?q?Re=3A_M=C3=B3j_sen_o_JP_II?=

But the subject I want to get is: "Re: Mój sen o JP II".

The only approach I found that works is in Python:

https://docs.python.org/2/library/email.header.html#email.header.decode_header

So my question is: how do I convert an email subject to something like UTF-8 in C++?


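Find the relevant RFC and write the code. As I recall, mail and NNTP messages use slightly different conventions. - Cheers and hth. - Alf
Before writing any code yourself, check whether someone has already done the work for you. For established RFCs in particular, many implementations already exist. - Roland Illig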
1
I just filed https://github.com/pocoproject/poco/issues/1543. - Roland Illig
Technically, those spaces do not conform to the encoded-word specification; however, any real library should be able to handle them. - Max
1
The issue was fixed in November 2017. You should update to 1.9.0 and make your code simpler and easier to understand. - Roland Illig
3 Answers

4
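The RFC relevant to your case is RFC 2047, which specifies how non-ASCII data is encoded in mail message headers. The gist of its "Q" encoding is that every byte other than printable ASCII is escaped as an "=" character followed by two hexadecimal digits. Because "ó" is represented by the byte 0xF3 in ISO-8859-2, and 0xF3 is not a printable ASCII character, it is encoded as "=F3". You need to decode all the encoded characters in the message.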
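
As a rough illustration of the escaping described above, here is a minimal sketch of undoing it for the encoded-text part of a "Q" encoded-word. The names `hexValue` and `decodeQ` are made up for this example and are not part of POCO or any other library. The sketch also maps "_" to a space, which RFC 2047 defines for the "Q" encoding, and it only recovers the raw bytes in the original charset; converting those bytes to UTF-8 is a separate step.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: value of one hexadecimal digit, or -1 if `c` is not one.
static int hexValue(char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    if (c >= 'A' && c <= 'F') return c - 'A' + 10;
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

// Hypothetical helper: turns the encoded-text of a "Q" encoded-word into raw
// bytes in the original charset. No charset conversion is performed here.
std::string decodeQ(const std::string& encodedText)
{
    std::string bytes;
    for (std::size_t i = 0; i < encodedText.size(); ++i)
    {
        const char c = encodedText[i];
        if (c == '_')
        {
            bytes += ' ';                       // "_" stands for a space in "Q"
            continue;
        }
        if (c == '=' && i + 2 < encodedText.size())
        {
            const int high = hexValue(encodedText[i + 1]);
            const int low  = hexValue(encodedText[i + 2]);
            if (high >= 0 && low >= 0)
            {
                bytes += static_cast<char>(high * 16 + low);
                i += 2;                         // skip the two hex digits
                continue;
            }
        }
        bytes += c;                             // ordinary character, or a malformed escape kept as-is
    }
    return bytes;
}
```

For the subject in the question, `decodeQ("Re: M=F3j sen o JP II")` yields the same text with the single byte 0xF3 in place of "=F3", which is "ó" in ISO-8859-2.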

0

I found a solution to the problem (I am not sure it is 100% correct), but it looks like using Poco::UTF8Encoding::convert to convert from the character code to UTF-8 is enough:

#include <Poco/Net/POP3ClientSession.h>
#include <Poco/Net/MessageHeader.h>
#include <Poco/Net/MailMessage.h>
#include <Poco/UTF8Encoding.h>
#include <cstring>
#include <iostream>
#include <stdexcept>
#include <string>

using namespace std;
using namespace Poco::Net;

class EncoderLatin2
{
public:
    EncoderLatin2(const string& encodedSubject)
    {
        ///    encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
        int charsetBeginPosition = strlen("=?");
        int charsetEndPosition = encodedSubject.find("?", charsetBeginPosition);
        charset = encodedSubject.substr(charsetBeginPosition, charsetEndPosition-charsetBeginPosition);

        int encodingPosition = charsetEndPosition + strlen("?");
        encoding = encodedSubject[encodingPosition];

        if ("ISO-8859-2" != charset)
            throw std::invalid_argument("Invalid encoding!");

        const int lengthOfEncodedText = encodedSubject.length() - encodingPosition-strlen("?=")-2;
        extractedEncodedSubjectToConvert = encodedSubject.substr(encodingPosition+2, lengthOfEncodedText);
    }

    string convert()
    {
        size_t positionOfAssignment = -1;

        while (true)
        {
            positionOfAssignment = extractedEncodedSubjectToConvert.find('=', positionOfAssignment+1);
            if (string::npos != positionOfAssignment)
            {
                const string& charHexCode = extractedEncodedSubjectToConvert.substr(positionOfAssignment + 1, 2);
                replaceAllSubstringsWithUnicode(extractedEncodedSubjectToConvert, charHexCode);
            }
            else
                break;
        }
        return extractedEncodedSubjectToConvert;
    }

    void replaceAllSubstringsWithUnicode(string& s, const string& charHexCode)
    {
        const int charCode = stoi(charHexCode, nullptr, 16);

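        // Note: charCode is an ISO-8859-2 byte value, while UTF8Encoding::convert expects a
        // Unicode code point. The two coincide for "ó" (0xF3) but not for every Latin-2 letter.
        char buffer[10] = {};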
        encodingConverter.convert(charCode, (unsigned char*)buffer, sizeof(buffer));
        replaceAll(s, '=' + charHexCode, buffer);
    }

    void replaceAll(string& s, const string& replaceFrom, const string& replaceTo)
    {
        size_t needlePosition = -1;
        while (true)
        {
            needlePosition = s.find(replaceFrom, needlePosition + 1);
            if (string::npos == needlePosition)
                break;

            s.replace(needlePosition, replaceFrom.length(), replaceTo);
        }
    }


private:
    string charset;
    char encoding;
    Poco::UTF8Encoding encodingConverter;

    string extractedEncodedSubjectToConvert;
};

int main()
{
    POP3ClientSession session("poczta.o2.pl");
    session.login("my mail", "my password");


    POP3ClientSession::MessageInfoVec messages;
    session.listMessages(messages);

    MessageHeader header;
    MailMessage message;

    auto currentMessage = messages[0];

    session.retrieveHeader(currentMessage.id, header);
    session.retrieveMessage(currentMessage.id, message);

    const string subject = message.getSubject();

    EncoderLatin2 encoder(subject);
    cout << "Original subject: " << subject << endl;
    cout << "Decoded: " << encoder.convert() << endl;
}
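
To make the code-point-to-UTF-8 step used above more concrete, here is a small standalone sketch of what such a conversion does. The function name `encodeUtf8` is invented for this example; it is not POCO's implementation, only the standard UTF-8 bit layout, and it assumes the input is a valid Unicode code point.

```cpp
#include <string>

// Hypothetical helper: encodes one Unicode code point as UTF-8 bytes.
// Assumes `codePoint` is a valid code point (at most 0x10FFFF).
std::string encodeUtf8(unsigned int codePoint)
{
    std::string out;
    if (codePoint < 0x80)
    {
        out += static_cast<char>(codePoint);                         // 1 byte: plain ASCII
    }
    else if (codePoint < 0x800)
    {
        out += static_cast<char>(0xC0 | (codePoint >> 6));           // 2 bytes
        out += static_cast<char>(0x80 | (codePoint & 0x3F));
    }
    else if (codePoint < 0x10000)
    {
        out += static_cast<char>(0xE0 | (codePoint >> 12));          // 3 bytes
        out += static_cast<char>(0x80 | ((codePoint >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (codePoint & 0x3F));
    }
    else
    {
        out += static_cast<char>(0xF0 | (codePoint >> 18));          // 4 bytes
        out += static_cast<char>(0x80 | ((codePoint >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((codePoint >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (codePoint & 0x3F));
    }
    return out;
}
```

With this, `encodeUtf8(0xF3)` gives the two bytes 0xC3 0xB3, the same "=C3=B3" that appears in the POCO output in the question. This works for "ó" because its ISO-8859-2 byte value equals its Unicode code point. A letter such as "ą" is the byte 0xB1 in ISO-8859-2 but the code point U+0105, so it would need a Latin-2-to-Unicode lookup first.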

-1

I found a better solution than the previous one. I discovered that email subjects can be encoded in different ways:

  • Latin-2, encoded like this: =?ISO-8859-2?Q?...?=
  • UTF-8 Base64, encoded like this: =?utf-8?B?Wm9iYWN6Y2llIGNvIGRsYSBXYXMgcHJ6eWdvdG93YWxpxZtteSAvIHN0eWN6ZcWEIHcgTGFzZXJwYXJrdQ==?=
  • UTF-8 quoted-printable, encoded like this: =?utf-8?Q?...?=
  • No encoding (if there are only ASCII characters): ...
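
All the encoded forms in the list share the shape "=?" charset "?" encoding "?" encoded-text "?=". As a hedged sketch, assuming the whole string is exactly one encoded-word, it could be split into its three parts as follows. The struct `EncodedWord` and the function `parseEncodedWord` are names made up for this illustration, not POCO API.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical container for the three parts of one encoded-word.
struct EncodedWord
{
    std::string charset;   // e.g. "ISO-8859-2" or "utf-8"
    char encoding = '\0';  // 'B' or 'Q', normalised to upper case
    std::string text;      // still encoded; decode it according to `encoding`
};

// Hypothetical helper: returns true and fills `out` if `word` has the shape
// "=?" charset "?" encoding "?" encoded-text "?=", and false otherwise.
bool parseEncodedWord(const std::string& word, EncodedWord& out)
{
    const std::size_t size = word.size();
    if (size < 8 || word.compare(0, 2, "=?") != 0 || word.compare(size - 2, 2, "?=") != 0)
        return false;

    const std::size_t charsetEnd = word.find('?', 2);
    if (charsetEnd == std::string::npos || charsetEnd == 2)
        return false;                                 // missing or empty charset

    const std::size_t textBegin = charsetEnd + 3;     // skip "?", the encoding letter, "?"
    if (textBegin > size - 2 || word[charsetEnd + 2] != '?')
        return false;

    const char encoding = static_cast<char>(
        std::toupper(static_cast<unsigned char>(word[charsetEnd + 1])));
    if (encoding != 'B' && encoding != 'Q')
        return false;

    out.charset = word.substr(2, charsetEnd - 2);
    out.encoding = encoding;
    out.text = word.substr(textBegin, size - 2 - textBegin);
    return true;
}
```

A plain ASCII subject makes the function return false, which corresponds to the last case in the list: the text is used as-is.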

So, using POCO (Base64Decoder, Latin2Encoding, UTF8Encoding, QuotedPrintableDecoder), I managed to convert all of these cases:

#include <cctype>
#include <cstring>
#include <iostream>
#include <string>
#include <sstream>

#include <Poco/Net/POP3ClientSession.h>
#include <Poco/Net/MessageHeader.h>
#include <Poco/Net/MailMessage.h>
#include <Poco/Base64Decoder.h>
#include <Poco/Latin2Encoding.h>
#include <Poco/UTF8Encoding.h>
#include <Poco/Net/QuotedPrintableDecoder.h>

using namespace std;

class Encoder
{
public:
    Encoder(const string& encodedText)
    {
        isStringEncoded = isEncoded(encodedText);
        if (!isStringEncoded)
        {
            extractedEncodedSubjectToConvert = encodedText;
            return;
        }

        splitEncodedText(encodedText);
    }

    string convert()
    {
        if (isStringEncoded)
        {
            if (Poco::Latin2Encoding().isA(charset))
                return decodeFromLatin2();
            if (Poco::UTF8Encoding().isA(charset))
                return decodeFromUtf8();
        }

        return extractedEncodedSubjectToConvert;
    }

private:
    void splitEncodedText(const string& encodedText)
    {
        ///    encoded-word = "=?" charset "?" encoding "?" encoded-text "?="
        const int charsetBeginPosition = strlen(sequenceBeginEncodedText);
        const int charsetEndPosition = encodedText.find("?", charsetBeginPosition);
        charset = encodedText.substr(charsetBeginPosition, charsetEndPosition-charsetBeginPosition);

        const int encodingPosition = charsetEndPosition + strlen("?");
        encoding = encodedText[encodingPosition];

        const int lengthOfEncodedText = encodedText.length() - encodingPosition-strlen(sequenceBeginEncodedText)-strlen(sequenceEndEncodedText);
        extractedEncodedSubjectToConvert = encodedText.substr(encodingPosition+2, lengthOfEncodedText);
    }

    bool isEncoded(const string& encodedSubject)
    {
        if (encodedSubject.size() < 4)
            return false;

        if (0 != encodedSubject.find(sequenceBeginEncodedText))
            return false;

        const unsigned positionOfLastTwoCharacters = encodedSubject.size() - strlen(sequenceEndEncodedText);
        return positionOfLastTwoCharacters == encodedSubject.rfind(sequenceEndEncodedText);
    }

    string decodeFromLatin2()
    {
        size_t positionOfAssignment = -1;
        while (true)
        {
            positionOfAssignment = extractedEncodedSubjectToConvert.find('=', positionOfAssignment+1);
            if (string::npos != positionOfAssignment)
            {
                const string& charHexCode = extractedEncodedSubjectToConvert.substr(positionOfAssignment + 1, 2);
                replaceAllSubstringsWithUnicode(extractedEncodedSubjectToConvert, charHexCode);
            }
            else
                break;
        }
        return extractedEncodedSubjectToConvert;
    }

    void replaceAllSubstringsWithUnicode(string& s, const string& charHexCode)
    {
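        static Poco::Latin2Encoding latin2Encoding;
        static Poco::UTF8Encoding encodingConverter;
        // The hex digits are an ISO-8859-2 byte; map it to its Unicode code point before encoding it as UTF-8.
        const unsigned char latin2Byte = static_cast<unsigned char>(stoi(charHexCode, nullptr, 16));
        const int charCode = latin2Encoding.convert(&latin2Byte);

        char buffer[10] = {};
        encodingConverter.convert(charCode, (unsigned char*)buffer, sizeof(buffer));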
        replaceAll(s, '=' + charHexCode, buffer);
    }

    void replaceAll(string& s, const string& replaceFrom, const string& replaceTo)
    {
        size_t needlePosition = -1;
        while (true)
        {
            needlePosition = s.find(replaceFrom, needlePosition + 1);
            if (string::npos == needlePosition)
                break;

            s.replace(needlePosition, replaceFrom.length(), replaceTo);
        }
    }

    string decodeFromUtf8()
    {
        if('B' == toupper(encoding))
        {
            return decodeFromBase64();
        }
        else // if Q:
        {
            return decodeFromQuotedPrintable();
        }
    }

    string decodeFromBase64()
    {
        istringstream is(extractedEncodedSubjectToConvert);
        Poco::Base64Decoder e64(is);

        extractedEncodedSubjectToConvert.clear();
        string buffer;
        while(getline(e64, buffer))
            extractedEncodedSubjectToConvert += buffer;
        return extractedEncodedSubjectToConvert;
    }

    string decodeFromQuotedPrintable()
    {
        replaceAll(extractedEncodedSubjectToConvert, "_", " ");


        istringstream is(extractedEncodedSubjectToConvert);
        Poco::Net::QuotedPrintableDecoder qp(is);

        extractedEncodedSubjectToConvert.clear();
        string buffer;
        while(getline(qp, buffer))
            extractedEncodedSubjectToConvert += buffer;
        return extractedEncodedSubjectToConvert;
    }


private:
    string charset;
    char encoding;

    string extractedEncodedSubjectToConvert;
    bool isStringEncoded;

    static constexpr const char* sequenceBeginEncodedText = "=?";
    static constexpr const char* sequenceEndEncodedText   = "?=";
};

int main()
{
    Poco::Net::POP3ClientSession session("poczta.o2.pl");
    session.login("my mail", "my password");

    Poco::Net::POP3ClientSession::MessageInfoVec messages;
    session.listMessages(messages);

    Poco::Net::MessageHeader header;
    Poco::Net::MailMessage message;

    auto currentMessage = messages[0];

    session.retrieveHeader(currentMessage.id, header);
    session.retrieveMessage(currentMessage.id, message);    

    const string subject = message.getSubject();

    Encoder encoder(subject);
    cout << "Original subject: " << subject << endl;
    cout << "Decoded: " << encoder.convert() << endl;
}
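
For readers who want to see what the Base64 case amounts to without POCO, here is a minimal standard-library sketch. The names `base64Value` and `decodeBase64` are invented for this example, and the code is deliberately lenient: it silently skips padding and any character outside the Base64 alphabet rather than reporting an error.

```cpp
#include <string>

// Hypothetical helper: 6-bit value of a Base64 alphabet character, or -1.
static int base64Value(char c)
{
    if (c >= 'A' && c <= 'Z') return c - 'A';
    if (c >= 'a' && c <= 'z') return c - 'a' + 26;
    if (c >= '0' && c <= '9') return c - '0' + 52;
    if (c == '+') return 62;
    if (c == '/') return 63;
    return -1;
}

// Hypothetical helper: decodes the encoded-text of a "B" encoded-word into raw bytes.
std::string decodeBase64(const std::string& encodedText)
{
    std::string bytes;
    unsigned int accumulator = 0;   // collects bits from successive characters
    int bitCount = 0;               // number of not-yet-emitted bits in the accumulator
    for (char c : encodedText)
    {
        const int value = base64Value(c);
        if (value < 0)
            continue;               // skips '=' padding and anything outside the alphabet
        accumulator = (accumulator << 6) | static_cast<unsigned int>(value);
        bitCount += 6;
        if (bitCount >= 8)
        {
            bitCount -= 8;
            bytes += static_cast<char>((accumulator >> bitCount) & 0xFFu);
        }
    }
    return bytes;
}
```

Because the charset of that encoded-word is already utf-8, the decoded bytes need no further conversion. For example, the first four characters "Wm9i" of the Base64 sample above decode to "Zob".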

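Shouldn't this feature be built into the POCO library? Every email parser needs it, and needs it in the same way, so it makes no sense for every application to write the same code again. - Roland Illig
Indeed, there should be something built in that is easier to use. All I found was how to encode words of a mail message: https://pocoproject.org/docs/Poco.Net.MailMessage.html#22506, but no portable way to decode them. - baziorek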
