UTF-8 to UCS-2 conversion with the ICU library

Question

c++ unicode utf-8 icu ucs2

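I am currently using the ICU library to convert a UTF-8 string to a UCS-2 string and am running into problems. The library offers several ways of doing this, but so far none of them seems to work. Given how popular the library is, I assume I am doing something wrong.

First, the common code. In every case I create a string on an object and pass it along, but nothing manipulates it before it reaches the conversion step.

The UTF-8 string currently being used is just "ĩ".

For simplicity, the string is represented as uniString in this code.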

UErrorCode resultCode = U_ZERO_ERROR;

UConverter* m_pConv = ucnv_open("ISO-8859-1", &resultCode);

// Change the callback to error out instead of the default            
const void* oldContext;
UConverterFromUCallback oldFromAction;
UConverterToUCallback oldToAction;
ucnv_setFromUCallBack(m_pConv, UCNV_FROM_U_CALLBACK_STOP, NULL, &oldFromAction, &oldContext, &resultCode);
ucnv_setToUCallBack(m_pConv, UCNV_TO_U_CALLBACK_STOP, NULL, &oldToAction, &oldContext, &resultCode);

int32_t outputLength = 0;
int bodySize = uniString.length();
int targetSize = bodySize * 4;
char* target = new char[targetSize];                       

printf("Body: %s\n", uniString.c_str());
if (U_SUCCESS(resultCode))
{
    // outputLength = ucnv_convert("ISO-8859-1", "UTF-8", target, targetSize, uniString.c_str(), bodySize, &resultCode);
    outputLength = ucnv_fromAlgorithmic(m_pConv, UCNV_UTF8, target, targetSize, uniString.c_str(),
        uniString.length(), &resultCode);
    ucnv_close(m_pConv);
}
printf("ISO-8859-1 DGF just tried to convert '%s' to '%s' with error '%i' and length '%i'", uniString.c_str(), 
    outputLength ? target : "invalid_char", resultCode, outputLength);

if (resultCode == U_INVALID_CHAR_FOUND || resultCode == U_ILLEGAL_CHAR_FOUND || resultCode == U_TRUNCATED_CHAR_FOUND)
{
    if (resultCode == U_INVALID_CHAR_FOUND)
    {
        printf("Unmapped input character, cannot be converted to Latin1");                    

        m_pConv = ucnv_open("UCS-2", &resultCode);
        if (U_SUCCESS(resultCode))
        {
            // outputLength = ucnv_convert("UCS-2", "UTF-8", target, targetSize, uniString.c_str(), bodySize, &resultCode);
            outputLength = ucnv_fromAlgorithmic(m_pConv, UCNV_UTF8, target, targetSize, uniString.c_str(),
                uniString.length(), &resultCode);
            ucnv_close(m_pConv);
        }

        printf("UCS-2 DGF just tried to convert '%s' to '%s' with error '%i' and length '%i'", uniString.c_str(), 
            outputLength ? target : "invalid_char", resultCode, outputLength);

        if (U_SUCCESS(resultCode))
        {
            pdus = SegmentText(target, pText, SEGMENT_SIZE_UNICODE_MAX, true);
        }
    }
    else
    {
        printf("DecodeText(): Text contents does not appear to be valid UTF-8");
    }
}
else
{
    printf("DecodeText(): Text successfully converted to Latin1");
    std::string newBody(target, outputLength);
    pdus = SegmentText(newBody, pPdu, SEGMENT_SIZE_MAX);
}

The issue is that ucnv_fromAlgorithmic reports the error U_INVALID_CHAR_FOUND for the UCS-2 conversion. That makes sense for the ISO-8859-1 attempt, but not for UCS-2.
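The expectation above can be checked by hand without ICU. The following is a minimal sketch, assuming the test string is the two UTF-8 bytes C4 A9 (the encoding of "ĩ", U+0129). The helper names firstCodePoint, fitsLatin1 and fitsUcs2 are hypothetical and are not part of ICU. The decoder is simplified and does not reject overlong forms.

```cpp
#include <cstdint>
#include <string>

// Decode the first code point of a UTF-8 string (1- to 4-byte forms).
// Returns -1 for an empty, truncated or malformed sequence.
// Simplified sketch: overlong encodings are not rejected.
std::int32_t firstCodePoint(const std::string& s) {
    if (s.empty()) return -1;
    const unsigned char b0 = static_cast<unsigned char>(s[0]);
    int extra = 0;          // number of continuation bytes expected
    std::int32_t cp = 0;    // code point being assembled
    if (b0 < 0x80)                { extra = 0; cp = b0; }
    else if ((b0 & 0xE0) == 0xC0) { extra = 1; cp = b0 & 0x1F; }
    else if ((b0 & 0xF0) == 0xE0) { extra = 2; cp = b0 & 0x0F; }
    else if ((b0 & 0xF8) == 0xF0) { extra = 3; cp = b0 & 0x07; }
    else return -1;         // stray continuation byte or invalid lead byte
    if (s.size() < static_cast<std::size_t>(extra) + 1) return -1;
    for (int i = 1; i <= extra; ++i) {
        const unsigned char b = static_cast<unsigned char>(s[i]);
        if ((b & 0xC0) != 0x80) return -1;   // not a continuation byte
        cp = (cp << 6) | (b & 0x3F);
    }
    return cp;
}

// ISO-8859-1 covers exactly the code points U+0000..U+00FF.
bool fitsLatin1(std::int32_t cp) { return cp >= 0 && cp <= 0xFF; }

// UCS-2 covers the Basic Multilingual Plane, excluding surrogates.
bool fitsUcs2(std::int32_t cp) {
    return cp >= 0 && cp <= 0xFFFF && !(cp >= 0xD800 && cp <= 0xDFFF);
}
```

Calling firstCodePoint("\xC4\xA9") yields 0x129, for which fitsLatin1 is false and fitsUcs2 is true. So the assumption about the character itself looks sound: a Latin-1 failure is expected and a UCS-2 failure is not.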

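The other attempt used ucnv_convert, which you can see commented out. That function attempted the conversion, but did not fail on the ISO-8859-1 attempt as it should have.

So the question is: does anyone with experience of these functions see something incorrect, or is my assumption about how this character should convert wrong?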

- MumblesCrzy

@KevinPanko I have updated the question and what I am asking. Thanks. - MumblesCrzy

1 Answer

- Per Johansson · Answer 1

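You need to reset resultCode to U_ZERO_ERROR before calling ucnv_open the second time. Quoting the manual:

"ICU functions that take a reference (C++) or a pointer (C) to a UErrorCode first test if(U_FAILURE(errorCode)) { return immediately; }, so that in a chain of such functions the first one that sets an error code causes the following ones to not perform any operations."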
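The effect of that convention on the code in the question can be illustrated with a toy model. This is a sketch only: Status, failed, Converter, openConverter, convert and latin1ThenUcs2 are hypothetical stand-ins written for this illustration and are not ICU API. They only mimic the "return immediately on a prior failure" rule quoted above.

```cpp
#include <string>

// Hypothetical status codes; any value above kOk counts as a failure,
// loosely mirroring how U_FAILURE treats UErrorCode values.
enum Status { kOk = 0, kInvalidChar = 1, kBadArgument = 2 };

inline bool failed(Status s) { return s > kOk; }

// Toy converter handle: a target encoding name plus an "opened" flag.
struct Converter {
    std::string name;
    bool open;
};

// Mimics an open call: does nothing if status already holds a failure.
Converter openConverter(const std::string& name, Status& status) {
    if (failed(status)) return Converter{std::string(), false};
    return Converter{name, true};
}

// Mimics a conversion call for one code point; same early-return rule.
bool convert(const Converter& conv, long codePoint, Status& status) {
    if (failed(status)) return false;              // stale error: no work done
    if (!conv.open) { status = kBadArgument; return false; }
    const long limit = (conv.name == "ISO-8859-1") ? 0xFF : 0xFFFF;
    if (codePoint > limit) { status = kInvalidChar; return false; }
    return true;
}

// The flow from the question: try Latin-1 first, then fall back to UCS-2.
// resetBeforeRetry toggles the fix suggested in the answer.
Status latin1ThenUcs2(long codePoint, bool resetBeforeRetry) {
    Status status = kOk;
    Converter latin1 = openConverter("ISO-8859-1", status);
    if (convert(latin1, codePoint, status)) return status;
    if (resetBeforeRetry) status = kOk;            // the missing reset
    Converter ucs2 = openConverter("UCS-2", status);
    convert(ucs2, codePoint, status);
    return status;
}
```

In this model, latin1ThenUcs2(0x129, false) returns kInvalidChar even though the UCS-2 step never ran, while latin1ThenUcs2(0x129, true) returns kOk. By analogy, the U_INVALID_CHAR_FOUND printed after the UCS-2 attempt in the question would be the stale value left over from the ISO-8859-1 attempt, and assigning resultCode = U_ZERO_ERROR before the second ucnv_open is the corresponding fix.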