How can I get the name of a Unicode character?

3
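I remember seeing, a long time ago, a way to use a Win32 API call to get a string containing the name of a Unicode character. I'm using C++ Builder, so if the VCL library supports this, that would work too.
For example, GetUnicodeName(U+0021) would return a string (or fill a struct, or similar) such as "EXCLAMATION MARK".
Alternatively, is there any other way to get the same result on Windows using C or C++?
The worst case would be a huge lookup table holding the names I'm interested in (mostly Latin characters).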

1
The code at https://dev59.com/g3I95IYBdhLWcg3w7SpZ is for C#, but the same approach should work here. - Alan Birtles
4
There is an undocumented function, GetUName(int c, LPWSTR* name), exported from \windows\system32\getuname.dll. Otherwise, https://www.unicode.org/Public/14.0.0/ucd/UnicodeData-14.0.0d1.txt is only about 1.8 MB and contains roughly 34,000 lines. - Simon Mourier
1 Answer

0

You can use the undocumented GetUName function exported by getuname.dll:

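// Requires <windows.h>, <array>, and <string>.
// utf8::narrow comes from a third-party UTF-8 helper library that is not named here.
std::string GetUnicodeCharacterName(wchar_t character)
{
    // https://github.com/reactos/reactos/tree/master/dll/win32/getuname
    typedef int(WINAPI* GetUNameFunc)(WORD wCharCode, LPWSTR lpBuf);

    // Resolve the export once; return nullptr if the DLL cannot be loaded.
    static const GetUNameFunc pfnGetUName = []() -> GetUNameFunc {
        HMODULE module = ::LoadLibraryA("getuname.dll");
        return module ? reinterpret_cast<GetUNameFunc>(::GetProcAddress(module, "GetUName")) : nullptr;
    }();

    if (!pfnGetUName)
        return {};

    std::array<WCHAR, 256> buffer{}; // zero-initialized
    int length = pfnGetUName(character, buffer.data());
    if (length <= 0)
        return {};

    return utf8::narrow(buffer.data(), length);
}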

// Replace invisible code point with code point that is visible
wchar_t ReplaceInvisible(wchar_t character)
{
    if (!std::iswgraph(character))
    {
        if (character <= 0x21)
            character += 0x2400; // U+2400 Control Pictures https://www.unicode.org/charts/PDF/U2400.pdf
        else
            character = 0xFFFD; // REPLACEMENT CHARACTER
    }

    return character;
}

// Accepts in UTF-8.
// Returns UTF-8 string like this:
// q <U+71 Latin Small Letter Q>
// п <U+43F Cyrillic Small Letter Pe>
// ␈ <U+8 Backspace>
// 𐌸 <U+10338 Supplementary Multilingual Plane>
// 🚒 <U+1F692 Supplementary Multilingual Plane>
std::string GetUnicodeCharacterNames(std::string string)
{
    // UTF-8 <=> UTF-32 converter
    std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> utf32conv;

    // UTF-8 to UTF-32
    std::u32string utf32string = utf32conv.from_bytes(string);

    std::string characterNames;
    characterNames.reserve(35 * utf32string.size());

    for (const char32_t& codePoint : utf32string)
    {
        if (!characterNames.empty())
            characterNames.append(", ");

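        // The BMP is U+0000..U+FFFF inclusive, so U+FFFF must also take the GetUName path.
        const bool isBmp = (codePoint <= 0xFFFF);
        char32_t visibleCodePoint = isBmp ? ReplaceInvisible(static_cast<wchar_t>(codePoint)) : codePoint;
        // GetUName cannot name anything outside the BMP; every such code point gets this fixed label.
        std::string charName = isBmp ? GetUnicodeCharacterName(static_cast<wchar_t>(codePoint)) : "Supplementary Multilingual Plane";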

        // UTF-32 to UTF-8
        std::string utf8codePoint = utf32conv.to_bytes(&visibleCodePoint, &visibleCodePoint + 1);
        characterNames.append(fmt::format("{} <U+{:X} {}>", utf8codePoint, static_cast<uint32_t>(codePoint), charName));
    }

    return characterNames;
}
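
std::wstring_convert and std::codecvt_utf8, used above, have been deprecated since C++17. The following is a minimal sketch of a hand-written UTF-8 decoder that could stand in for the from_bytes call. It assumes malformed input should be mapped to U+FFFD rather than raising an error; DecodeUtf8 is a hypothetical helper name and is not part of the original answer.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: decodes UTF-8 into UTF-32 code points.
// Each malformed byte is replaced by U+FFFD and decoding resumes at the next byte.
std::u32string DecodeUtf8(const std::string& input)
{
    std::u32string output;
    std::size_t i = 0;

    while (i < input.size())
    {
        const unsigned char lead = static_cast<unsigned char>(input[i]);
        char32_t codePoint = 0;
        std::size_t length = 0;
        char32_t minimum = 0; // smallest value allowed for this length (rejects overlong forms)

        if (lead < 0x80)                { codePoint = lead;        length = 1; minimum = 0; }
        else if ((lead & 0xE0) == 0xC0) { codePoint = lead & 0x1F; length = 2; minimum = 0x80; }
        else if ((lead & 0xF0) == 0xE0) { codePoint = lead & 0x0F; length = 3; minimum = 0x800; }
        else if ((lead & 0xF8) == 0xF0) { codePoint = lead & 0x07; length = 4; minimum = 0x10000; }
        else
        {
            // Stray continuation byte or invalid lead byte.
            output.push_back(0xFFFD);
            ++i;
            continue;
        }

        bool valid = (i + length <= input.size());
        for (std::size_t k = 1; valid && k < length; ++k)
        {
            const unsigned char next = static_cast<unsigned char>(input[i + k]);
            if ((next & 0xC0) != 0x80)
                valid = false;
            else
                codePoint = (codePoint << 6) | (next & 0x3F);
        }

        // Reject overlong encodings, surrogates, and values beyond U+10FFFF.
        if (valid && (codePoint < minimum || codePoint > 0x10FFFF ||
                      (codePoint >= 0xD800 && codePoint <= 0xDFFF)))
            valid = false;

        if (valid)
        {
            output.push_back(codePoint);
            i += length;
        }
        else
        {
            output.push_back(0xFFFD);
            ++i;
        }
    }

    return output;
}
```

With this in place, the line that calls utf32conv.from_bytes(string) could become DecodeUtf8(string); the reverse direction (UTF-32 back to UTF-8) would still need its own replacement.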

The drawback is that GetUName only covers characters in the Unicode Basic Multilingual Plane (BMP).

Update: since the Fall Creators Update (version 1709, build 16299), you can use the u_charName() ICU API that ships with Windows:

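// Requires <windows.h>, <array>, <cstdint>, and <string>.
std::string GetUCharNameWrapper(char32_t codePoint)
{
    typedef int32_t(*u_charNameFunc)(char32_t code, int nameChoice, char* buffer, int32_t bufferLength, int* pErrorCode);

    // Resolve the export once; return nullptr if the DLL cannot be loaded.
    static const u_charNameFunc pfnU_charName = []() -> u_charNameFunc {
        HMODULE module = ::LoadLibraryA("icuuc.dll");
        return module ? reinterpret_cast<u_charNameFunc>(::GetProcAddress(module, "u_charName")) : nullptr;
    }();

    if (!pfnU_charName)
        return {};

    int errorCode = 0; // U_ZERO_ERROR
    std::array<char, 512> buffer{};
    int32_t length = pfnU_charName(codePoint, 0/*U_UNICODE_CHAR_NAME*/, buffer.data(), static_cast<int32_t>(buffer.size() - 1), &errorCode);

    // ICU failures are positive codes; negative codes are warnings (same test as U_FAILURE).
    if (errorCode > 0 || length <= 0)
        return {};

    return std::string(buffer.data(), length);
}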
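
If neither DLL is available, the lookup table mentioned in the question can be generated rather than typed by hand. Below is a rough sketch, assuming a local copy of UnicodeData.txt (the file linked in the comments), whose lines start with a hexadecimal code point and a name separated by semicolons, for example 0021;EXCLAMATION MARK;Po;... . LoadUnicodeNames and LookupUnicodeName are hypothetical names introduced for illustration. Entries whose name field starts with "<" (such as <control> and the First/Last range markers) are skipped, so ranges with algorithmically derived names, such as CJK ideographs, are not covered.

```cpp
#include <cstddef>
#include <istream>
#include <sstream>
#include <stdexcept>
#include <string>
#include <unordered_map>

using UnicodeNameTable = std::unordered_map<char32_t, std::string>;

// Hypothetical helper: builds a code point -> name table from text in the
// UnicodeData.txt format (field 0 = hex code point, field 1 = name).
UnicodeNameTable LoadUnicodeNames(std::istream& input)
{
    UnicodeNameTable names;
    std::string line;

    while (std::getline(input, line))
    {
        const std::size_t first = line.find(';');
        if (first == std::string::npos)
            continue;

        const std::size_t second = line.find(';', first + 1);
        if (second == std::string::npos)
            continue;

        const std::string name = line.substr(first + 1, second - first - 1);

        // Skip placeholders such as <control> and range markers.
        if (name.empty() || name.front() == '<')
            continue;

        try
        {
            const unsigned long value = std::stoul(line.substr(0, first), nullptr, 16);
            names.emplace(static_cast<char32_t>(value), name);
        }
        catch (const std::exception&)
        {
            // Not a hexadecimal code point; ignore the line.
            continue;
        }
    }

    return names;
}

// Returns the name for a code point, or an empty string if it is not in the table.
std::string LookupUnicodeName(const UnicodeNameTable& names, char32_t codePoint)
{
    const auto it = names.find(codePoint);
    return (it != names.end()) ? it->second : std::string();
}
```

Typical use would be to open the file with std::ifstream, call LoadUnicodeNames once, and keep the table for the lifetime of the program. Unlike GetUName, this also covers code points above U+FFFF.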
