C++ and Boost: encode/decode UTF-8

Question

24

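I'm trying to do a very simple task: take a Unicode-aware wstring and convert it to a string encoded as UTF8 bytes, and then the other way around: take a string containing UTF8 bytes and convert it to a Unicode-aware wstring.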

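The thing is, I need it to be cross-platform and I need it to work with Boost... but I can't seem to find a way to make it work. I've been fumbling around, trying to adapt existing code to use stringstream/wstringstream instead of files, but nothing seems to work.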

For instance, in Python it would look like this:

>>> u"שלום"
u'\u05e9\u05dc\u05d5\u05dd'
>>> u"שלום".encode("utf8")
'\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'
>>> '\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d'.decode("utf8")
u'\u05e9\u05dc\u05d5\u05dd'

What I'm ultimately after is this:

wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
wstring ws(uchars);
string s = encode_utf8(ws); 
// s now holds "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d"
wstring ws2 = decode_utf8(s);
// ws2 now holds {0x5e9, 0x5dc, 0x5d5, 0x5dd}

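I really don't want to add another dependency on ICU or something similar... but as far as I understand, this should be possible with Boost.

Some sample code would be greatly appreciated! Thanks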

- sebulba

Doesn't imbue work with stringstream? What's wrong with the utf8 codecvt facet? - Ben Voigt

2

See https://dev59.com/TXVC5IYBdhLWcg3w51ry. - Mark Ransom

7

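wchar_t/wstring is a poor choice for storing code points, because wchar_t is not guaranteed to be wide enough to hold every code point (if I recall correctly, on Windows wchar_t is not wide enough for code points outside the BMP). - etarion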
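To illustrate the comment above: where wchar_t is 16 bits wide, a code point above U+FFFF cannot be stored in one unit and has to be split into a UTF-16 surrogate pair. The following is only a sketch of that arithmetic; the helper name append_utf16 is hypothetical, and it assumes the input is already a valid code point (no validation is done).

```cpp
#include <string>

// Hypothetical helper: append one Unicode code point to a UTF-16 string.
// Assumes cp is a valid code point; no range or surrogate checks are made.
inline void append_utf16(std::u16string& out, char32_t cp)
{
    if (cp < 0x10000) {
        // Inside the BMP: one 16-bit unit is enough.
        out.push_back(static_cast<char16_t>(cp));
    } else {
        // Outside the BMP: 20 bits remain after subtracting 0x10000,
        // split into a high and a low surrogate of 10 bits each.
        cp -= 0x10000;
        out.push_back(static_cast<char16_t>(0xD800 + (cp >> 10)));
        out.push_back(static_cast<char16_t>(0xDC00 + (cp & 0x3FF)));
    }
}
```

With this, U+05E9 takes one unit, while U+1F600 takes two (0xD83D followed by 0xDE00), which is why a 16-bit wchar_t cannot hold one code point per element in general.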

4 Answers

18

A Boost link has already been posted in the comments, but in the almost-standard C++0x there is wstring_convert, which does exactly this job.

#include <iostream>
#include <string>
#include <locale>
#include <codecvt>
int main()
{
    wchar_t uchars[] = {0x5e9, 0x5dc, 0x5d5, 0x5dd, 0};
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    std::string s = conv.to_bytes(uchars);
    std::wstring ws2 = conv.from_bytes(s);
    std::cout << std::boolalpha
              << (s == "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d" ) << '\n'
              << (ws2 == uchars ) << '\n';
}

Output when compiled with MS Visual Studio 2010 EE SP1 or with Clang++ 2.9:

true 
true
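The same converter could be wrapped into the two functions the question asks for. This is only a sketch: the names encode_utf8 and decode_utf8 are taken from the question, and it assumes a standard library that still ships <codecvt> (the header is deprecated since C++17, so a compiler may warn). Where wchar_t is 16 bits wide, std::codecvt_utf8<wchar_t> only covers the BMP.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Sketch: wstring -> UTF-8 bytes. Throws std::range_error if conversion fails.
inline std::string encode_utf8(const std::wstring& ws)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.to_bytes(ws);
}

// Sketch: UTF-8 bytes -> wstring. Throws std::range_error on malformed input.
inline std::wstring decode_utf8(const std::string& bytes)
{
    std::wstring_convert<std::codecvt_utf8<wchar_t>> conv;
    return conv.from_bytes(bytes);
}
```

A fresh converter is created per call so that no conversion state is shared between calls; that costs a little, but keeps the functions safe to use from several threads.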

- Cubbi

Hasn't it become cross-platform yet? After C++11? - Mikhail

1

@Mikhail, it has, but GCC hasn't gotten around to implementing it yet. - Cubbi

1

To keep the answer up to date: as of C++17, std::codecvt_utf8, std::wstring_convert, and related classes are deprecated. You can use std::codecvt instead. - cbuchart

2

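@cbuchart Not quite. They were deprecated without a replacement being provided, in order to encourage a replacement in C++20. It's like std::strstream, which was deprecated in C++98 but is still in the standard with no replacement. - Cubbi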
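Since the standard facilities are deprecated without a replacement, one option is to write the conversion by hand. The following is not from the thread; it is a dependency-free sketch that uses char32_t (std::u32string) instead of wchar_t so that every code point fits in one element. The names utf32_to_utf8 and utf8_to_utf32 are hypothetical.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: encode UTF-32 code points as UTF-8 bytes.
inline std::string utf32_to_utf8(const std::u32string& in)
{
    std::string out;
    for (char32_t cp : in) {
        if (cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("not a Unicode scalar value");
        if (cp < 0x80) {                       // 1 byte: 0xxxxxxx
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {               // 2 bytes: 110xxxxx 10xxxxxx
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {             // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                               // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}

// Hypothetical helper: decode UTF-8 bytes into UTF-32 code points.
inline std::u32string utf8_to_utf32(const std::string& in)
{
    std::u32string out;
    for (std::size_t i = 0; i < in.size();) {
        const unsigned char lead = static_cast<unsigned char>(in[i]);
        std::size_t len;
        char32_t cp;
        // The lead byte determines the sequence length and its payload bits.
        if (lead < 0x80)                { len = 1; cp = lead; }
        else if ((lead & 0xE0) == 0xC0) { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        else throw std::invalid_argument("invalid UTF-8 lead byte");
        if (len > in.size() - i)
            throw std::invalid_argument("truncated UTF-8 sequence");
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::invalid_argument("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);
        }
        // Reject overlong forms, surrogates, and out-of-range values.
        static const char32_t min_cp[] = {0, 0, 0x80, 0x800, 0x10000};
        if (cp < min_cp[len] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            throw std::invalid_argument("invalid UTF-8 encoding");
        out += cp;
        i += len;
    }
    return out;
}
```

For the string from the question, utf32_to_utf8(U"\u05e9\u05dc\u05d5\u05dd") yields the bytes "\xd7\xa9\xd7\x9c\xd7\x95\xd7\x9d", and utf8_to_utf32 turns them back into the same four code points.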

13

Boost.Locale was released in Boost 1.48 (November 15, 2011), making it easier to convert from and to UTF8/UTF16.

Here are some convenient examples:

string utf8_string = to_utf<char>(latin1_string,"Latin1");
wstring wide_string = to_utf<wchar_t>(latin1_string,"Latin1");
string latin1_string = from_utf(wide_string,"Latin1");
string utf8_string2 = utf_to_utf<char>(wide_string);

Almost as easy as Python encoding/decoding :)

Note that Boost.Locale is not a header-only library.

- Diaa Sami

Could you add an example of a decode_utf8 operation with this library? - Dfr

You can try this form to decode utf-8 into a utf-16 wstring: wstring wide_string = to_utf<wchar_t>(utf8_bytes,"utf-8"); - Diaa Sami

2

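For a drop-in replacement for std::string/std::wstring that handles utf8, check out TINYUTF8.

In combination with <codecvt>, you can convert from and to pretty much any encoding via utf8, which you then handle through the library above.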

- Jakob Riedle

1

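Hi @jakob-riedle! I have a small question about your library (by the way, thanks a lot for the elegance and the amount of work you've put into it!). Could you take a look at this question: https://dev59.com/3Krka4cB1Zd3GeqPdnLf? - Vadim Berman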

- sebulba · Accepted Answer

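Thanks everyone, but in the end I went with http://utfcpp.sourceforge.net/, a header-only library that is very lightweight and easy to use. I'm sharing demo code here in case anyone finds it useful: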

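#include <iterator>  // std::back_inserter
#include <string>
#include "utf8.h"    // UTF8-CPP

// Note: utf8to32/utf32to8 treat each wchar_t as one code point, which fits
// platforms with a 32-bit wchar_t. With a 16-bit wchar_t (e.g. on Windows),
// utf8to16/utf16to8 are the matching calls.
inline void decode_utf8(const std::string& bytes, std::wstring& wstr)
{
    utf8::utf8to32(bytes.begin(), bytes.end(), std::back_inserter(wstr));
}
inline void encode_utf8(const std::wstring& wstr, std::string& bytes)
{
    utf8::utf32to8(wstr.begin(), wstr.end(), std::back_inserter(bytes));
}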

Usage:

wstring ws(L"\u05e9\u05dc\u05d5\u05dd");
string s;
encode_utf8(ws, s);