C++ Unicode Issues

Question

c++ unicode wofstream

6

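I know about ICU and about small libraries such as the utf8 one on Code Project, but neither is quite what I want.

What I really want is something like ICU, but wrapped up in a friendlier way.

Specifically:

Fully object-oriented.
Implements the C++ standard streams, or at least performs the same role.
Can format times, dates, etc. in a locale-dependent way (e.g. dd/mm/yy in the UK versus mm/dd/yy in the US).
Lets me choose the "internal" encoding of strings, so that I can use UTF-16 on Windows and avoid lots of conversions when passing strings to and from the Windows API and DirectX.
Easily converts strings between encodings.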

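If no such library exists, would it be possible to wrap ICU in standard C++ classes, so that I could create a ustring with the same usage as std::string and std::wstring, and implement versions of the streams as well? Ideally they would be fully compatible with the existing streams, i.e. I could pass one to a function expecting a std::ostream and it would convert between its internal format and ASCII (or UTF-8) on the fly. Assuming this is possible, how much work would it be?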

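EDIT: Also, having looked at the C++0x standard and noticed the utf8, utf16 and utf32 literals, does this mean the standard library (e.g. strings, streams, etc.) will fully support these encodings and conversions between them? If so, does anyone know when Visual Studio will support these features?

EDIT 2: As for using the existing C++ support, I'll look into the locale and facet stuff.

One problem I ran into is that when doing file I/O with the streams defined around wchar_t, the file itself still seems to be written as ASCII, even though wchar_t is 2 bytes on Windows.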

std::wofstream file(L"myfile.txt", std::ios::out);
file << L"Hello World!" << std::endl;

results in a file with the following hex contents:
48 65 6C 6C 6F 20 57 6F 72 6C 64 0D 0A
which is clearly ASCII, rather than the UTF-16 output I expected:
FF FE 48 00 65 00 6C 00 6C 00 6F 00 20 00 57 00 6F 00 72 00 6C 00 64 00 0D 00 0A 00

- Fire Lancer

The UTF-16 text is actually converted to the local 8-bit encoding! So you do not end up writing UTF-16 to the file. Don't forget to call std::locale::global(std::locale("")). - Artyom

OK, so how do I tell it which encoding I want for the file? I tried the std::local... you mentioned above, but it doesn't seem to have any effect :( - Fire Lancer

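OK, for example, if the system locale is ru_RU.UTF-8 then the encoding is UTF-8; if it is ru_RU.KOI-8 then it is KOI. You can also specify another locale: locale::global(locale("de_DE.ISO-8859-1")); (note that I am using POSIX locale names; on Windows you should check what the locale names are). - Artyom

OK, so how would I get the current locale (en_US, en_UK, etc.) and have it converted to UTF-16 for wide file I/O (and ASCII/UTF-8 for narrow streams)? - Fire Lancer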
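
For readers who just need the UTF-16 bytes from the question to reach the disk, here is a minimal sketch that sidesteps the locale machinery instead of configuring it: the text is encoded by hand and written through a narrow stream in binary mode. This is a swapped-in technique, not what the comments above propose, and the names to_utf16le_with_bom and write_utf16le_file are hypothetical helpers invented for illustration.

```cpp
#include <fstream>
#include <ios>
#include <string>

// Encode a UTF-16 string as little-endian bytes, preceded by the FF FE
// byte-order mark. (Hypothetical helper, not part of any library.)
std::string to_utf16le_with_bom(const std::u16string& text)
{
    std::string bytes;
    bytes.reserve(2 + text.size() * 2);
    bytes.push_back(static_cast<char>(0xFF));
    bytes.push_back(static_cast<char>(0xFE));
    for (char16_t unit : text) {
        bytes.push_back(static_cast<char>(unit & 0xFF));        // low byte first
        bytes.push_back(static_cast<char>((unit >> 8) & 0xFF)); // then high byte
    }
    return bytes;
}

// Write the encoded bytes through a narrow stream opened in binary mode,
// so no locale conversion or newline translation is applied to them.
bool write_utf16le_file(const std::string& path, const std::u16string& text)
{
    std::ofstream file(path.c_str(), std::ios::out | std::ios::binary);
    const std::string bytes = to_utf16le_with_bom(text);
    file.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
    return static_cast<bool>(file);
}
```

Calling write_utf16le_file("myfile.txt", u"Hello World!\r\n") should produce a file that starts with FF FE followed by two bytes per code unit. The trade-off is that stream formatting (operator<<) is no longer available on the file itself; the text has to be built in memory first.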

6 Answers

2

This is how I use ICU to convert between std::string (in UTF-8) and std::wstring.

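#include <string>
#include <vector>
#include <unicode/ustring.h> // u_strFromWCS, u_strToUTF8, u_strFromUTF8, u_strToWCS

/** Converts a std::wstring into a std::string with UTF-8 encoding.
 */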
template < typename StringT >
StringT utf8 ( std::wstring const & rc_string );

/** Converts a std::string with UTF-8 encoding into a std::wstring.
 */
template < typename StringT >
StringT utf8 ( std::string const & rc_string );

/** Nop specialization for std::string.
 */
template < >
inline std::string utf8 ( std::string const & rc_string )
{
  return rc_string;
}

/** Nop specialization for std::wstring.
 */
template < >
inline std::wstring utf8 ( std::wstring const & rc_string )
{
  return rc_string;
}

template < >
std::string utf8 ( std::wstring const & rc_string )
{
  std::string result;
  if(rc_string.empty())
    return result;

  std::vector<UChar> buffer;

  result.resize(rc_string.size() * 4); // UTF-8 needs at most 4 bytes per code point
  buffer.resize(rc_string.size() * 2); // UTF-16 needs at most 2 code units per code point

  UErrorCode status = U_ZERO_ERROR;
  int32_t len = 0;

  u_strFromWCS(
    &buffer[0],
    buffer.size(),
    &len,
    &rc_string[0],
    rc_string.size(),
    &status
  );
  if(!U_SUCCESS(status))
  {
    throw XXXException("utf8: u_strFromWCS failed");
  }
  buffer.resize(len);

  u_strToUTF8(
    &result[0],
    result.size(),
    &len,
    &buffer[0],
    buffer.size(),
    &status
  );
  if(!U_SUCCESS(status))
  {
    throw XXXException("utf8: u_strToUTF8 failed");
  }
  result.resize(len);

  return result;
}/* end of utf8 ( ) */


template < >
std::wstring utf8 ( std::string const & rc_string )
{
  std::wstring result;
  if(rc_string.empty())
    return result;

  std::vector<UChar> buffer;

  result.resize(rc_string.size());
  buffer.resize(rc_string.size());

  UErrorCode status = U_ZERO_ERROR;
  int32_t len = 0;

  u_strFromUTF8(
    &buffer[0],
    buffer.size(),
    &len,
    &rc_string[0],
    rc_string.size(),
    &status
  );
  if(!U_SUCCESS(status))
  {
    throw XXXException("utf8: u_strFromUTF8 failed");
  }
  buffer.resize(len);

  u_strToWCS(
    &result[0],
    result.size(),
    &len,
    &buffer[0],
    buffer.size(),
    &status
  );
  if(!U_SUCCESS(status))
  {
    throw XXXException("utf8: u_strToWCS failed");
  }
  result.resize(len);

  return result;
}/* end of utf8 ( ) */

Using it is as simple as this:

std::string s = utf8<std::string>(std::wstring(L"some string"));
std::wstring ws = utf8<std::wstring>(std::string("some string"));

- lothar

One bug: UTF-8 uses up to 4 bytes per character. One misuse of terminology: UTF-16 uses up to 2 code units per character. - dalle
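
To make the sizes in the comment above concrete, here is a small sketch of a UTF-16 to UTF-8 conversion written by hand, without ICU. It is an illustration only, not the answer's method; append_utf8 and utf16_to_utf8 are hypothetical names, and unpaired surrogates are replaced with U+FFFD as an assumed error policy.

```cpp
#include <cstddef>
#include <string>

// Append the UTF-8 encoding of one code point: 1 to 4 bytes.
void append_utf8(std::string& out, char32_t cp)
{
    if (cp < 0x80) {
        out.push_back(static_cast<char>(cp));
    } else if (cp < 0x800) {
        out.push_back(static_cast<char>(0xC0 | (cp >> 6)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else if (cp < 0x10000) {
        out.push_back(static_cast<char>(0xE0 | (cp >> 12)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    } else {
        out.push_back(static_cast<char>(0xF0 | (cp >> 18)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 12) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | ((cp >> 6) & 0x3F)));
        out.push_back(static_cast<char>(0x80 | (cp & 0x3F)));
    }
}

// Convert UTF-16 to UTF-8. A code point outside the BMP occupies two
// UTF-16 code units (a surrogate pair) and four UTF-8 bytes.
std::string utf16_to_utf8(const std::u16string& in)
{
    std::string out;
    for (std::size_t i = 0; i < in.size(); ++i) {
        char32_t cp = in[i];
        const bool is_high = cp >= 0xD800 && cp <= 0xDBFF;
        if (is_high && i + 1 < in.size() &&
            in[i + 1] >= 0xDC00 && in[i + 1] <= 0xDFFF) {
            // Combine the high and low surrogate into one code point.
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00);
            ++i;
        } else if (cp >= 0xD800 && cp <= 0xDFFF) {
            cp = 0xFFFD; // unpaired surrogate: substitute U+FFFD (assumed policy)
        }
        append_utf8(out, cp);
    }
    return out;
}
```

For example, U+1F600 is two code units in UTF-16 and four bytes in UTF-8, which is why a worst-case output buffer for UTF-8 has to allow four bytes per code point.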

2

I always work like this: byte stream in some encoding -> ICU -> wide character stream -> STL & Boost -> wide character output stream -> ICU -> byte stream in some encoding.

- puchu

1

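Formatting of dates, times, etc. can be done by setting a particular locale. As for rolling your own: that is always possible, pulling in as much or as little as you need from an underlying library.

"Also, having looked at the C++0x standard and noticed the utf8, utf16 and utf32 literals, does this mean the standard library (e.g. strings, streams, etc.) will fully support these encodings and conversions between them?"

Yes. But note that these are distinct data types, not your regular wchar_t sequences or wstring.

"If so, does anyone know when Visual Studio will support these features?"

AFAIK: vc9 (VS2008) has only partial support for some TR1 features. vc10 (VS2010) is expected to have better support.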

- dirkgently

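Yes, but it won't format into a specific encoding. Sure, I could format it as an ASCII string and then encode that, but what if I want long month names in Chinese, which is impossible in ASCII? - Fire Lancer

That is where the encoding part of the locale comes in. Also, look up facets. - dirkgently

Yes. The locale facilities are underrated. Don't force a format on the user. Let the system decide the format; you just need to make sure the locale is set correctly and the streams will work. (+1) - Martin York

"But note that these are distinct data types, not your regular wchar_t sequences or wstring." So when I create a class with overloaded >> and << operators, I will now have to write implementations for char, wchar_t and each of the Unicode data types (assuming I don't use templates, since I may not want them in a header but in a DLL instead)? Or will there be some kind of "universal" stream type? - Fire Lancer

No, with C++0x you would use those new types rather than wchar_t or wstring. - dirkgently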
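
The point about distinct types can be checked at compile time. The following is a small sketch, assuming a C++11 or later compiler; utf16_units and utf32_units are hypothetical helper names used only for illustration.

```cpp
#include <cstddef>
#include <string>
#include <type_traits>

// char16_t and char32_t are types of their own, so the string classes built
// on them are unrelated to std::wstring even where the sizes happen to match.
static_assert(!std::is_same<char16_t, wchar_t>::value,
              "char16_t is distinct from wchar_t");
static_assert(!std::is_same<char32_t, wchar_t>::value,
              "char32_t is distinct from wchar_t");
static_assert(!std::is_same<std::u16string, std::wstring>::value,
              "std::u16string is distinct from std::wstring");

// Number of UTF-16 code units in a u"" string.
std::size_t utf16_units(const std::u16string& s) { return s.size(); }

// Number of UTF-32 code units (that is, code points) in a U"" string.
std::size_t utf32_units(const std::u32string& s) { return s.size(); }
```

A consequence for the operator overloading question above is that a function taking const std::wstring& will not accept a std::u16string without an explicit conversion, so overloads or templates are needed per character type.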

-1

I wrote a small wrapper myself. I can share it if you want.

- piotr

Does it support C++ streams? My main problem with ICU is that I want to make my very large application Unicode-compatible. - Fire Lancer

-1

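Unfortunately not. I know that the Dinkumware library offers some Unicode support; you can check the documentation on their website. As far as I know, it is not free.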

- Nemanja Trifunovic

- Artyom · Accepted Answer

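"What I really want is something like ICU, but wrapped up in a friendlier way."

Unfortunately, there is no such thing. ICU's API is not that bad, so with some effort you can get used to it.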

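"Can format times, dates, etc. in a locale-dependent way (e.g. dd/mm/yy in the UK versus mm/dd/yy in the US)."

The std::locale class fully supports this; read up on how to use it. You can also imbue a std::iostream with a locale, so that it formats numbers and dates correctly.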
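
As a rough sketch of what imbuing a stream with a locale looks like, the code below formats a date with std::put_time. The helper names (locale_or_classic, make_date, format_date) are hypothetical, and locale names such as "en_GB.UTF-8" are POSIX-style assumptions that may not exist on a given system, which is why the sketch falls back to the classic "C" locale.

```cpp
#include <ctime>
#include <iomanip>
#include <locale>
#include <sstream>
#include <stdexcept>
#include <string>

// Return the named locale, or the classic "C" locale if that name is not
// available on this system (std::locale's constructor throws in that case).
std::locale locale_or_classic(const char* name)
{
    try {
        return std::locale(name);
    } catch (const std::runtime_error&) {
        return std::locale::classic();
    }
}

// Build a std::tm holding only a calendar date.
std::tm make_date(int year, int month, int day)
{
    std::tm t = std::tm();
    t.tm_year = year - 1900; // years since 1900
    t.tm_mon = month - 1;    // months are zero-based
    t.tm_mday = day;
    return t;
}

// Format the date through a stream imbued with the given locale.
// A pattern of "%x" asks for the locale's preferred date representation.
std::string format_date(const std::tm& when, const std::locale& loc,
                        const char* pattern)
{
    std::ostringstream out;
    out.imbue(loc);
    out << std::put_time(&when, pattern);
    return out.str();
}
```

With the locales installed, format_date(make_date(2009, 3, 15), locale_or_classic("en_GB.UTF-8"), "%x") would typically put the day first and the en_US equivalent the month first; the exact text depends on the platform's locale data.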

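"Easily converts strings between encodings."

std::locale provides facets for converting between the local 8-bit encoding and the wide encoding, in both directions.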
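
The facet in question is std::codecvt<wchar_t, char, std::mbstate_t>. Below is a minimal sketch of using it directly to widen a narrow string according to a given locale; widen_with_locale is a hypothetical name, and treating every result other than ok as an error is a simplifying assumption.

```cpp
#include <cwchar>
#include <locale>
#include <stdexcept>
#include <string>

// Convert a narrow string in the locale's 8-bit encoding to a wide string
// using the locale's codecvt facet.
std::wstring widen_with_locale(const std::string& in, const std::locale& loc)
{
    typedef std::codecvt<wchar_t, char, std::mbstate_t> facet_type;

    std::wstring out;
    if (in.empty())
        return out;

    const facet_type& cvt = std::use_facet<facet_type>(loc);
    std::mbstate_t state = std::mbstate_t();

    // Assumes at most one wide character is produced per input byte.
    out.resize(in.size());

    const char* from_next = nullptr;
    wchar_t* to_next = nullptr;
    const std::codecvt_base::result r = cvt.in(
        state,
        in.data(), in.data() + in.size(), from_next,
        &out[0], &out[0] + out.size(), to_next);

    if (r != std::codecvt_base::ok)
        throw std::runtime_error("widen_with_locale: conversion failed");

    out.resize(static_cast<std::wstring::size_type>(to_next - &out[0]));
    return out;
}
```

Passing std::locale("") instead of the classic locale would use the user's environment encoding, so the same call can decode UTF-8 on a system whose locale is, say, ru_RU.UTF-8.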

"So that I can have it use UTF-16."

ICU uses UTF-16 internally, and Win32 wchar_t and wstring are UTF-16 as well. On other operating systems most implementations make wchar_t UTF-32, so wstring holds UTF-32.
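
Because the width of wchar_t differs between platforms, portable code that needs UTF-16 has to branch on it. The following is a sketch under that assumption; wide_to_utf16 is a hypothetical helper, and it does not validate that the input holds well-formed code points.

```cpp
#include <string>

// Convert a std::wstring to UTF-16 regardless of the platform's wchar_t width.
// With a 2-byte wchar_t (Windows) the input is already UTF-16 and is copied.
// With a 4-byte wchar_t (most other systems) each element is a code point,
// and values above U+FFFF are split into a surrogate pair.
std::u16string wide_to_utf16(const std::wstring& in)
{
    std::u16string out;
    out.reserve(in.size());
    for (wchar_t wc : in) {
        const char32_t cp = static_cast<char32_t>(wc);
        if (sizeof(wchar_t) == 2 || cp < 0x10000) {
            out.push_back(static_cast<char16_t>(cp));
        } else {
            const char32_t v = cp - 0x10000;
            out.push_back(static_cast<char16_t>(0xD800 + (v >> 10)));   // high surrogate
            out.push_back(static_cast<char16_t>(0xDC00 + (v & 0x3FF))); // low surrogate
        }
    }
    return out;
}
```

On Windows this amounts to a plain copy, which is the "no conversion when talking to the Windows API" property the question asks for; elsewhere it performs the UTF-32 to UTF-16 step explicitly.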

Note: support for std::locale is not perfect, but it already provides many useful tools for working with characters.

See: http://www.cplusplus.com/reference/std/locale/