Iterating over a UTF-8 string in C++11

Question

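I'm trying to iterate over a UTF-8 string. As I understand it, the problem is that UTF-8 characters have variable length, so I can't simply step through the string one char at a time and have to use some kind of conversion instead. I'm sure modern C++ has a function for this, but I don't know what it is.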

#include <iostream>
#include <string>

int main()
{
  std::string text = u8"řabcdě";
  std::cout << text << std::endl; // Prints fine
  std::cout << "First letter is: " << text.at(0) << text.at(1) << std::endl; // Again fine. So 'ř' is a 2 byte letter?

  for(auto it = text.begin(); it < text.end(); it++)
  {
    // Obviously wrong. Outputs only ascii part of the text (a, b, c, d) correctly
    std::cout << "Iterating: " << *it << std::endl; 
  }
}

Compiled with clang++ -std=c++11 -stdlib=libc++ test.cpp.

From what I've read, wchar_t and wstring should not be used.

- Jan Šimek

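There is no such thing as a "UTF-8 character". If you are new to the subject, jumping straight into writing code will be frustrating and unrewarding. - Kerrek SB

Are you on a Unixoid system or on Windows? Do you need code units, code points, or graphemes? (What counts as a "character" is absurdly context-dependent, and even the context may not be enough to decide; on Windows there are additional problems.) - Deduplicator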

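You might want to look here: http://en.cppreference.com/w/cpp/locale/wstring_convert/from_bytes. Note that it doesn't work in gcc, because they haven't implemented this part of the standard yet, but it works in clang/libc++ and should work in VS2013 (if I remember correctly). - n. m.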

Possible duplicate of [Cross-platform iteration of Unicode string (counting Graphemes using ICU)](https://dev59.com/om455IYBdhLWcg3wCPfh). - Deduplicator

@n.m. Thanks, this works and is exactly what I was looking for (although it's a pity gcc doesn't support it yet). You could submit it as an answer. - Jan Šimek

1 Answer

- Jan Šimek · Accepted Answer

As n.m. suggested, I used std::wstring_convert:

#include <codecvt>
#include <locale>
#include <iostream>
#include <string>

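int main()
{
  // One char32_t per code point, so a range-for visits whole code points.
  std::u32string input = U"řabcdě";

  // Converts between UTF-8 bytes and UTF-32 code points.
  // Note: std::wstring_convert and std::codecvt_utf8 are deprecated since C++17.
  std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;

  for(char32_t c : input)
  {
    // Re-encode each code point as UTF-8 so std::cout can print it.
    std::cout << converter.to_bytes(c) << std::endl;
  }
}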
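The loop above starts from a std::u32string literal. For a std::string that already holds UTF-8 bytes, as in the question, the same converter can be run in the other direction with from_bytes, the function n.m. linked. The following is only a sketch of that idea; the helper name decode_utf8 is made up for illustration and is not part of the standard library or of the original answer.

```cpp
#include <codecvt>
#include <locale>
#include <string>

// Hypothetical helper (name chosen for this sketch): decode a UTF-8 byte
// string into UTF-32 code points using the C++11 conversion facets.
// from_bytes throws std::range_error if the input is not valid UTF-8.
std::u32string decode_utf8(const std::string& bytes)
{
  std::wstring_convert<std::codecvt_utf8<char32_t>, char32_t> converter;
  return converter.from_bytes(bytes);
}
```

Calling decode_utf8(text) on the string from the question and looping over the result with for(char32_t c : ...) should then visit one code point per iteration. As with the code above, this relies on facilities that are deprecated since C++17.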

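Perhaps I should have made it clearer in the question that I wanted to know whether this can be done in C++11 without any third-party library such as ICU or UTF8-CPP.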
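If even the codecvt facets are unavailable (as was reported for gcc in the comments), the byte sequence can also be split by hand, because the lead byte of each UTF-8 sequence encodes its length. The sketch below is not from the original answer: the names utf8_sequence_length and split_code_points are hypothetical, and it assumes reasonably well-formed input. It does not validate continuation bytes or reject overlong encodings, and it splits on code points, not graphemes.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: length in bytes of the UTF-8 sequence that starts
// with `lead`, or 0 if `lead` cannot start a sequence.
std::size_t utf8_sequence_length(unsigned char lead)
{
  if (lead < 0x80) return 1;            // 0xxxxxxx: ASCII
  if ((lead & 0xE0) == 0xC0) return 2;  // 110xxxxx
  if ((lead & 0xF0) == 0xE0) return 3;  // 1110xxxx
  if ((lead & 0xF8) == 0xF0) return 4;  // 11110xxx
  return 0;                             // continuation byte or invalid lead
}

// Hypothetical helper: split a UTF-8 string into one substring per code point.
// Bytes that cannot start a sequence, and sequences cut off by the end of the
// string, are passed through as single-byte pieces instead of being rejected.
std::vector<std::string> split_code_points(const std::string& text)
{
  std::vector<std::string> result;
  std::size_t i = 0;
  while (i < text.size())
  {
    std::size_t len = utf8_sequence_length(static_cast<unsigned char>(text[i]));
    if (len == 0 || i + len > text.size())
      len = 1; // malformed input: emit the byte unchanged and move on
    result.push_back(text.substr(i, len));
    i += len;
  }
  return result;
}
```

Each returned piece is itself valid to send to std::cout, so iterating over split_code_points(text) and printing every element would print the question's string one code point per line.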