How do I convert wchar_t* to std::string?

Question

40

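I changed my class to use std::string (based on the answer I got here), but one of my functions returns wchar_t*. How do I convert it to std::string?

I tried this:

std::string test = args.OptionArg();

But it reports error C2440: 'initializing' : cannot convert from 'wchar_t *' to 'std::basic_string<_Elem,_Traits,_Ax>'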

- codefrog

7 Answers

10

You can convert a wide character string to an ASCII string using the following function:

#include <locale>
#include <sstream>
#include <string>

std::string ToNarrow( const wchar_t *s, char dfault = '?', 
                      const std::locale& loc = std::locale() )
{
  std::ostringstream stm;

  while( *s != L'\0' ) {
    stm << std::use_facet< std::ctype<wchar_t> >( loc ).narrow( *s++, dfault );
  }
  return stm.str();
}

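Be aware that this only substitutes the dfault parameter for any wide character that has no equivalent ASCII character; it does not convert from UTF-16 to UTF-8. If you want to convert to UTF-8, use a library such as ICU.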
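
If pulling in a library is not an option, the UTF-8 encoding step can also be written by hand. The following is only a minimal sketch and not part of the original answer: to_utf8 is a hypothetical helper name, and it assumes wchar_t holds either UTF-32 code points (typical on Linux) or UTF-16 code units (Windows). It does not validate lone surrogates or out-of-range values.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: encodes a wide string as UTF-8 without consulting the locale.
std::string to_utf8(const std::wstring& w)
{
    std::string out;
    for (std::size_t i = 0; i < w.size(); ++i) {
        std::uint32_t cp = static_cast<std::uint32_t>(w[i]);
        // Combine a UTF-16 surrogate pair into a single code point.
        if (cp >= 0xD800 && cp <= 0xDBFF && i + 1 < w.size()) {
            const std::uint32_t lo = static_cast<std::uint32_t>(w[i + 1]);
            if (lo >= 0xDC00 && lo <= 0xDFFF) {
                cp = 0x10000 + ((cp - 0xD800) << 10) + (lo - 0xDC00);
                ++i;
            }
        }
        if (cp < 0x80) {            // 1 byte
            out += static_cast<char>(cp);
        } else if (cp < 0x800) {    // 2 bytes
            out += static_cast<char>(0xC0 | (cp >> 6));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else if (cp < 0x10000) {  // 3 bytes
            out += static_cast<char>(0xE0 | (cp >> 12));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        } else {                    // 4 bytes
            out += static_cast<char>(0xF0 | (cp >> 18));
            out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
            out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
            out += static_cast<char>(0x80 | (cp & 0x3F));
        }
    }
    return out;
}
```

For instance, to_utf8(L"\u00df") yields the two bytes 0xC3 0x9F. Unlike ToNarrow, nothing is replaced by a default character.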

- Praetorian

8

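It is rather disappointing that none of the answers to this old question addresses the problem of converting wide strings into UTF-8 strings, which is important in non-English environments.

Here is a working example that can serve as a hint for building custom converters. It is based on the example code at cppreference.com.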

#include <iostream>
#include <clocale>
#include <string>
#include <cstdlib>
#include <array>
#include <stdexcept>  // std::invalid_argument, std::logic_error

std::string convert(const std::wstring& wstr)
{
    const int BUFF_SIZE = 7;
    if (MB_CUR_MAX >= BUFF_SIZE) throw std::invalid_argument("BUFF_SIZE too small");
    std::string result;
    bool shifts = std::wctomb(nullptr, 0);  // reset the conversion state
    for (const wchar_t wc : wstr)
    {
        std::array<char, BUFF_SIZE> buffer;
        const int ret = std::wctomb(buffer.data(), wc);
        if (ret < 0) throw std::invalid_argument("inconvertible wide characters in the current locale");
        buffer[ret] = '\0';  // make 'buffer' contain a C-style string
        result = result + std::string(buffer.data());
    }
    return result;
}

int main()
{
    auto loc = std::setlocale(LC_ALL, "en_US.utf8");  // UTF-8
    if (loc == nullptr) throw std::logic_error("failed to set locale");
    std::wstring wstr = L"aąß水-扫描-€\u00df\u6c34\U0001d10b";
    std::cout << convert(wstr) << "\n";
}

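This prints the converted string, as expected.

Explanation

7 seems to be the minimum safe value of the buffer size BUFF_SIZE. It allows 4 bytes for the longest UTF-8 encoding of a single character, 2 for a possible "shift sequence", and 1 for the trailing '\0'.
MB_CUR_MAX is a run-time variable, so static_assert cannot be used here.
Each wide character is translated into its char representation using std::wctomb.
This conversion makes sense only if the current locale allows multi-byte representations of a character.
For this to work, the application needs to set a suitable locale. en_US.utf8 seems to be sufficiently universal (available on most machines). On Linux, the available locales can be queried from the console with the locale -a command.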
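
The per-character buffer, and with it the BUFF_SIZE reasoning, can be avoided by converting the whole string in one call. This is a minimal sketch rather than part of the original answer: convert_all is a hypothetical name, it depends on the current locale in the same way as std::wctomb, and std::wcsrtombs stops at the first embedded L'\0'.

```cpp
#include <cstddef>
#include <cwchar>
#include <stdexcept>
#include <string>

// Hypothetical helper: converts a whole wide string using the current C locale.
std::string convert_all(const std::wstring& wstr)
{
    std::mbstate_t state{};
    const wchar_t* src = wstr.c_str();
    // With a null destination, wcsrtombs only reports the number of bytes needed.
    const std::size_t len = std::wcsrtombs(nullptr, &src, 0, &state);
    if (len == static_cast<std::size_t>(-1))
        throw std::invalid_argument("inconvertible wide characters in the current locale");

    std::string result(len, '\0');
    src = wstr.c_str();
    state = std::mbstate_t{};
    // Second pass writes exactly len bytes into the string's own storage.
    std::wcsrtombs(&result[0], &src, len, &state);
    return result;
}
```

As with the per-character version, the caller is expected to have set a UTF-8 locale with std::setlocale first. In the default "C" locale only plain ASCII input is guaranteed to convert.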

Criticism of the most upvoted answer

The most upvoted answer,

std::wstring ws( args.OptionArg() );
std::string test( ws.begin(), ws.end() );

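works well only when the wide characters represent ASCII characters - but that is not what wide characters were designed for. In this solution, the converted string contains one char per source wide char, so ws.size() == test.size(). It therefore loses information from the original wstring and produces strings that cannot be interpreted as valid UTF-8 sequences. For example, on my machine the string resulting from this simplistic conversion of "ĄŚĆII" prints as "ZII", even though its size is 5 (it should be 8).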
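
The effect is easy to reproduce. The sketch below uses a hypothetical helper, truncate_each, that spells out the per-character cast the iterator-range constructor performs; the byte values in the comments assume a typical platform where the cast keeps the low byte.

```cpp
#include <string>

// Hypothetical helper mirroring std::string(ws.begin(), ws.end()):
// exactly one char is produced per wide character.
std::string truncate_each(const std::wstring& ws)
{
    std::string out;
    for (wchar_t wc : ws)
        out += static_cast<char>(wc);  // U+015A (Ś) becomes 0x5A, i.e. 'Z', on typical platforms
    return out;
}
```

Calling truncate_each(L"\u0104\u015a\u0106II") returns 5 chars. Ą and Ć collapse to the control bytes 0x04 and 0x06 while Ś collapses to 'Z', which would explain the "ZII" output.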

- zkoza

8

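This is an old question, but if you are not really looking for a conversion and are instead using Microsoft's TCHAR facilities to build for both ASCII and Unicode, recall that std::string is really

typedef std::basic_string<char> string;

So we can define our own typedef, for example:

#include <string>
#include <tchar.h>  // defines TCHAR (Windows only)
namespace magic {
typedef std::basic_string<TCHAR> string;
}

Then you can use magic::string with TCHAR, LPCTSTR, and so on.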

- paulluap

4

You could just use wstring and keep everything in Unicode.

- Steve Townsend

2

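If I use .c_str(), will I still get a const char*? I have other functions that expect a const char*. - codefrog

1

I'm guessing that you are building the project as Unicode when you don't actually need to. If my guess is right, you can change the project properties so it does not build as Unicode, and then you can use "string". Check this setting under Project Properties, Configuration Properties, General, Character Set. You need to set it to "Use Multi-Byte Character Set" to get rid of Unicode. - Steve Townsend

3

Since you are programming on Windows, it is preferable to use Unicode. The Windows API and NTFS natively support UTF-16, so building an ASCII application incurs extra overhead, because every function has to convert strings for you. - Praetorian

1

Many applications use utf-8 internally. Windows is the big problem, because wchar_t is not large enough and it does not fully support utf-8. That makes life difficult when you have an application with a large codebase that uses utf-8. Most of the time it works fine, but it gets annoying when interacting with some OS-level functions. - Stephen

41

How can this be the accepted answer if it doesn't even answer the question? - riv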

3

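The following code is more concise:

#include <cstdio>

wchar_t wstr[500] = L"hello";  // sample input
char string[500];
std::snprintf(string, sizeof string, "%ls", wstr);  // bounded write; conversion follows the current C locale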
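
The fixed 500-element buffers can be avoided by asking std::snprintf for the required size first. This is a sketch under the assumption that the current C locale can represent the input; narrow_printf is a hypothetical name, not something from the answer.

```cpp
#include <cstddef>
#include <cstdio>
#include <stdexcept>
#include <string>

// Hypothetical helper: "%ls" conversion into a std::string of exactly the right size.
std::string narrow_printf(const wchar_t* w)
{
    // A null buffer with size 0 makes snprintf report the needed length only.
    const int n = std::snprintf(nullptr, 0, "%ls", w);
    if (n < 0)
        throw std::invalid_argument("inconvertible wide characters in the current locale");

    std::string out(static_cast<std::size_t>(n), '\0');
    // n + 1 leaves room for the terminator, which lands on the string's own '\0'.
    std::snprintf(&out[0], static_cast<std::size_t>(n) + 1, "%ls", w);
    return out;
}
```

With ASCII input such as L"hello mfc" this returns "hello mfc" in any locale; other characters depend on the locale set through std::setlocale.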

- Pamela Hauff

3

Just for fun :-):

const wchar_t* val = L"hello mfc";
std::string test((LPCSTR)CStringA(val));  // CStringA converts to the ANSI code page, so this also compiles in Unicode builds

- Danil

- Ulterior · Accepted Answer

54

std::wstring ws( args.OptionArg() );
std::string test( ws.begin(), ws.end() );

- Ulterior

8

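Provides an actual answer to the question! - Ian

2

I love the simplicity of this solution. However, a little explanation wouldn't hurt. It leaves open the question of how the characters are actually converted. Is there information loss, or are the wide characters converted to Unicode? - Julian

25

I don't know why this answer gets so many upvotes. What it does is equivalent to char c = static_cast<char>( wideChar ) for each character, so it will obviously lose information if the characters of the wide string are outside the ASCII range. - zett42

My hero! Thank you for giving a direct answer for the 99.9% of us. - daparic

@zett42, any conversion of wchar_t to std::string is going to be lossy, by definition. - j b

4

That depends on the encoding of the std::string. With UTF-8 encoding, for example, no information is lost. - zett42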