Here is my C program. When I run it in bash, it prints "char is" but not "Ω". All of my locale settings are en_US.utf8.
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
int main() {
int r;
wchar_t myChar1 = L'Ω';
r = wprintf(L"char is %c\n", myChar1);
}
This is interesting. Apparently the compiler converts the omega from UTF-8 to Unicode, but libc somehow messes it up.
First: the %c format specifier expects a char (even in the wprintf version), so you have to specify %lc (and, for strings, %ls).
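Second, if you run the code as written, the locale is set to C (it is not picked up from the environment automatically). You need to call setlocale with an empty string to take the locale from the environment, and then libc works properly again.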
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
#include <locale.h>
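int main() {
    int r;
    wchar_t myChar1 = L'Ω';
    /* Take the locale from the environment; the startup default is "C". */
    setlocale(LC_CTYPE, "");
    /* %lc prints a wide character; %x prints its code point in hex. */
    r = wprintf(L"char is %lc (%x)\n", myChar1, (unsigned)myChar1);
    return 0;
}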
Besides the answer that suggests fixing this through libc, you can also do it like this:
#include <stdio.h>
#include <wchar.h>
#include <stdlib.h>
// NOTE: *NOT* thread safe, not re-entrant
const char* unicode_to_utf8(wchar_t c)
{
    static unsigned char b_static[5];
    unsigned char* b = b_static;
    if (c<(1<<7))// 7 bit Unicode encoded as plain ascii
    {
        *b++ = (unsigned char)(c);
    }
    else if (c<(1<<11))// 11 bit Unicode encoded in 2 UTF-8 bytes
    {
        *b++ = (unsigned char)((c>>6)|0xC0);
        *b++ = (unsigned char)((c&0x3F)|0x80);
    }
    else if (c<(1<<16))// 16 bit Unicode encoded in 3 UTF-8 bytes
    {
        *b++ = (unsigned char)(((c>>12))|0xE0);
        *b++ = (unsigned char)(((c>>6)&0x3F)|0x80);
        *b++ = (unsigned char)((c&0x3F)|0x80);
    }
    else if (c<(1<<21))// 21 bit Unicode encoded in 4 UTF-8 bytes
    {
        *b++ = (unsigned char)(((c>>18))|0xF0);
        *b++ = (unsigned char)(((c>>12)&0x3F)|0x80);
        *b++ = (unsigned char)(((c>>6)&0x3F)|0x80);
        *b++ = (unsigned char)((c&0x3F)|0x80);
    }
    *b = '\0';
    return b_static;
}
int main() {
    int r;
    wchar_t myChar1 = L'Ω';
    r = printf("char is %s\n", unicode_to_utf8(myChar1));
    return 0;
}
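The sole purpose of wchar_t is to support, in theory, different output encodings under different locales. If you want to hard-code UTF-8, just use char *myChar1 = "Ω"; and then printf with %s... - R.. GitHub STOP HELPING ICE
Use one of {glib, libiconv, ICU} to convert it to UTF-8 before output.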