中文字符的UTF-8宽度显示问题

Question

中文字符的UTF-8宽度显示问题

5

当我使用Perl或C来printf一些数据时，我尝试使用它们的格式来控制每列的宽度，比如：

printf("%-30s", str);

但是当字符串包含中文字符时，该列的对齐方式就不如预期那样了。请参见附件图片。

我的Ubuntu字符集编码是zh_CN.utf8，据我所知，utf-8编码有1到4个字节的长度。中文字符有3个字节。在我的测试中，我发现printf的格式控制将一个中文字符计为3个，但它实际上显示为2个ASCII宽度。

因此，实际的显示宽度并不像预期的那样是一个常量，而是与中文字符的数量相关的变量，即：

Sw(x) = 1 * (w - 3x) + 2 * x = w - x

w是预期的宽度限制，x是中文字符的数量，Sw(x)是实际显示宽度。

因此，包含更多中文字符的字符串将显示得更短。

我应该如何得到想要的结果？在printf之前计算中文字符数吗？

据我所知，所有中文甚至所有宽字符都显示为2个宽度，那么为什么printf会将其计为3个呢？UTF-8的编码与显示长度无关。

- gpanda

换句话说，您正在寻找适用于Perl和/或C的多字节感知版本的printf函数？ - deceze

1

@dystroy 这不仅仅是计算代码点（即符文）的问题。相反，它考虑到不同的代码点根据UAX#11表示0、1或2个打印列，这相当微妙，特别是对于East_Asian_Width=Ambiguous字符。我不知道是否有任何Go库可以像我回答中描述的Perl库那样处理这个问题，但如果有这样的东西供Go使用，我很想了解！谢谢。 - tchrist

显示宽度（屏幕位置数）、字符数和字节数是三个不同的概念。printf 只关心字节数。如果您想考虑字符数，请使用 wprintf（记住，它需要一个 wchar_t* 格式）。在 C 中没有格式化函数考虑显示宽度。 - n. m.

@tchrist：printf由C标准定义，任何偏离它的行为都是非标准的。 - n. m.

@tchrist：抱歉，我应该明确指出我只谈论C，而不是Perl或其他任何语言。 - n. m.

显示剩余6条评论

1个回答

网页内容由stack overflow 提供, 点击上面的

可以查看英文原文，
原文链接

- tchrist · Accepted Answer

是的，我知道所有版本的 printf 都存在这个问题。我在这个答案和这个答案中简要讨论了此事。

对于 C 语言，我不知道有没有库可以为您解决这个问题，但如果有的话，那就应该是 ICU。

对于 Perl，您需要使用 CPAN 中的 Unicode::GCString 模块来计算 Unicode 字符串占用的打印列数。它考虑了Unicode 标准附录#11：东亚宽度。

例如，某些代码点占用1列，而其他代码点占用2列。甚至还有一些根本不占用任何列，如组合字符和不可见的控制字符。该类具有一个 columns 方法，返回字符串占用的列数。

我有一个使用该类进行垂直对齐 Unicode 文本的示例，在这里。它会对一堆 Unicode 字符串进行排序，包括一些带有组合字符和“宽”亚洲表意文字（CJK 字符），并允许您垂直对齐这些字符串。

sample terminal output

下面是打印漂亮对齐输出的小 umenu 演示程序代码：

您可能还会对更为雄心勃勃的 Unicode::LineBreak 模块感兴趣，其前面提到的 Unicode::GCString 类只是其较小的组成部分。该模块更加强大，考虑了Unicode 标准附录#14：Unicode 文本断行算法。

下面是经过 Perl v5.14 测试的小 umenu 演示程序的代码：

 #!/usr/bin/env perl
 # umenu - demo sorting and printing of Unicode food
 #
 # (obligatory and increasingly long preamble)
 #
 use utf8;
 use v5.14;                       # for locale sorting
 use strict;
 use warnings;
 use warnings  qw(FATAL utf8);    # fatalize encoding faults
 use open      qw(:std :utf8);    # undeclared streams in UTF-8
 use charnames qw(:full :short);  # unneeded in v5.16

 # std modules
 use Unicode::Normalize;          # std perl distro as of v5.8
 use List::Util qw(max);          # std perl distro as of v5.10
 use Unicode::Collate::Locale;    # std perl distro as of v5.14

 # cpan modules
 use Unicode::GCString;           # from CPAN

 # forward defs
 sub pad($$$);
 sub colwidth(_);
 sub entitle(_);

 my %price = (
     "γύρος"             => 6.50, # gyros, Greek
     "pears"             => 2.00, # like um, pears
     "linguiça"          => 7.00, # spicy sausage, Portuguese
     "xoriço"            => 3.00, # chorizo sausage, Catalan
     "hamburger"         => 6.00, # burgermeister meisterburger
     "éclair"            => 1.60, # dessert, French
     "smørbrød"          => 5.75, # sandwiches, Norwegian
     "spätzle"           => 5.50, # Bayerisch noodles, little sparrows
     "包子"              => 7.50, # bao1 zi5, steamed pork buns, Mandarin
     "jamón serrano"     => 4.45, # country ham, Spanish
     "pêches"            => 2.25, # peaches, French
     "シュークリーム"    => 1.85, # cream-filled pastry like éclair, Japanese
     "막걸리"            => 4.00, # makgeolli, Korean rice wine
     "寿司"              => 9.99, # sushi, Japanese
     "おもち"            => 2.65, # omochi, rice cakes, Japanese
     "crème brûlée"      => 2.00, # tasty broiled cream, French
     "fideuà"            => 4.20, # more noodles, Valencian (Catalan=fideuada)
     "pâté"              => 4.15, # gooseliver paste, French
     "お好み焼き"        => 8.00, # okonomiyaki, Japanese
 );

 my $width = 5 + max map { colwidth } keys %price;

 # So the Asian stuff comes out in an order that someone
 # who reads those scripts won't freak out over; the
 # CJK stuff will be in JIS X 0208 order that way.
 my $coll  = new Unicode::Collate::Locale locale => "ja";

 for my $item ($coll->sort(keys %price)) {
     print pad(entitle($item), $width, ".");
     printf " €%.2f\n", $price{$item};
 }

 sub pad($$$) {
     my($str, $width, $padchar) = @_;
     return $str . ($padchar x ($width - colwidth($str)));
 }

 sub colwidth(_) {
     my($str) = @_;
     return Unicode::GCString->new($str)->columns;
 }

 sub entitle(_) {
     my($str) = @_;
     $str =~ s{ (?=\pL)(\S)     (\S*) }
              { ucfirst($1) . lc($2)  }xge;
     return $str;
 }

正如您所见，使它在该特定程序中运行的关键是这行代码，它只是调用上面定义的其他函数，并使用我讨论的模块：

print pad(entitle($item), $width, ".");

这将使用点作为填充字符，将该项填充到给定的宽度。

是的，这比 printf 不方便得多，但至少是可行的。