What is causing this strange problem parsing UTF-8 command-line arguments on Windows?

Question

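I am trying to pass a string that contains the Unicode character "right single quotation mark" (decimal 8217, hex \x{2019}).

Perl is not receiving the character correctly. Let me show you the details:

Here is the Perl script (let's call it test.pl):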

use warnings;
use strict;
use v5.32;
use utf8; # Some UTF-8 chars are present in the code's comments

# Get the first argument
my $arg=shift @ARGV or die 'This script requires one argument';

# Get some env vars with sensible defaults if absent
my $lc_all=$ENV{LC_ALL} // '{unset}';
my $lc_ctype=$ENV{LC_CTYPE} // '{unset}';
my $lang=$ENV{LANG} // '{unset}';

# Determine the current Windows code page
my ($active_codepage)=`chcp 2>NUL`=~/: (\d+)/;

# Our environment
say "ENV: LC_ALL=$lc_all LC_CTYPE=$lc_ctype LANG=$lang";
say "Active code page: $active_codepage"; # Note: 65001 is UTF-8

# Saying the wrong thing, expected: 0’s    #### Note: Between the '0' and the 's'
#   is a "right single quotation mark" and should be in utf-8 => 
#   Decimal: 8217 Hex: \x{2019}
# For some strange reason the bytes "\x{2019}" are coming in as "\x{92}" 
#   which is the single-byte CP1252 representation of the character "right 
#   single quotation mark"
# The whole workflow is UTF-8, so I don't know where there is a CP1252 
#   translation of the input argument (outside of Perl that is)

# Display the value of the argument and its length
say "Argument: $arg length: ",length($arg);

# Display the bytes that make up the argument's string
print("Argument hex bytes:");
for my $chr_idx (0 .. length($arg)-1)
{
  print sprintf(' %02x',ord(substr($arg,$chr_idx,1)));
}
say ''; # Newline

I run the Perl script as follows:

V:\videos>c:\perl\5.32.0\bin\perl test.pl 0’s

Output:

ENV: LC_ALL=en-US.UTF-8 LC_CTYPE={unset} LANG=en_US.UTF-8
Argument: 0s length: 3
Argument hex bytes: 30 92 73

OK, maybe we need to ask for UTF-8 everywhere (stdin/stdout/stderr and the command-line arguments)?

V:\videos>c:\perl\5.32.0\bin\perl -CSDA test.pl 0’s

Output:

ENV: LC_ALL=en-US.UTF-8 LC_CTYPE={unset} LANG=en_US.UTF-8
Active code page: 65001
Argument: 0s length: 3
Argument hex bytes: 30 92 73

OK, let's try removing all of the LC*/LANG environment variables entirely:

@SET LC_ALL=
@SET LANG=

@REM Proof that everything has been cleared
@REM Note: The caret before the vertical bar escapes it,
@REM       because I have grep set up to run through a
@REM       batch file and need to forward args
@set | grep -iP "LC^|LANG" || echo %errorlevel%

Output:

Let's try running the script with UTF-8 again:

V:\videos>c:\perl\5.32.0\bin\perl -CSDA test.pl 0’s

Output (nothing has changed, other than the LC*/LANG environment variables being cleared):

ENV: LC_ALL={unset} LC_CTYPE={unset} LANG={unset}
Active code page: 65001
Argument: 0s length: 3
Argument hex bytes: 30 92 73

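At this point, I decided to step outside of Perl and see what Windows 10 itself does with my command-line argument. I have a small utility that I wrote in C# a while back to help troubleshoot command-line argument issues, and I used it to test. The output should be self-explanatory: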

V:\videos>ShowArgs 0’s

Filename: |ShowArgs.exe|
Pathname: |c:\bin\ShowArgs.exe|
Work dir:  |V:\videos|

Command line: ShowArgs  0’s

Raw command line characters:

000: |ShowArgs  |: S (083:53) h (104:68) o (111:6F) w (119:77) A (065:41) r (114:72) g (103:67) s (115:73)   (032:20)   (032:20)
010: |0’s       |: 0 (048:30) ’ (8217:2019) s (115:73)

Command line args:

00: |0’s|

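This demonstrates a couple of things:

The argument does not need to be quoted (not that I expected it to)
Windows correctly passes the UTF-8 encoded argument to the application

But I cannot for the life of me figure out why Perl is not receiving the argument as UTF-8 at this point.

Of course, as an absolute hack, appending the following to the bottom of the Perl script works around the problem. But I would still like to know why Perl is not receiving the argument as UTF-8: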

# ... Appended to original script shown at top ...
use Encode qw(encode decode);

sub recode 
{ 
  return encode('UTF-8', decode( 'cp1252', $_[0] ));
}

say "\n@{['='x60]}\n"; # Output separator
say "Original arg: $arg";
say "After recoding CP1252 -> UTF-8: ${\recode($arg)}";

Script execution:

V:\videos>c:\perl\5.32.0\bin\perl test.pl 0’s

New output:

ENV: LC_ALL=en_US.UTF-8 LC_CTYPE={unset} LANG={unset}
Active code page: 65001
Argument: 0s length: 3
Argument hex bytes: 0030 0092 0073

============================================================

Original arg: 0s
After recoding CP1252 -> UTF-8: 0’s

Update

I built a simple C++ test application to get a better handle on what is going on.

Here is the source code:

#include <cstdint>
#include <cstring>
#include <iostream>
#include <iomanip>

int main(int argc, const char *argv[])
{
  if (argc!=2)
  {
    std::cerr << "A single command line argument is required\n";
    return 1;
  }

  const char *arg=argv[1];
  std::size_t arg_len=strlen(arg);

  // Display argument as a string
  std::cout << "Argument: " << arg << " length: " << arg_len << '\n';

  // Display argument bytes
  // Fill with leading zeroes
  auto orig_fill_char=std::cout.fill('0');

  std::cout << "Bytes of argument, in hex:";
  std::cout << std::hex;
  for (std::size_t arg_idx=0; arg_idx<arg_len; ++arg_idx)
  {
    // Note: The cast to uint16_t is necessary because uint8_t is formatted 
    //       "specially" (i.e., still as a char and not as an int)
    //       The cast through uint8_t is necessary due to sign extension of
    //       the original char if going directly to uint16_t and the (signed) char
    //       value is negative.
    //       I could have also masked off the high byte after the cast, with
    //       insertion code like (Note: Parens required due to precedence):
    //         << (static_cast<uint16_t>(arg[arg_idx]) & 0x00ff)
    //       As they say back in Perl-land, "TMTOWTDI!", and in this case it
    //       amounts to the C++ version of Perl "line noise" no matter which
    //       way you slice it. :)
    std::cout << ' ' 
              << std::setw(2) 
              << static_cast<uint16_t>(static_cast<uint8_t>(arg[arg_idx])); 
  }
  std::cout << '\n';

  // Restore the original fill char and go back to decimal mode
  std::cout << std::setfill(orig_fill_char) << std::dec;
}

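The code above was built as a 64-bit console application with the character set setting set to MBCS, and run as follows:

testapp.exe 0’s

..., which produced the following output: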

Argument: 0s length: 3
Argument bytes: 30 92 73

So it turns out to be Windows after all, at least to some extent. I need to build a UNICODE character set version of the application and see what I get.

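Final solution

Thanks to Eryk Sun's comments on ikegami's answer, and the links in them, I have found the best solution, at least on Windows 10. I will now outline the specific steps needed to force Windows to pass command-line arguments to Perl as UTF-8:

A manifest needs to be added to perl.exe and to wperl.exe (if you use it), telling Windows to use UTF-8 as the active code page (ACP) when running the perl.exe application. This makes Windows pass command-line arguments to perl as UTF-8 instead of CP1252.

Changes that need to be made

Create the manifest file

Go to the location of your perl.exe (and wperl.exe) and, in that (...\bin) directory, create a file with the following contents and name it perl.exe.manifest: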

<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
  <assemblyIdentity type="win32" name="perl.exe" version="6.0.0.0"/>
  <application>
    <windowsSettings>
      <activeCodePage
        xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings"
      >UTF-8</activeCodePage>
    </windowsSettings>
  </application>
</assembly>

If you also want to modify wperl.exe, copy the above file perl.exe.manifest to wperl.exe.manifest and edit that copy, replacing the assemblyIdentity line:

  <assemblyIdentity type="win32" name="perl.exe" version="6.0.0.0"/>

with one whose name attribute is wperl.exe instead of perl.exe, like so:

  <assemblyIdentity type="win32" name="wperl.exe" version="6.0.0.0"/>

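Embed the manifests into the executables

The next step is to embed the manifest files just created into their respective executables. Before doing so, be sure to back up the original executables, just in case!

The manifests can be embedded into the executables as follows:

For perl.exe: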

mt.exe -manifest perl.exe.manifest -outputresource:perl.exe;#1

For wperl.exe (optional, only needed if you use wperl.exe):

mt.exe -manifest wperl.exe.manifest -outputresource:wperl.exe;#1

If you do not already have the mt.exe executable, it ships as part of the Windows 10 SDK, currently located at: Download Windows 10 SDK at developer.microsoft.com

Basic testing and usage

With the above changes in place, UTF-8 command-line arguments become very easy!

Take the following script, simple-test.pl:

use strict;
use warnings;
use v5.32; # Or whatever recent version of Perl you have

# Helper subroutine to provide simple hex table output formatting
sub BYTES_PER_LINE () {16} # Output 16 hex pairs per line

sub hexdump
{
  my ($arg)=@_;

  for my $chr_idx (0 .. length($arg)-1)
  {
    # Start a new line, labeled with the hex offset, every 16 bytes
    # (double quotes are required so that \n is a real newline)
    printf("\n  %02x: ", $chr_idx)
      if $chr_idx%BYTES_PER_LINE==0;
    printf('%02x ',ord(substr($arg,$chr_idx,1)));
  }
  say '';
}

# Test app code that makes no mention of Windows, ACPs, or UTF-8 outside
# of stuff that is printed. Other than the call out to chcp to get the
# active code page for informational purposes, it is not particularly tied
# to Windows, either, as long as whatever environment it is run on
# passes the script its arg as UTF-8, of course.
my $arg=shift @ARGV or die 'No argument present';

say "Argument: $arg";
say "Argument byte length: ${\length($arg)} bytes";
print 'Argument UTF-8 data bytes in hex:';
hexdump($arg);

Let's test the script, making sure we are in the UTF-8 code page (65001):

v:\videos>chcp 65001 && perl.exe simple-test.pl "Работа с 𝟘’𝙨 vis-à-vis 0's using UTF-8"

Output (assuming your console font can handle the special characters):

Active code page: 65001
Argument: Работа с 𝟘’𝙨 vis-à-vis 0's using UTF-8
Argument byte length: 54 bytes
Argument UTF-8 data bytes in hex:
  00: d0 a0 d0 b0 d0 b1 d0 be d1 82 d0 b0 20 d1 81 20
  10: f0 9d 9f 98 e2 80 99 f0 9d 99 a8 20 76 69 73 2d
  20: c3 a0 2d 76 69 73 20 30 27 73 20 75 73 69 6e 67
  30: 20 55 54 46 2d 38

I hope my solution helps others who run into this problem.

- Michael Goldshteyn

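"92" does not correspond to a quotation mark in any character set I can find. But you definitely put U+2019 in the question. Strange. - Schwern

Windows natively supports UTF-16. If Perl supports UTF-8 for command-line arguments, environment variables, and console I/O, then it does so by transcoding between UTF-16 and UTF-8. The only exception is that, in Windows 8+, the console output code page can be set to UTF-8 (65001), but setting the input code page to UTF-8 is limited to 7-bit ASCII; non-ASCII characters are read as null bytes. The only reliable way to support UTF-8 for Windows console input and output is to use the wide-character API (e.g. ReadConsoleW, WriteConsoleW) and transcode between UTF-16 and UTF-8. Python does this. Does Perl? - Eryk Sun

@Schwern Look up \x{92} in the Windows CP1252 table on Wikipedia - it is in row 9_, column _2. - Michael Goldshteyn

I wrote a C++ test application to dig further into what is happening. At least with an MBCS character set console application, I see the same faulty behavior (nothing to do with Perl). Next I will try building a UNICODE character set version and see what I get. - Michael Goldshteyn

I have added a summary of the ideas presented at the Microsoft link from ikegami's answer. - Michael Goldshteyn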

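2 Answers

use utf8 only tells Perl that the source code itself is UTF-8, so that it accepts UTF-8 in things like variable and function names. Everything else is unchanged, including @ARGV. So my $arg=shift @ARGV is reading raw bytes.

Unicode in Perl is complicated. The simplest approach is to switch to use utf8::all, which turns on UTF-8 for the syntax, all filehandles, @ARGV, and everything else.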
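
To make the raw-bytes point concrete, here is a minimal sketch using only the core Encode module. It does not read the real @ARGV; instead it assumes (for illustration) an argument that arrived as the UTF-8 bytes of 0, U+2019, s, and compares what Perl sees before and after an explicit decode. The variable names are made up for this example.

```perl
use strict;
use warnings;
use Encode qw( decode );

# Simulated raw argument: the UTF-8 bytes of the characters 0, U+2019, s,
# as a script would see them in @ARGV on a system that passes UTF-8.
my $raw = "0\xE2\x80\x99s";

# Undecoded, Perl treats the value as five separate bytes.
my $raw_len = length $raw;

# Decoding yields three characters; the middle one is U+2019.
my $text     = decode('UTF-8', $raw);
my $text_len = length $text;
my $middle   = sprintf 'U+%04X', ord substr($text, 1, 1);

print "raw length: $raw_len, decoded length: $text_len, middle: $middle\n";
```

This only helps when the bytes really are UTF-8, which is exactly what is in doubt on Windows.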

- Schwern

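@MichaelGoldshteyn That says your input is \x{92} in CP1252, not \x{2019} in UTF-8. - Schwern

I had already worked that out, per the "hackish" code at the bottom of my question, which converts CP1252 to UTF-8 and produces the correct output. But the big question is: why? Especially given that chcp correctly reports 65001 and my ShowArgs utility correctly shows the UTF-8 data for the command-line argument? - Michael Goldshteyn

Thank you for mentioning utf8::all! And for contributing to it, of course :) I did not know about it, and it looks very convenient and useful! - zdim

The chcp command shows 65001 (just as it does when run from within the script). - Michael Goldshteyn

chcp cannot help here, and utf8::all is not very useful on Windows either. See my answer for details. - ikegami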

- ikegami · Accepted Answer

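Every Windows system call that deals with strings comes in two variants: an "A"NSI version that uses the Active Code Page (aka ANSI Code Page), and a "W"ide version that uses UTF-16le.^[1] Perl uses the A version of all system calls. That includes the call that fetches the command line.

The ACP is hard-coded. (Or maybe Windows asks for the system language during setup and bases it on that? I cannot remember.) For example, it is 1252 on my system, and there is nothing I can do to change it. Notably, chcp has no effect on the ACP.

At least, that was the case until recently. The May 2019 update to Windows added the ability to change the ACP on a per-application basis via the application's manifest. (The page indicates it is possible to change the manifest of an existing application.)

chcp changes the console's CP, but not the encoding used by the A system calls. Setting it to a code page that contains ’ ensures that you can type in ’, and that Perl can print out a ’ (if properly encoded).^[2] Since 65001 contains ’, you have no problem doing either of these.

The choice of console CP (as set by chcp) has no effect on how Perl receives the command line. Because Perl uses the A version of the system calls, the command line will be encoded using the ACP, regardless of the console's CP and the OEM CP.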
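
As a rough illustration of the difference between the two variants, the sketch below uses the core Encode module to show the bytes involved. It is a simulation only: it does not call any Windows API, the hex_bytes helper is made up for this example, and Windows' own conversion of characters that are missing from the ACP may differ from Encode's substitution.

```perl
use strict;
use warnings;
use Encode qw( encode );

# Hypothetical helper: space-separated lowercase hex of a byte string.
sub hex_bytes { join ' ', map { sprintf '%02x', ord } split //, $_[0] }

my $arg = "0\x{2019}s";    # the characters 0, U+2019, s

# Roughly what a W call carries: UTF-16LE code units.
my $wide = hex_bytes(encode('UTF-16LE', $arg));

# Roughly what an A call yields when the ACP is 1252.
my $ansi = hex_bytes(encode('cp1252', $arg));

# A character that CP1252 lacks (Cyrillic U+0420) cannot survive the A
# route; Encode replaces it with its substitution character here.
my $lost = hex_bytes(encode('cp1252', "\x{0420}"));

print "UTF-16LE:          $wide\n";
print "CP1252:            $ansi\n";
print "U+0420 via CP1252: $lost\n";
```

The CP1252 line reproduces the 30 92 73 seen in the question, and the last line shows why decoding from the ACP cannot recover characters that the ACP does not contain.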

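Based on the fact that ’ is encoded as 92, your system appears to use 1252 as its Active Code Page as well. So you can work around the problem as follows: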

use Encode qw( decode );

# Decode in place (no "my"), so that shift, <> and friends see the decoded values
@ARGV = map { decode("cp1252", $_) } @ARGV;

See this post for a more generic and portable solution that also adds the appropriate encoding/decoding layers to STDIN, STDOUT and STDERR.
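
For readers who just want the idea, here is a minimal sketch of a less hard-coded variant. It is not the code from the linked post. The helpers encoding_for_code_page and decode_args are made up for this example, and it assumes that arguments are UTF-8 on non-Windows systems and that Encode knows the ACP under the name "cp" followed by its number.

```perl
use strict;
use warnings;
use Encode qw( find_encoding );

# Hypothetical helper: map a Windows code page number to an Encode name.
sub encoding_for_code_page {
    my ($cp) = @_;
    return 'UTF-8' if $cp == 65001;    # Encode calls code page 65001 "UTF-8"
    return "cp$cp";                    # e.g. 1252 -> "cp1252"
}

# Hypothetical helper: decode a list of byte strings from a code page.
sub decode_args {
    my ($cp, @bytes) = @_;
    my $name = encoding_for_code_page($cp);
    my $enc  = find_encoding($name)
        or die "Encode does not know encoding '$name'\n";
    return map { $enc->decode($_) } @bytes;
}

# On Windows, ask for the real ACP; elsewhere assume UTF-8 arguments.
my $acp = 65001;
if ($^O eq 'MSWin32') {
    require Win32;
    $acp = Win32::GetACP();
}

my @args = decode_args($acp, @ARGV);
printf "ACP %d, %d argument(s) decoded\n", $acp, scalar @args;
```

Keeping the decoding in a function that takes the code page as a parameter makes it easy to check against known bytes, for example decode_args(1252, "0\x92s").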

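But what if you want to support arbitrary Unicode characters instead of being limited to those found in your system's ACP? As mentioned above, you can change perl's ACP. Changing it to 65001 (UTF-8) gives you access to the entire Unicode character set.

Short of doing that, you would need to get the command line from the OS using the W version of the system call and parse it yourself.

While Perl itself uses the A version of system calls, that does not restrict modules to doing the same. They can use the W system calls.^[3] So maybe there is a module that does what you need. If not, I have previously written code that does this.

Many thanks to @Eryk Sun for the help in the comments.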

The ACP can be obtained using Win32::GetACP().
The OEM CP can be obtained using Win32::GetOEMCP().
The console's CP can be obtained using Win32::GetConsoleCP() / Win32::GetConsoleOutputCP().
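
Put together, those calls could be used along the following lines. This is a minimal sketch: the code_pages helper and its labels are made up for this example, and the script only reports something when run on Windows with the Win32 module available.

```perl
use strict;
use warnings;

# Hypothetical helper: collect the code pages listed above as
# label/value pairs. Returns an empty list when not on Windows.
sub code_pages {
    return () unless $^O eq 'MSWin32';
    require Win32;
    return (
        'ACP (A system calls)' => Win32::GetACP(),
        'OEM CP'               => Win32::GetOEMCP(),
        'Console input CP'     => Win32::GetConsoleCP(),
        'Console output CP'    => Win32::GetConsoleOutputCP(),
    );
}

my @pages = code_pages();
print "Not running on Windows; nothing to report\n" unless @pages;

# Print the pairs in the order they were listed.
while (my ($label, $cp) = splice @pages, 0, 2) {
    printf "%-22s %s\n", $label, $cp;
}
```

Running it before and after chcp 65001 should show the console values changing while the ACP stays the same, which is the behavior described in this answer.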

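^[1] SetFileApisToOEM can be used to change the encoding used by some A system calls to the OEM CP.^[2]
^[2] The console's CP defaults to the system's OEM CP. This can be overridden by changing the CodePage value of the HKCU\Console\<window title> registry key, where <window title> is the initial window title of the console. Of course, it can also be overridden using chcp and its underlying system calls.
^[3] Notably, see Win32::LongPath.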