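I am trying to pass a string that contains the Unicode character "right single quotation mark" (decimal 8217, hex \x{2019}) to a Perl script as a command-line argument.
Perl is not receiving the character correctly. Let me show you the details.
Here is the Perl script (let's call it test.pl):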
use warnings;
use strict;
use v5.32;
use utf8; # Some UTF-8 chars are present in the code's comments
# Get the first argument
my $arg=shift @ARGV or die 'This script requires one argument';
# Get some env vars with sensible defaults if absent
my $lc_all=$ENV{LC_ALL} // '{unset}';
my $lc_ctype=$ENV{LC_CTYPE} // '{unset}';
my $lang=$ENV{LANG} // '{unset}';
# Determine the current Windows code page
my ($active_codepage)=`chcp 2>NUL`=~/: (\d+)/;
# Our environment
say "ENV: LC_ALL=$lc_all LC_CTYPE=$lc_ctype LANG=$lang";
say "Active code page: $active_codepage"; # Note: 65001 is UTF-8
# Saying the wrong thing, expected: 0’s #### Note: Between the '0' and the 's'
# is a "right single quotation mark" and should be in utf-8 =>
# Decimal: 8217 Hex: \x{2019}
# For some strange reason the bytes "\x{2019}" are coming in as "\x{92}"
# which is the single-byte CP1252 representation of the character "right
# single quotation mark"
# The whole workflow is UTF-8, so I don't know where there is a CP1252
# translation of the input argument (outside of Perl that is)
# Display the value of the argument and its length
say "Argument: $arg length: ",length($arg);
# Display the bytes that make up the argument's string
print("Argument hex bytes:");
for my $chr_idx (0 .. length($arg)-1)
{
print sprintf(' %02x',ord(substr($arg,$chr_idx,1)));
}
say ''; # Newline
I run the Perl script as follows:
V:\videos>c:\perl\5.32.0\bin\perl test.pl 0’s
Output:
ENV: LC_ALL=en-US.UTF-8 LC_CTYPE={unset} LANG=en_US.UTF-8
Argument: 0s length: 3
Argument hex bytes: 30 92 73
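The bytes above are what one would expect if the argument arrived encoded as Windows-1252 rather than UTF-8. As a minimal sketch (it uses only the core Encode module, and the helper name hex_bytes is mine, not part of test.pl), encoding the intended string both ways shows the difference:

```perl
use strict;
use warnings;
use Encode qw(encode);

# Hypothetical helper: render a byte string as space-separated hex pairs.
sub hex_bytes {
    my ($bytes) = @_;
    return join ' ', map { sprintf '%02x', ord($_) } split //, $bytes;
}

# The intended argument as Perl characters: '0', U+2019, 's'.
my $text = "0\x{2019}s";

print 'cp1252: ', hex_bytes(encode('cp1252', $text)), "\n";   # 30 92 73
print 'UTF-8:  ', hex_bytes(encode('UTF-8',  $text)), "\n";   # 30 e2 80 99 73
```

The first line matches what the script reports, which suggests the conversion to CP1252 happens before Perl ever sees the argument.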
OK, maybe we need to specify UTF-8 for everything (standard input/output/error and the command-line arguments)?
V:\videos>c:\perl\5.32.0\bin\perl -CSDA test.pl 0’s
Output:
ENV: LC_ALL=en-US.UTF-8 LC_CTYPE={unset} LANG=en_US.UTF-8
Active code page: 65001
Argument: 0s length: 3
Argument hex bytes: 30 92 73
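One plausible reason -CSDA changes nothing here: as far as I can tell, the A flag only asks Perl to treat the @ARGV elements as UTF-8, and the bytes 30 92 73 are not well-formed UTF-8 (0x92 is a continuation byte with no lead byte in front of it). The sketch below uses the built-in utf8::decode function on a copy of those bytes to illustrate that there is nothing valid for Perl to decode:

```perl
use strict;
use warnings;

# The bytes Perl actually received for the argument.
my $received = "\x30\x92\x73";

# utf8::decode works in place and returns false when the input is not
# well-formed UTF-8, leaving the string as it was.
my $copy = $received;
my $ok   = utf8::decode($copy);

print 'Valid UTF-8? ', ($ok ? 'yes' : 'no'), "\n";
print 'Length after attempted decode: ', length($copy), "\n";
```

So the switch is not being ignored; the data has already been converted to a single-byte encoding by the time Perl looks at it.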
OK, let's try removing the LC*/LANG environment variables entirely:
@SET LC_ALL=
@SET LANG=
@REM Proof that everything has been cleared
@REM Note: The caret before the vertical bar escapes it,
@REM because I have grep set up to run through a
@REM batch file and need to forward args
@set | grep -iP "LC^|LANG" || echo %errorlevel%
Output:
1
Let's try running the script with UTF-8 again:
V:\videos>c:\perl\5.32.0\bin\perl -CSDA test.pl 0’s
Output (nothing has changed, other than the LC*/LANG environment variables being cleared):
ENV: LC_ALL={unset} LC_CTYPE={unset} LANG={unset}
Active code page: 65001
Argument: 0s length: 3
Argument hex bytes: 30 92 73
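At this point I decided to step outside of Perl and see what Windows 10 itself is doing with my command-line argument. I have a small utility that I wrote in C# a while back to help with command-line argument problems, and I used it to test. The output should be self-explanatory: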
V:\videos>ShowArgs 0’s
Filename: |ShowArgs.exe|
Pathname: |c:\bin\ShowArgs.exe|
Work dir: |V:\videos|
Command line: ShowArgs 0’s
Raw command line characters:
000: |ShowArgs |: S (083:53) h (104:68) o (111:6F) w (119:77) A (065:41) r (114:72) g (103:67) s (115:73) (032:20) (032:20)
010: |0’s |: 0 (048:30) ’ (8217:2019) s (115:73)
Command line args:
00: |0’s|
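This shows a couple of things:
- The argument does not need to be quoted (I didn't think it would)
- Windows is correctly passing the argument to the application UTF-8 encoded
But for the life of me I cannot figure out why Perl is not receiving the argument as UTF-8 at this point.
Of course, as an absolute hack, if I append the following to the bottom of the Perl script, the problem goes away. But I would still like to understand why Perl is not receiving the argument as UTF-8: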
# ... Appended to original script shown at top ...
use Encode qw(encode decode);
sub recode
{
return encode('UTF-8', decode( 'cp1252', $_[0] ));
}
say "\n@{['='x60]}\n"; # Output separator
say "Original arg: $arg";
say "After recoding CP1252 -> UTF-8: ${\recode($arg)}";
Script execution:
V:\videos>c:\perl\5.32.0\bin\perl test.pl 0’s
New output:
ENV: LC_ALL=en_US.UTF-8 LC_CTYPE={unset} LANG={unset}
Active code page: 65001
Argument: 0s length: 3
Argument hex bytes: 0030 0092 0073
============================================================
Original arg: 0s
After recoding CP1252 -> UTF-8: 0’s
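A slightly more general variant of this workaround, offered only as a sketch: decode each argument from whatever ANSI code page the process is running with into Perl characters, and let an output layer do the UTF-8 encoding. It assumes Win32::GetACP() from the Win32 module is available on Windows builds of Perl. The helper names acp_encoding_name and decode_args are made up for illustration, and the fallback to UTF-8 on other systems is my assumption.

```perl
use strict;
use warnings;
use Encode qw(decode);

# Hypothetical helper: name of the encoding the command line arrived in.
# On Windows this asks for the process's ANSI code page; elsewhere it
# assumes UTF-8 (an assumption, not something the OS guarantees).
sub acp_encoding_name {
    return 'UTF-8' unless $^O eq 'MSWin32';
    require Win32;
    my $acp = Win32::GetACP();
    return $acp == 65001 ? 'UTF-8' : "cp$acp";
}

# Hypothetical helper: decode byte-string arguments into character strings.
# decode() dies if Encode does not know the requested encoding.
sub decode_args {
    my ($encoding, @args) = @_;
    return map { decode($encoding, $_) } @args;
}

binmode STDOUT, ':encoding(UTF-8)';
my @args = decode_args(acp_encoding_name(), @ARGV);
print "Decoded argument: $_ (", length($_), " characters)\n" for @args;
```

Like the recode hack, this only recovers characters that exist in the ANSI code page; anything Windows already replaced during its own conversion is lost before Perl runs.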
Update
I built a simple C++ test application to get a better handle on what is going on.
Here is the source code:
#include <cstdint>
#include <cstring>
#include <iostream>
#include <iomanip>
int main(int argc, const char *argv[])
{
if (argc!=2)
{
std::cerr << "A single command line argument is required\n";
return 1;
}
const char *arg=argv[1];
std::size_t arg_len=strlen(arg);
// Display argument as a string
std::cout << "Argument: " << arg << " length: " << arg_len << '\n';
// Display argument bytes
// Fill with leading zeroes
auto orig_fill_char=std::cout.fill('0');
std::cout << "Bytes of argument, in hex:";
std::cout << std::hex;
for (std::size_t arg_idx=0; arg_idx<arg_len; ++arg_idx)
{
// Note: The cast to uint16_t is necessary because uint8_t is formatted
// "specially" (i.e., still as a char and not as an int)
// The cast through uint8_t is necessary due to sign extension of
// the original char if going directly to uint16_t and the (signed) char
// value is negative.
// I could have also masked off the high byte after the cast, with
// insertion code like (Note: Parens required due to precedence):
// << (static_cast<uint16_t>(arg[arg_idx]) & 0x00ff)
// As they say back in Perl-land, "TMTOWTDI!", and in this case it
// amounts to the C++ version of Perl "line noise" no matter which
// way you slice it. :)
std::cout << ' '
<< std::setw(2)
<< static_cast<uint16_t>(static_cast<uint8_t>(arg[arg_idx]));
}
std::cout << '\n';
// Restore the original fill char and go back to decimal mode
std::cout << std::setfill(orig_fill_char) << std::dec;
}
The above, built as a 64-bit console application with the MBCS character set setting, was run as:
testapp.exe 0’s
..., and produced the following output:
Argument: 0s length: 3
Argument bytes: 30 92 73
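So it is Windows after all, at least to some extent. I need to build a UNICODE character set version of the app and see what I get.
Final solution
Thanks to Eryk Sun's comment on ikegami's answer, and the links in it, I have found the best solution, at least on Windows 10. I will now outline the specific steps to force Windows to send command-line arguments to Perl as UTF-8:
A manifest needs to be added to perl.exe and wperl.exe (if you use it) that tells Windows to use UTF-8 as the active code page (ACP) when executing the perl.exe application. This makes Windows pass command-line arguments to perl as UTF-8 instead of CP1252.
Changes that need to be made
Create the manifest files
Go to the location of your perl.exe (and wperl.exe) and, in that (...\bin) directory, create a file named perl.exe.manifest with the following contents: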
<?xml version="1.0" encoding="UTF-8" standalone="yes"?>
<assembly manifestVersion="1.0" xmlns="urn:schemas-microsoft-com:asm.v1">
<assemblyIdentity type="win32" name="perl.exe" version="6.0.0.0"/>
<application>
<windowsSettings>
<activeCodePage
xmlns="http://schemas.microsoft.com/SMI/2019/WindowsSettings"
>UTF-8</activeCodePage>
</windowsSettings>
</application>
</assembly>
If you also want to modify wperl.exe, copy the above file perl.exe.manifest to wperl.exe.manifest and edit that copy, replacing the assemblyIdentity line:
<assemblyIdentity type="win32" name="perl.exe" version="6.0.0.0"/>
so that the value assigned to the name attribute is wperl.exe instead of perl.exe, like so:
<assemblyIdentity type="win32" name="wperl.exe" version="6.0.0.0"/>
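Embed the manifests in the executables
The next step is to embed the manifest files we just created into their respective executables. Before doing so, be sure to back up the original executables, just in case!
The manifests can be embedded into the executables as follows:
For perl.exe: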
mt.exe -manifest perl.exe.manifest -outputresource:perl.exe;#1
For wperl.exe (optional, only needed if you use wperl.exe):
mt.exe -manifest wperl.exe.manifest -outputresource:wperl.exe;#1
If you don't already have the mt.exe executable, it is included in the Windows 10 SDK, currently available at: Download Windows 10 SDK at developer.microsoft.com
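Before moving on, it may be worth confirming that the embedded manifest took effect. The sketch below assumes that Win32::GetACP() from the Win32 module reports the code page the perl.exe process is actually running with; the helper describe_acp is hypothetical. On a patched perl.exe I would expect it to report 65001, and 1252 (or another legacy code page) on an unpatched one.

```perl
use strict;
use warnings;

# Hypothetical helper: turn a code page number into a readable verdict.
sub describe_acp {
    my ($acp) = @_;
    return $acp == 65001
        ? "ACP is 65001 (UTF-8): the manifest appears to be active"
        : "ACP is $acp: arguments will arrive in that legacy code page";
}

if ($^O eq 'MSWin32') {
    require Win32;    # normally bundled with Windows builds of Perl
    print describe_acp(Win32::GetACP()), "\n";
}
else {
    print "Not running on Windows; there is no ANSI code page to check.\n";
}
```

Run it with the same perl.exe you patched, since the manifest applies per executable rather than system-wide.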
Basic testing and usage
With the above changes in place, UTF-8 command-line arguments become very easy!
Take the following script, simple-test.pl:
use strict;
use warnings;
use v5.32; # Or whatever recent version of Perl you have
# Helper subroutine to provide simple hex table output formatting
sub hexdump
{
my ($arg)=@_;
sub BYTES_PER_LINE {16}; # Output 16 hex pairs per line
for my $chr_idx (0 .. length($arg)-1)
{
# Break into groups of 16 hex digit pairs per line, each line
# prefixed with the hex offset of its first byte
print sprintf("\n %02x: ", $chr_idx)
if $chr_idx%BYTES_PER_LINE==0;
print sprintf('%02x ',ord(substr($arg,$chr_idx,1)));
}
say '';
}
# Test app code that makes no mention of Windows, ACPs, or UTF-8 outside
# of stuff that is printed. Other than the call out to chcp to get the
# active code page for informational purposes, it is not particularly tied
# to Windows, either, as long as whatever environment it is run on
# passes the script its arg as UTF-8, of course.
my $arg=shift @ARGV or die 'No argument present';
say "Argument: $arg";
say "Argument byte length: ${\length($arg)} bytes";
print 'Argument UTF-8 data bytes in hex:';
hexdump($arg);
Let's test the script, making sure we are in the UTF-8 code page (65001):
v:\videos>chcp 65001 && perl.exe simple-test.pl "Работа с 𝟘’𝙨 vis-à-vis 0's using UTF-8"
Output (assuming your console font can handle the special characters):
Active code page: 65001
Argument: Работа с 𝟘’𝙨 vis-à-vis 0's using UTF-8
Argument byte length: 54 bytes
Argument UTF-8 data bytes in hex:
00: d0 a0 d0 b0 d0 b1 d0 be d1 82 d0 b0 20 d1 81 20
10: f0 9d 9f 98 e2 80 99 f0 9d 99 a8 20 76 69 73 2d
20: c3 a0 2d 76 69 73 20 30 27 73 20 75 73 69 6e 67
30: 20 55 54 46 2d 38
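Note that the length reported above is a byte count, because the script never decodes its argument. As a small sketch (the helper byte_and_char_counts is hypothetical, and the hex string is simply copied from the dump above), decoding those bytes with the core Encode module shows how many characters they represent:

```perl
use strict;
use warnings;
use Encode qw(decode);

# The 54 bytes from the hex dump above, copied pair by pair.
my $hex = join '', qw(
    d0 a0 d0 b0 d0 b1 d0 be d1 82 d0 b0 20 d1 81 20
    f0 9d 9f 98 e2 80 99 f0 9d 99 a8 20 76 69 73 2d
    c3 a0 2d 76 69 73 20 30 27 73 20 75 73 69 6e 67
    20 55 54 46 2d 38
);

# Hypothetical helper: return (byte count, character count) for a string
# of UTF-8 bytes; dies if the bytes are not valid UTF-8.
sub byte_and_char_counts {
    my ($bytes) = @_;
    # LEAVE_SRC keeps decode() from consuming $bytes while it checks it.
    my $chars = decode('UTF-8', $bytes,
        Encode::FB_CROAK() | Encode::LEAVE_SRC());
    return (length($bytes), length($chars));
}

my ($n_bytes, $n_chars) = byte_and_char_counts(pack 'H*', $hex);
print "$n_bytes bytes, $n_chars characters\n";   # 54 bytes, 38 characters
```

If a script needs character semantics (length, substr, regex classes), decoding @ARGV, for example with the -CA switch, is still needed even once Windows delivers UTF-8.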
I hope my solution helps others who run into this problem.
ReadConsoleW, WriteConsoleW) and transcode between UTF-16 and UTF-8. Python does this. Does Perl do it as well? - Eryk Sun
\x{92} - It is in row 9_, column _2 of the table. - Michael Goldshteyn