Testing Unicode handling of query strings in Perl

Question

3

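I'm trying to write an example that tests query string parsing, but I've run into trouble with Unicode. In short, the "Omega" character (Ω) doesn't seem to decode correctly.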

Unicode: U+2126
3-byte sequence: \xe2\x84\xa6
URI-encoded: %E2%84%A6

So I wrote this test program to verify that I can "decode" Unicode query strings with URI::Encode.

use strict;
use warnings;
use utf8::all;    # use before Test::Builder clones STDOUT, etc.
use URI::Encode 'uri_decode';
use Test::More;

sub parse_query_string {
    my $query_string = shift;
    my @pairs = split /[&;]/ => $query_string;

    my %values_for;
    foreach my $pair (@pairs) {
        my ( $key, $value ) = split( /=/, $pair );
        $_ = uri_decode($_) for $key, $value;
        $values_for{$key} ||= [];
        push @{ $values_for{$key} } => $value;
    }
    return \%values_for;
}

my $omega = "\N{U+2126}";
my $query = parse_query_string('alpha=%E2%84%A6');
is_deeply $query, { alpha => [$omega] }, 'Unicode should decode correctly';

diag $omega;
diag $query->{alpha}[0];

done_testing;

The output of the test is:

query.t .. 
not ok 1 - Unicode should decode correctly
#   Failed test 'Unicode should decode correctly'
#   at query.t line 23.
#     Structures begin differing at:
#          $got->{alpha}[0] = 'â¦'
#     $expected->{alpha}[0] = 'Ω'
# Ω
# â¦
1..1
# Looks like you failed 1 test of 1.
Dubious, test returned 1 (wstat 256, 0x100)
Failed 1/1 subtests 

Test Summary Report
-------------------
query.t (Wstat: 256 Tests: 1 Failed: 1)
  Failed test:  1
  Non-zero exit status: 1
Files=1, Tests=1,  0 wallclock secs ( 0.03 usr  0.01 sys +  0.05 cusr  0.00 csys =  0.09 CPU)
Result: FAIL

It looks to me as though URI::Encode may be broken here, but switching to URI::Escape and using its uri_unescape function reports the same error. What am I missing?

- Ovid

1

The CGI module offers the -utf8 import pragma to decode input automatically. This works: perl -e'use CGI qw(-utf8); my $cgi = CGI->new("alpha=%E2%84%A6"); use Devel::Peek; Dump $cgi->param("alpha")'. Note the caveats mentioned in the documentation. - daxim

4 Answers

5

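URI escapes represent octets and know nothing about character encodings, so you need to decode from UTF-8 bytes to characters yourself, for example: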

$_ = decode_utf8(uri_decode($_)) for $key, $value;
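
As a minimal sketch of how that line might slot into the parser from the question: the version below uses only core modules, so the hypothetical helper percent_decode stands in for uri_decode / uri_unescape (an assumption made to keep the example self-contained; it is not part of either library). decode_utf8 comes from the core Encode module.

```perl
use strict;
use warnings;
use Encode qw(decode_utf8);

# Hypothetical stand-in for uri_decode/uri_unescape: turns each %XX
# escape into the single octet it names and does nothing else.
sub percent_decode {
    my ($string) = @_;
    $string =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
    return $string;
}

sub parse_query_string {
    my ($query_string) = @_;

    my %values_for;
    for my $pair ( split /[&;]/, $query_string ) {
        my ( $key, $value ) = split /=/, $pair, 2;
        $value = '' unless defined $value;    # tolerate "key" with no "="

        # Step 1: undo the percent-encoding (result: UTF-8 octets).
        # Step 2: decode those octets into Perl characters.
        $_ = decode_utf8( percent_decode($_) ) for $key, $value;

        push @{ $values_for{$key} }, $value;
    }
    return \%values_for;
}

my $query = parse_query_string('alpha=%E2%84%A6');
print $query->{alpha}[0] eq "\x{2126}" ? "match\n" : "mismatch\n";
```

With the extra decode_utf8 step, the parsed value is the single character U+2126, so the is_deeply comparison in the question should pass.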

- ilmari

4

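The problem is that your description of the problem contains an incorrect detail. What you are actually dealing with is:

Unicode code point: U+2126
UTF-8 encoding of the code point: \xe2\x84\xa6
URI encoding of the UTF-8 encoding of the code point: %E2%84%A6

The problem is that you only undid one of the encodings.

Solutions have already been suggested; I just wanted to offer an alternative explanation.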
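
The two layers can be made visible by building the query value forwards and then peeling it back. This is only an illustrative sketch using core modules; the percent-encoding is done by hand with a substitution rather than with URI::Encode, which is an assumption made for self-containment.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8 decode_utf8);

# Forwards: character -> UTF-8 octets -> percent-escapes.
my $char  = "\x{2126}";            # one character
my $bytes = encode_utf8($char);    # three octets: E2 84 A6
( my $escaped = $bytes ) =~ s/(.)/sprintf('%%%02X', ord($1))/seg;

printf "%d character -> %d octets -> %s\n",
    length($char), length($bytes), $escaped;

# Backwards: both layers must be undone, in the opposite order.
( my $unescaped = $escaped ) =~ s/%([0-9A-Fa-f]{2})/chr(hex($1))/eg;
my $decoded = decode_utf8($unescaped);

print $unescaped eq $bytes ? "layer 1 undone: octets recovered\n"    : "layer 1 failed\n";
print $decoded   eq $char  ? "layer 2 undone: character recovered\n" : "layer 2 failed\n";
```

Stopping after the first step leaves $unescaped holding three octets, which is the state the question's test compares against a one-character string.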

- ikegami

0

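I suggest looking at Why does modern Perl avoid UTF-8 by default? for an in-depth discussion of this topic.

I'd add to the discussion there:

You'll notice a lot of odd glyphs on that page. The author did that deliberately.
I tried the Symbola font recommended in the thread, and it looked terrible on Win 7. Your mileage may vary.
Reading Why does modern Perl avoid UTF-8 by default? frequently may lead to depression and persistent doubts about your life choices.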

- converter42

1

I've read that, and I think tchrist's answer is excellent. - Ovid

- miyagawa · Accepted Answer

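URI-encoded characters simply represent a UTF-8 sequence, and URI::Encode and URI::Escape merely decode them into a UTF-8 byte string; neither decodes that byte string as UTF-8 into characters (which is the correct behaviour for a general-purpose URI decoding library).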

In other words, your code is basically doing: is "\N{U+2126}", "\xe2\x84\xa6", and that fails because, upon comparison, Perl upgrades the latter to a 3-character Latin-1 string.
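
A small sketch of that comparison, using nothing beyond core Perl, shows what each side contains. The helper code_points is a hypothetical name introduced here purely for illustration.

```perl
use strict;
use warnings;

# Hypothetical helper: list the code point of every character in a string.
sub code_points {
    my ($string) = @_;
    return join ' ', map { sprintf 'U+%04X', ord } split //, $string;
}

my $omega  = "\x{2126}";        # what the test expects: one character
my $octets = "\xe2\x84\xa6";    # what percent-decoding alone returns

printf "expected: %d char(s): %s\n", length($omega),  code_points($omega);
printf "got:      %d char(s): %s\n", length($octets), code_points($octets);
print $omega eq $octets ? "equal\n" : "not equal\n";
```

The second string is three characters long (U+00E2, U+0084, U+00A6), which matches the garbled value shown in the failing test output.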

You have to manually decode the input values with Encode::decode_utf8 after uri_decode, or else compare against the encoded UTF-8 byte sequence instead.
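
The second option, comparing at the byte level, might look like the sketch below. It assumes the percent-decoded value is available as a plain octet string (written out literally here in place of a real uri_decode call) and uses encode_utf8 from the core Encode module to bring the expected character down to the same level.

```perl
use strict;
use warnings;
use Encode qw(encode_utf8);

# Assumption: this is what uri_decode('%E2%84%A6') hands back, i.e. raw octets.
my $got = "\xe2\x84\xa6";

# Encode the expected character so both sides are UTF-8 octet strings.
my $expected = encode_utf8("\x{2126}");

print $got eq $expected ? "bytes match\n" : "bytes differ\n";
```

Decoding the input is usually the more convenient choice when the values will be used as text afterwards; comparing bytes is enough when the test only needs to confirm what arrived on the wire.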