Regex and index do not match Unicode characters

Question

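I am writing a library in which one function returns a string. That string causes problems when I try to find Unicode characters in it with a regular expression or the index function. When printed to the Sublime Text console, the string displays its Unicode characters correctly, for example: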

<xml>V日한ế</xml>

I am trying to match it like this: $string =~ m/V日한ế/, and I have use utf8 in effect.

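I'm sorry that I can't provide a minimal broken example: when I construct the string myself and try to match it, everything works. I tried the hexdump function from this site, but it prints the same hex sequence for the Unicode characters in the string returned by the library and in the string I constructed ($string2 = 'V日한ế'): 56 e6 97 a5 ed 95 9c e1 ba bf. The string from the library has the UTF-8 flag off while the constructed one has it on, but another test showed that this is not the problem.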

I have only one clue to the source of the problem: the output of use re 'debug';. It gives the following message:

Matching REx "V%x{65e5}%x{d55c}%x{1ebf}" against "%n<xml>V%x{e6}%x{97}%x{a5}%x{ed}%x{95}%x{9c}%x{e1}%x{ba}"...

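In the regex, the character 日 is printed as %x{65e5}, whereas in the problematic string the same character is printed as %x{e6}%x{97}%x{a5}. The other Unicode characters are also printed differently.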

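Can someone experienced in debugging strings and encodings tell me why regexes and index cannot find Unicode characters that I know are in the string, and how to make these functions find them?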

- Nate Glenn

It looks like your input is UTF-16 encoded; decode it into Perl's internal format. - Suic

1 Answer

- amon · Accepted Answer

Let's create a reproducible test case:

Generating an input file:

$ perl -E'say "<xml>V\xe6\x97\xa5\xed\x95\x9c\xe1\xba\xbf</xml>"' >test.xml
$ cat test.xml
<xml>V日한ế</xml>

This writes some bytes to a file. Note that my terminal emulator uses UTF-8.

Trying to naively match the input:

$ cat test.pl
use strict; use warnings; use utf8; use autodie; use feature 'say';
open my $fh, "<", shift @ARGV;

my $s = <$fh>;
say "$s ", $s =~ m/V日한ế/ ? "matches" : "doesn't match";
say "string = ", map { sprintf "\\x{%x}", ord } split //, $s;
$ perl test.pl test.xml
<xml>V日한ế</xml>
 doesn't match
string = \x{3c}\x{78}\x{6d}\x{6c}\x{3e}\x{56}\x{e6}\x{97}\x{a5}\x{ed}\x{95}\x{9c}\x{e1}\x{ba}\x{bf}\x{3c}\x{2f}\x{78}\x{6d}\x{6c}\x{3e}\x{a}

Oh, so the string from the file is seen as a string of bytes, not properly decoded codepoints. Who would have guessed?
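When the bytes come from a library call rather than from a file you open yourself, there is no filehandle to put a layer on. In that case the string can be decoded explicitly with the core Encode module. The following is a minimal sketch, assuming the library returns raw UTF-8 bytes; the variable $bytes is a hypothetical stand-in for that return value, and the pattern is written with \x{...} escapes so the example does not depend on the encoding of the source file:

```perl
use strict; use warnings;
use Encode qw(decode);

# Hypothetical stand-in for the library's return value: raw UTF-8 bytes.
my $bytes = "<xml>V\xe6\x97\xa5\xed\x95\x9c\xe1\xba\xbf</xml>";

# Turn the UTF-8 bytes into a string of codepoints.
my $chars = decode('UTF-8', $bytes);

# The pattern is V, U+65E5, U+D55C, U+1EBF (the four characters from the question).
sub verdict { $_[0] =~ m/V\x{65e5}\x{d55c}\x{1ebf}/ ? "matches" : "doesn't match" }

print "bytes:   ", verdict($bytes), "\n";   # bytes:   doesn't match
print "decoded: ", verdict($chars), "\n";   # decoded: matches
```

The same decoded string should also work with index, since both sides are then sequences of codepoints.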

Let's add the :utf8 PerlIO-layer:

$ cat test-utf8.pl
use strict; use warnings; use utf8; use autodie; use feature 'say';
open my $fh, "<:utf8", shift @ARGV;

my $s = <$fh>;
say "$s ", $s =~ m/V日한ế/ ? "matches" : "doesn't match";
say "string = ", map { sprintf "\\x{%x}", ord } split //, $s;
$ perl test-utf8.pl test.xml
Wide character in say at test-utf8.pl line 5, <$_[...]> line 1.
<xml>V日한ế</xml>
 matches
string = \x{3c}\x{78}\x{6d}\x{6c}\x{3e}\x{56}\x{65e5}\x{d55c}\x{1ebf}\x{3c}\x{2f}\x{78}\x{6d}\x{6c}\x{3e}\x{a}

Now it matches, because we have read the correctly decoded codepoints from the file.
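The "Wide character in say" warning in the output above appears because STDOUT has no encoding layer, so perl has to guess how to write codepoints above 0xFF. One common way to silence it, sketched here under the assumption that the terminal expects UTF-8, is to put an encoding layer on STDOUT as well:

```perl
use strict; use warnings; use feature 'say';

# Encode everything written to STDOUT as UTF-8 (assumes a UTF-8 terminal).
binmode STDOUT, ':encoding(UTF-8)';

# Decoded codepoints: V, U+65E5, U+D55C, U+1EBF.
my $s = "V\x{65e5}\x{d55c}\x{1ebf}";

# The string is encoded on the way out, so no "Wide character" warning is raised.
say $s;
```

The same ':encoding(UTF-8)' layer can be used in open in place of ':utf8'; it additionally validates the input while decoding.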

Do you get the same output? If your output is not comparable, which perl/OS combination are you using? (This was perl 5.18.1 on Ubuntu GNU/Linux.)

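There is one more problem with this code: there are multiple ways to represent ế. You should therefore normalize the string in both the regex and the input: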

use Unicode::Normalize 'NFC';
my $regex_body = NFC "V日한ế";
my $s          = NFC scalar <$fh>;

... m/\Q$regex_body/ ...
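To illustrate why normalization matters, here is a small sketch comparing two representations of ế: the precomposed codepoint U+1EBF and the decomposed sequence e, U+0302 (combining circumflex), U+0301 (combining acute). The variable names are illustrative only:

```perl
use strict; use warnings;
use Unicode::Normalize qw(NFC);

my $precomposed = "\x{1ebf}";          # single codepoint
my $decomposed  = "e\x{302}\x{301}";   # base letter plus two combining marks

# Compared codepoint by codepoint, the two strings differ.
print "raw equal: ", ($precomposed eq $decomposed ? "yes" : "no"), "\n";            # raw equal: no

# After NFC normalization both collapse to the same single codepoint.
print "NFC equal: ", (NFC($precomposed) eq NFC($decomposed) ? "yes" : "no"), "\n";  # NFC equal: yes
```

Both strings render identically on screen, which is why a match can fail even when the text looks the same.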