Problem comparing Japanese characters

Question

I am struggling to use HTML::TokeParser to parse an HTML document that contains Japanese characters.

Here is my code:

use utf8;

use Encode qw(decode encode is_utf8);
use Encode::Guess;
use Data::Dumper;
use LWP::UserAgent;
use HTTP::Cookies;
use Cwd;
use HTML::TokeParser;

my $local_dir = getcwd;

my $browser = LWP::UserAgent->new();

my $cookie_jar = HTTP::Cookies->new(
   file     => $local_dir . "/cookies.lwp",
   autosave => 1,
);

$browser->cookie_jar( $cookie_jar );

push @{ $browser->requests_redirectable }, 'POST';
$browser->requests_redirectable;

my $response = $browser->get("http://www.yahoo.co.jp/");
my $html = $response->content;
print $html;
utf8::decode($html);

my $p = HTML::TokeParser->new( \$html );

# dispatch table with subs to handle the different types of tokens

my %dispatch = (
   S  => sub { $_[0]->[4] }, # Start tag
   E  => sub { $_[0]->[2] }, # End tag
   T  => sub { $_[0]->[1] }, # Text
   C  => sub { $_[0]->[1] }, # Comment
   D  => sub { $_[0]->[1] }, # Declaration
   PI => sub { $_[0]->[2] }, # Process Instruction
);

while ( my $token = $p->get_tag('a') ) {
        print $p->get_trimmed_text if $p->get_trimmed_text eq '社会的責任';
        print "\n";
}

This prints nothing in my terminal, but if I just run print $p->get_trimmed_text, the output is fine.

Here are a few lines of hex dump corresponding to print $p->get_trimmed_text:

0000000 490a 746e 7265 656e 2074 7845 6c70 726f
0000010 7265 81e3 e4ae 92ba 8fe6 e89b a8a1 a4e7
0000020 e3ba ab81 81e3 e3a4 8481 81e3 0aa6 9fe7
0000030 e5b3 9db7 81e9 e3bc 8982 9be5 e5bd 8586
0000040 a4e5 e396 ae81 83e3 e397 ad83 82e3 e3b4
0000050 ab83 83e3 e395 a182 83e3 e3bc 8c81 86e7
0000060 e68a ac9c 94e6 e6af b48f 320a e334 ab82
0000070 89e6 e380 ae81 b4e7 e885 8991 90e5 e68d
0000080 8089 82e3 e692 a597 b8e5 e3b0 8a82 82e3
0000090 e3b3 bc83 82e3 e4b9 95bb abe7 e38b a681
00000a0 81e3 e7a7 b9b4 bbe4 0a8b 83e3 e39e af82
00000b0 83e3 e389 8a83 83e3 e3ab 8983 82e3 e384
00000c0 8783 83e3 e38b bc83 82e3 e3ba ae81 81e3
00000d0 e58a 97be 81e3 e3aa af82 83e3 e3bc 9d83
00000e0 83e3 e9b3 8d85 bfe4 0aa1 a8e8 e88e 96ab
00000f0 bce4 e39a 8c80 83e3 e392 a983 83e3 e3aa
0000100 bc83 b0e6 e58f 9d8b 88e5 e3a9 8d80 3235
0000110 e525 9986 9ce7 4e9f 5745 e50a a7a4 98e9

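It looks like the comparison is not working.

I can only use HTML::TokeParser, because it is the only module installed on the server and I cannot install anything else.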

- user2360915

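Does the HTML page use a different Unicode normalization than the string in your code? See http://www.modernperlbooks.com/mt/2013/01/why-unicode-normalization-matters.html. - tripleee

How can I verify that? If I use this code from the link: use Unicode::Normalize; use open qw/:std :utf8/; the output turns into garbage even without the comparison. - user2360915

Oh, what kind of garbage? Which Perl version are you using? Anyway, this was just an idea; I don't really pretend to know what is wrong here. - tripleee

Perl v5.20.2. - user2360915

If the encoding cannot be guessed, a hex dump of a few bytes would be more helpful. See also the Stack Overflow character-encoding tag wiki for some tips. - tripleee

OK. The garbage appears even without utf8::decode($html). But the comparison still does not work. - user2360915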

2 Answers

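Please see ikegami's answer. My approach is only a workaround and does not fix the actual problem in your code.

Unicode::Collate to the rescue!

Note that I have added the following to your code.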

use Unicode::Collate;
use open qw/:std :utf8/;
my $Collator = Unicode::Collate->new();
sub compare_strs
{
    my ( $str1, $str2 ) = @_;
    # Treat vars as strings by quoting. 
    # Possibly incorrect/irrelevant approach. 
    return $Collator->cmp("$str1", "$str2");
}

Note: the compare_strs subroutine returns 1 (when $str1 is greater than $str2), 0 (when $str1 equals $str2), or -1 (when $str1 is less than $str2).
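The return values described above can be checked in isolation. The following is a minimal sketch, assuming only the core Unicode::Collate module with its default collation table; the variable names are illustrative and not part of the original answer.

```perl
use strict;
use warnings;
use utf8;
use Unicode::Collate;

my $collator = Unicode::Collate->new();

# cmp returns -1, 0, or 1, in the same way as Perl's built-in cmp operator.
my $less    = $collator->cmp('a', 'b');
my $equal   = $collator->cmp('社会的責任', '社会的責任');
my $greater = $collator->cmp('b', 'a');

# Because "equal" is 0 (false), a plain truth test on cmp reads backwards.
# The eq method returns a true value on equality and is easier to read.
my $same = $collator->eq('社会的責任', '社会的責任');

print "less=$less equal=$equal greater=$greater same=",
    ($same ? 'yes' : 'no'), "\n";
```

This is why the full script below tests the result with unless: a return value of 0 means the two strings collate as equal.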

Here is the complete working code:

use strict;
use warnings;
use utf8;
use Unicode::Collate;
use open qw/:std :utf8/;
use Encode qw(decode encode is_utf8);
use Encode::Guess;
use Data::Dumper;
use LWP::UserAgent;
use HTTP::Cookies;
use Cwd;
use HTML::TokeParser;
my $local_dir = getcwd;
my $browser = LWP::UserAgent->new();
my $cookie_jar = HTTP::Cookies->new(
   file     => $local_dir . "/cookies.lwp",
   autosave => 1,
);
$browser->cookie_jar( $cookie_jar );
push @{ $browser->requests_redirectable }, 'POST';
$browser->requests_redirectable;
my $Collator = Unicode::Collate->new();
sub compare_strs
{
    my ( $str1, $str2 ) = @_;
    # Treat vars as strings by quoting. 
    # Possibly incorrect/irrelevant approach. 
    return $Collator->cmp("$str1", "$str2");
}
my $response   = $browser->get("http://www.yahoo.co.jp/");
my $html = $response->content;
#print $html;
utf8::decode($html);
my $p = HTML::TokeParser->new( \$html );

# dispatch table with subs to handle the different types of tokens
my %dispatch = (
   S  => sub { $_[0]->[4] }, # Start tag
   E  => sub { $_[0]->[2] }, # End tag
   T  => sub { $_[0]->[1] }, # Text
   C  => sub { $_[0]->[1] }, # Comment
   D  => sub { $_[0]->[1] }, # Declaration
   PI => sub { $_[0]->[2] }, # Process Instruction
);

my $string = '社会的責任';
while ( my $token = $p->get_tag('a') ) {
        my $text = $p->get_trimmed_text;
        unless (compare_strs($text, $string)){
          print $text;
          print "\n";
        }
}

Output:

chankeypathak@perl:~/Desktop$ perl test.pl 
社会的責任

- Chankey Pathak

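Wow! Thank you so much! Awesome. You saved my day. - user2360915

Hmm, I just tried what works for Yahoo against a local server and it does not work there: compare_strs always returns -1 :( - user2360915

I suggest you post a new question with a Minimal, Complete, and Verifiable Example. - Chankey Pathak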

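I found the problem. The encoding returned by the local server was not UTF-8. Thanks to Encode::Guess, I managed to convert it to UTF-8! Thanks again for your help! - user2360915

There is absolutely no reason to use Unicode::Collate here. - ikegami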


- ikegami · Accepted Answer

You expect both calls to $p->get_trimmed_text to return the same string, but each call advances the parser and returns a different token. Replace

print $p->get_trimmed_text if $p->get_trimmed_text eq '社会的責任';

with

my $text = $p->get_trimmed_text;
print $text if $text eq '社会的責任';
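To see the effect without fetching a page, here is a small sketch that parses an inline HTML string. The markup and the link paths are hypothetical stand-ins for the real page; only HTML::TokeParser's get_tag and get_trimmed_text are used.

```perl
use strict;
use warnings;
use utf8;
use HTML::TokeParser;

# Hypothetical inline markup standing in for the fetched page.
my $html = '<a href="/csr">社会的責任</a> <a href="/about">会社概要</a>';
my $p    = HTML::TokeParser->new(\$html);

$p->get_tag('a');                   # position the parser just after the first <a>
my $first  = $p->get_trimmed_text;  # consumes the link text
my $second = $p->get_trimmed_text;  # the next token is </a>, so no text is left

binmode(STDOUT, ':encoding(UTF-8)');
printf "first=[%s] second=[%s]\n", $first, $second;
```

In the original one-liner the condition is evaluated first and consumes the link text, so the call inside print has nothing left to return.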

You should not assume that the HTML is encoded in UTF-8. Replace

my $html = $response->content;
utf8::decode($html);

with

my $html = $response->decoded_content;
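The sketch below illustrates why the UTF-8 assumption can break the comparison. It uses only the core Encode module and assumes, purely for illustration, that the server sent Shift_JIS bytes; decoded_content performs the corresponding decoding step for you, based on the charset the response declares.

```perl
use strict;
use warnings;
use utf8;
use Encode qw(decode encode);

my $target = '社会的責任';

# Assumption for illustration: the page arrived as Shift_JIS bytes.
my $bytes = encode('Shift_JIS', $target);

# utf8::decode returns false and leaves the string untouched
# when the bytes are not valid UTF-8.
my $copy    = $bytes;
my $utf8_ok = utf8::decode($copy);

# Decoding with the correct encoding yields characters that compare equal.
my $text    = decode('Shift_JIS', $bytes);
my $matches = $text eq $target;

print 'utf8::decode succeeded: ', ($utf8_ok ? 'yes' : 'no'), "\n";
print 'decoded text matches:   ', ($matches ? 'yes' : 'no'), "\n";
```

Note that utf8::decode does not raise an error on failure, so the script carries on comparing raw bytes against a character string.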

Also, you need to encode your output. One way to do that is to add the following:

use open ':std', ':encoding(UTF-8)';
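What an :encoding(UTF-8) layer does to output can be observed with an in-memory filehandle. This is a minimal sketch rather than part of the original answer: it applies the layer explicitly to one handle, whereas the open pragma with :std applies it to STDIN, STDOUT, and STDERR.

```perl
use strict;
use warnings;
use utf8;

my $text = '社会的責任';

# Write through an explicit :encoding(UTF-8) layer into an in-memory buffer.
my $buffer = '';
open(my $out, '>:encoding(UTF-8)', \$buffer) or die "open failed: $!";
print {$out} $text;
close($out) or die "close failed: $!";

my $chars = length $text;    # number of characters in the decoded string
my $bytes = length $buffer;  # number of encoded bytes actually written

print "$chars characters were written as $bytes bytes\n";
```

Without such a layer, printing a decoded string triggers a "Wide character" warning, and the bytes Perl emits are not guaranteed to be what the terminal expects.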