无需使用UTF8标记的字符串解码JSON

5

我正在使用一些期望拉丁字母表的字符串数据(并忽略perl的UTF8标志)的XS模块。在某些情况下,我正在传递JSON解码的结果,它应该仅包含拉丁字母表中的字符,但在某些情况下已被转义(例如["co\u00f6perative"])。

是否有一种JSON解码模块提供了一个选项,可以降级返回字符串(至少在可能的情况下)?我没有在JSON、JSON::XS或Cpanel::JSON::XS中找到这样的选项。

use strict;
use warnings;
use Cpanel::JSON::XS;
use Devel::Peek;
my $got = Cpanel::JSON::XS->new->decode('["co\u00f6perative"]')->[0];
Dump $got;
my $wanted = $got;
utf8::downgrade($wanted);
Dump $wanted;

输出:

SV = PV(0xd6cbf0) at 0xd8a460
  REFCNT = 1
  FLAGS = (POK,IsCOW,pPOK,UTF8)
  PV = 0xd83b40 "co\303\266perative"\0 [UTF8 "co\x{f6}perative"]
  CUR = 12
  LEN = 14
  COW_REFCNT = 0
SV = PV(0xd6cb20) at 0xd977f0
  REFCNT = 1
  FLAGS = (POK,pPOK)
  PV = 0xe0d120 "co\366perative"\0
  CUR = 11
  LEN = 14

可能不是。你是否不愿意只是遍历数据结构并重新编码字符串? - Borodin
1
@Borodin,恐怕我最终会这样做;只是问一下,以防有人在某个我还没找到的JSON模块中解决了这个问题。 - ysth
如果有一个JSON解析器在各种语法点上提供钩子,那就太好了。没有缺少做到这一点的XML解析器,甚至File::Find也有类似的功能。也许我会写点什么。 - Borodin
2个回答

7
你可以使用猴子补丁(JSON::PP)来实现所需的效果。
use JSON::PP qw( );

use vars qw( $JSON_PP_DOWNGRADE );

BEGIN {
   $JSON_PP_DOWNGRADE //= 0;
   my $old_string = \&JSON::PP::string;
   my $new_string = sub {
      my $s = $old_string->(@_);
      utf8::downgrade($s) if $JSON_PP_DOWNGRADE;
      $s
   };

   no warnings qw ( redefine );
   *JSON::PP::string = $new_string;
}

如果您希望JSON::PP生成“降级结构”,请在调用decode之前添加以下内容:

local $JSON_PP_DOWNGRADE = 1;

5
最安全的方法是在事后修复数据结构。
# The following apply to each of decode_struct_inplace, encode_struct_inplace, downgrade_struct_inplace and upgrade_struct_inplace:
# - Errors are silently ignored. The scalar is left unchanged.
# - Recognizes references to arrays, hashes and scalars. More esoteric references won't processed, and a warning will be issued.
# - Overloaded objects and magical variables are not supported. They may induce incorrect behaviour.
# - The structure is changed in-place. You can use Storable::dclone to make a copy first if need be.
# - For convenience, returns its argument.

# Decodes all strings in a data structure from UTF-8 to Unicode Code Points.
sub decode_struct_inplace { _convert_struct_inplace($_[0], \&utf8::decode) }

# Encodes all strings in a data structure from Unicode Code Points to UTF-8.
sub encode_struct_inplace { _convert_struct_inplace($_[0], \&utf8::encode) }

# "Downgrades" the string storage format of all scalars containing strings in
# a data structure to the UTF8=0 format if they aren't already in that format.
sub downgrade_struct_inplace { _convert_struct_inplace($_[0], \&utf8::downgrade) }

# "Upgrades" the string storage format of all scalars containing strings in
# a data structure to the UTF8=1 format if they aren't already in that format.
sub upgrade_struct_inplace { _convert_struct_inplace($_[0], \&utf8::upgrade) }

sub _convert_struct_inplace {
    # Make $arg an alias to $_[0]. Changes to $arg (like changes to $_[0]) will be reflected in the parent.
    our $arg; local *arg = \shift;
    my $converter        =  shift;

    my $caller = (caller(1))[3];
    $caller =~ s/^.*:://;    # /

    my %seen;    # Only decode each variable once.
    my %warned;  # Only emit each warning once.

    # Using "my" would introduce a memory cycle we'd have to work to break to avoid a memory leak.
    local *_visitor = sub {
        # Make $arg an alias to $_[0]. Changes to $arg (like changes to $_[0]) will be reflected in the parent.
        our $arg; local *arg = \$_[0];

        # Don't decode the same variable twice.
        # Also detects referential loops.
        return $arg if $seen{refaddr(\$arg)}++;

        my $reftype = reftype($arg);
        if (!defined($reftype)) {
            if (defined($arg)) {
                my $sv = B::svref_2object(\$arg);  # Meta object.
                if ($sv->isa('B::PV') && ($sv->FLAGS & B::SVf_POK)) {  # Can it contain a string? And does it?
                    $converter->($arg);
                }
            }
        }
        elsif ($reftype eq 'ARRAY') {
            _visitor($_) for @$arg;
        }
        elsif ($reftype eq 'HASH') {
            # Usually, we can avoid converting the keys.
            my $ascii = 1;
            for (keys(%$arg)) {
                if (/[^\x00-\x7F]/) {
                    $ascii = 0;
                    last;
                }
            }

            if (!$ascii) {
                %$arg = map {
                        $converter->( my $new_key = $_ );
                        $new_key => $arg->{$_}
                    } keys(%$arg);
            }

            _visitor($_) for values(%$arg);
        }
        elsif ($reftype eq 'SCALAR') {
            _visitor($$arg);
        }
        elsif ($reftype eq 'REF') {
            _visitor($$arg);
        }
        else {
            warn("Reference type $reftype not supported by $caller\n")
                if !$warned{$reftype}++;
        }

        return $arg;
    };

    return _visitor($arg);
}

这是现有的代码,可以简化一下,因为它处理的内容不在由JSON模块创建的数据结构中。


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接