How can I sort CJK (Asian) characters in Perl or another programming language?

Question

11

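How can I sort Chinese, Japanese, and Korean (CJK) characters in Perl?

As far as I can tell, sorting by stroke count and then by radical seems to be the way these languages are ordered. There are also orderings based on pronunciation, but those seem less common.

I tried: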

perl -e 'print join(" ", sort qw(工 然 一 人 三 古 二 )), "\n";'
# Prints: 一 三 二 人 古 工 然 which is incorrect

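I also tried Unicode::Collate from CPAN, but its documentation says:

By default, CJK unified ideographs are ordered in Unicode codepoint order...

If I could get a database of stroke counts per character, I could easily sort all the characters, but Perl does not seem to ship one, and I could not find any module that wraps one.

If you know how to sort CJK in another language, mentioning it in an answer would be helpful.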

- Neil

1

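This is a silly question. "How do I sort Chinese words?" or "How do I sort Korean words?" would make sense, but "How do I sort CJK characters?" does not make any sense. - user181548

It is perfectly reasonable, since most character maps that support several Asian languages group Chinese, Japanese, and Korean together as "CJK". - Andy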

3 Answers

2

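A Japanese phone book is sorted phonetically (gojūon order). Kanji, however, are not arranged phonetically in Unicode, JIS, S-JIS, or EUC; only kana are in phonetic order. This means you cannot sort meaningfully without first converting to a phonetic form. For example: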

a) kanji:           東京駅
b) kana converted:  とうきょうえき
c) romanisation:    tôkyô eki

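Sorting on b) or c) gives a meaningful order; sorting on a) alone does not. You can of course run an ordinary sort function over a), but the result means nothing for Japanese.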
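To make the point concrete, here is a minimal Python sketch of sorting on b) rather than a). The reading table is hand-written for illustration: apart from 東京駅, the words and the names `READINGS` and `sort_by_reading` are assumptions of this sketch, and a real program would need a dictionary or a morphological analyser to supply the readings. Comparing kana by code point also only approximates gojūon order, since voiced and small kana would need extra handling.

```python
# Hypothetical reading table; a real program would look readings up in a
# dictionary or get them from a morphological analyser.
READINGS = {
    "東京駅": "とうきょうえき",
    "大阪": "おおさか",
    "京都": "きょうと",
    "名古屋": "なごや",
}


def sort_by_reading(words, readings):
    """Sort words by their kana reading instead of by their kanji."""
    # Kana are compared by code point, which only approximates gojuon order.
    return sorted(words, key=lambda word: readings[word])


words = ["東京駅", "大阪", "京都", "名古屋"]

# Plain sort: compares the kanji by code point.
by_kanji = sorted(words)

# Phonetic sort: compares the kana readings.
by_reading = sort_by_reading(words, READINGS)

print(by_kanji)
print(by_reading)
```

The two lists come out in different orders, which is the whole problem: the kanji code points carry no information about pronunciation.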

- kmugitani

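This answers a reasonable question, "How do you sort Japanese words?", but it does not answer the question that was actually asked, so I cannot upvote it. - user181548

@Kinopiko: Yes, I have to agree with you. The original question is not a good one. - kmugitani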

2

Check out my Ruby gem toPinyin, which converts UTF-8 encoded Chinese characters to their pinyin (pronunciation). The pinyin can then be sorted easily.

Simply run gem install toPinyin

require 'toPinyin'

words = "
人
没有
理想
跟
咸鱼
有
什么
区别
".split("\n")

words.sort! {|a ,b|   a.pinyin.join <=> b.pinyin.join }

https://github.com/pierrchen/toPinyin

- pierrotlefou

How did you get the data? - Pacerier

I don't know Ruby, but in Python it is as simple as https://github.com/avian2/unidecode. - Polv


- daxim · Accepted Answer

See TR38 for details and corner cases. It is not as easy as you think, nor as easy as this code sample makes it look.

use 5.010;
use utf8;
use Encode;
use Unicode::Unihan;
my $u = Unicode::Unihan->new;

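for my $char (qw(工 然 一 人 三 古 二)) {
    # RSUnicode returns "radical.residual", for example "86.8"
    my ($radical, $residual) = split /[.]/, $u->RSUnicode($char);
    say encode_utf8 sprintf "Character %s has the radical #%s and %d residual strokes.",
        $char, $radical, $residual;
}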
__END__
Character 工 has the radical #48 and 0 residual strokes.
Character 然 has the radical #86 and 8 residual strokes.
Character 一 has the radical #1 and 0 residual strokes.
Character 人 has the radical #9 and 0 residual strokes.
Character 三 has the radical #1 and 2 residual strokes.
Character 古 has the radical #30 and 2 residual strokes.
Character 二 has the radical #7 and 0 residual strokes.

See http://en.wikipedia.org/wiki/List_of_Kangxi_radicals for the mapping from radical number to stroke count.
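Once you have the radical number and residual strokes, the sort itself is short. The Python sketch below is only an illustration under stated assumptions: the radical/residual pairs are copied from the output above, the radical stroke counts are hand-copied from the Kangxi radical list for just the six radicals involved, and "radical strokes + residual strokes" is used as an approximation of the total stroke count (it can be off for characters written with a variant form of the radical; Unihan also defines a separate kTotalStrokes field). The names `RADICAL_AND_RESIDUAL`, `RADICAL_STROKES`, and `stroke_then_radical_key` are made up for this sketch.

```python
# (radical number, residual strokes), copied from the RSUnicode output above.
RADICAL_AND_RESIDUAL = {
    "工": (48, 0), "然": (86, 8), "一": (1, 0), "人": (9, 0),
    "三": (1, 2), "古": (30, 2), "二": (7, 0),
}

# Stroke count of each Kangxi radical used here (hand-copied; only the six
# radicals that occur in the sample are listed).
RADICAL_STROKES = {1: 1, 7: 2, 9: 2, 30: 3, 48: 3, 86: 4}


def total_strokes(char):
    """Approximate total strokes as radical strokes plus residual strokes."""
    radical, residual = RADICAL_AND_RESIDUAL[char]
    return RADICAL_STROKES[radical] + residual


def stroke_then_radical_key(char):
    """Sort key: total stroke count first, radical number as tie-breaker."""
    radical, _ = RADICAL_AND_RESIDUAL[char]
    return (total_strokes(char), radical)


chars = "工 然 一 人 三 古 二".split()

# Stroke count, then radical: the order the question asks for.
by_strokes = sorted(chars, key=stroke_then_radical_key)

# Radical number, then residual strokes: dictionary-style order.
by_radical = sorted(chars, key=lambda c: RADICAL_AND_RESIDUAL[c])

print(" ".join(by_strokes))   # 一 二 人 三 工 古 然
print(" ".join(by_radical))   # 一 三 二 人 古 工 然
```

Note that for these seven characters the radical-then-residual order is exactly what the plain sort in the question printed, so that output is not random; it just is not the stroke-count-first order the question wants.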