What are the upper and lower bounds of Chinese characters in UTF-8?

Question


I would like to use Python to build a set containing the ord() of every Chinese character.

For English, the equivalent code would be:

english = (set(range(ord('a'), ord('z') + 1)) |
           set(range(ord('A'), ord('Z') + 1)))

- 0x90


You don't want to work in UTF-8 directly; generate Unicode code points and convert them to UTF-8 instead. - Mark Ransom


You can find what you need here: http://unicode.org/charts/ - Mark Ransom


Chinese characters are scattered across several different blocks in Unicode. - Ignacio Vazquez-Abrams

There are many Chinese ranges, but some platforms (sadly, not Python) let you query the code point ranges of a script. - tripleee

1 Answer


- Ian Clelland · Accepted Answer

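According to the Unicode Standard (v6.0, Section 12.1),

Han ideographic characters are found in seven main blocks of the Unicode Standard, as shown in Table 12-2.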

Table 12-2. Blocks Containing Han Ideographs

Block                                   | Range       | Comment
----------------------------------------+-------------+-----------------------------------------------------
CJK Unified Ideographs                  | 4E00–9FFF   | Common
CJK Unified Ideographs Extension A      | 3400–4DBF   | Rare
CJK Unified Ideographs Extension B      | 20000–2A6DF | Rare, historic
CJK Unified Ideographs Extension C      | 2A700–2B73F | Rare, historic
CJK Unified Ideographs Extension D      | 2B740–2B81F | Uncommon, some in current use
CJK Compatibility Ideographs            | F900–FAFF   | Duplicates, unifiable variants, corporate characters
CJK Compatibility Ideographs Supplement | 2F800–2FA1F | Unifiable variants

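There are also a few small extensions to the original unified repertoire (the URO, 4E00–9FA5). These code points already lie inside the 4E00–9FFF block listed above: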

Table 12-3. Small Extensions to the URO

Range     | Version | Comment
----------+---------+-------------------------------------------------
9FA6–9FB3 | 4.1     | Interoperability with HKSCS standard
9FB4–9FBB | 4.1     | Interoperability with GB 18030 standard
9FBC–9FC2 | 5.1     | Interoperability with commercial implementations
9FC3      | 5.1     | Correction of mistaken unification
9FC4–9FC6 | 5.2     | Interoperability with ARIB standard
9FC7–9FCB | 5.2     | Interoperability with HKSCS standard

To build a set of these ordinal values using set operations, you could do:

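# set().union() accepts any iterables, so this works on both Python 2 and Python 3
# (range objects cannot be concatenated with + on Python 3).
chinese = set().union(
    range(0x4E00, 0xA000),    # CJK Unified Ideographs
    range(0x3400, 0x4DC0),    # Extension A
    range(0x20000, 0x2A6E0),  # Extension B
    range(0x2A700, 0x2B740),  # Extension C
    range(0x2B740, 0x2B820),  # Extension D
    range(0xF900, 0xFB00),    # CJK Compatibility Ideographs
    range(0x2F800, 0x2FA20),  # CJK Compatibility Ideographs Supplement
    range(0x9FA6, 0x9FCC))    # URO extensions (already covered by the first range)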

Note that this set contains more than 75,000 code points, so it may not be the most compact or efficient data structure.
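
If the size of the set is a concern, one alternative is to keep only the block boundaries and test membership against them. The following is a minimal sketch for Python 3, not part of the Unicode Standard or of any library; the names HAN_RANGES and is_han are made up for illustration, and the ranges are the ones from Table 12-2.

```python
# Block boundaries from Table 12-2 (start inclusive, stop exclusive).
# HAN_RANGES and is_han are hypothetical names used only for this sketch.
HAN_RANGES = (
    range(0x4E00, 0xA000),    # CJK Unified Ideographs
    range(0x3400, 0x4DC0),    # Extension A
    range(0x20000, 0x2A6E0),  # Extension B
    range(0x2A700, 0x2B740),  # Extension C
    range(0x2B740, 0x2B820),  # Extension D
    range(0xF900, 0xFB00),    # CJK Compatibility Ideographs
    range(0x2F800, 0x2FA20),  # CJK Compatibility Ideographs Supplement
)


def is_han(ch):
    """Return True if the single character ch lies in one of the blocks above."""
    cp = ord(ch)
    # On Python 3, "int in range" is a constant-time arithmetic check.
    return any(cp in r for r in HAN_RANGES)


print(is_han(u'\u4e2d'))  # a character from the common block
print(is_han(u'a'))       # an ASCII letter
```

This stores seven small range objects instead of tens of thousands of integers. Note that a block can contain unassigned code points, so this check (like the set above) tests block membership rather than whether a character is actually assigned.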

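Also, if you insist on calling ord() on character literals, you need the 32-bit Unicode escape form (\U followed by exactly eight hex digits) for code points above FFFF:

>>> ord(u'\U0002F800')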
194560
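
To tie this back to UTF-8 as asked in the question: the ranges above are Unicode code points, not UTF-8 byte values. Code points in the 3400–FAFF area encode to three bytes in UTF-8 and those at 20000 and above encode to four, so there is no single pair of UTF-8 byte bounds. A minimal Python 3 sketch of the conversion follows; utf8_bytes is a hypothetical helper name.

```python
def utf8_bytes(code_point):
    """Encode one Unicode code point as UTF-8 (hypothetical helper)."""
    return chr(code_point).encode('utf-8')


first_common = utf8_bytes(0x4E00)       # start of CJK Unified Ideographs
first_supplement = utf8_bytes(0x2F800)  # start of the Compatibility Supplement

print(first_common, len(first_common))          # three bytes
print(first_supplement, len(first_supplement))  # four bytes
```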