What is the corresponding Unicode bracket in Java?

4

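In Java, brackets are classified under the START_PUNCTUATION and END_PUNCTUATION character types.

Given a "[", how can I compute the corresponding "]" without using a hard-coded table?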


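You could precompile a set of matching pairs from the Unicode database by comparing the canonical names. I imagine they are fairly systematic (for example, strip "LEFT" and "RIGHT"). - Kerrek SB
2 Answers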
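The name-based idea from this comment can be sketched at run time instead of with a precompiled table, assuming Java 9 or later, where `Character.codePointOf` resolves a Unicode character name back to a code point. The class name `NameBasedMatch` and the method `closingFor` are made up for illustration. The sketch only handles opening punctuation whose name contains "LEFT"; brackets with other naming schemes (the Tibetan ones, for instance) yield -1.

```java
class NameBasedMatch {
    /**
     * Returns the code point whose Unicode name equals the name of
     * {@code open} with "LEFT" replaced by "RIGHT", or -1 if none exists.
     */
    static int closingFor(int open) {
        if (Character.getType(open) != Character.START_PUNCTUATION) {
            return -1;
        }
        String name = Character.getName(open); // null for unassigned code points
        if (name == null || !name.contains("LEFT")) {
            return -1;
        }
        try {
            // Character.codePointOf (Java 9+) throws if no character has this name.
            return Character.codePointOf(name.replace("LEFT", "RIGHT"));
        } catch (IllegalArgumentException e) {
            return -1;
        }
    }

    public static void main(String[] args) {
        int close = closingFor('[');
        System.out.println(close < 0 ? "no match" : new String(Character.toChars(close)));
    }
}
```

Whether the names are systematic enough for every bracket pair is not guaranteed, so the result is best treated as a candidate rather than a definitive answer.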

5

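Assuming that "every start punctuation character is followed, within 1 to 3 code points, by its corresponding end punctuation character, if it has one", which seems to hold, this snippet should list the matching punctuation for every possible character: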

import java.util.Objects;

public class EndPunct {
    private static final int UNICODE_MAX = Character.MAX_CODE_POINT;
    // How many code points after the opener are examined (offsets 1..3).
    private static final int LOOKAHEAD = 3;

    public static void main(String[] args) {
        // Walk every code point, including MAX_CODE_POINT itself.
        for (int i = 0; i <= UNICODE_MAX; i++) {
            if (Character.getType(i) != Character.START_PUNCTUATION) {
                continue;
            }
            Character.UnicodeBlock currentBlock = Character.UnicodeBlock.of(i);
            boolean found = false;
            int last = Math.min(UNICODE_MAX, i + LOOKAHEAD);
            for (int newchar = i + 1; newchar <= last; newchar++) {
                // Stop at a block boundary. UnicodeBlock.of may return null
                // for code points outside any block, so compare null-safely.
                if (!Objects.equals(currentBlock, Character.UnicodeBlock.of(newchar))) {
                    break;
                }
                if (Character.getType(newchar) == Character.END_PUNCTUATION) {
                    System.out.println(toChar(i) + " matches " + toChar(newchar)
                            + " (codepoints u+" + Integer.toHexString(i)
                            + " and u+" + Integer.toHexString(newchar) + ")");
                    found = true;
                    break;
                }
            }
            if (!found) {
                System.out.println("NOT FOUND for " + toChar(i)
                        + " [position u+" + Integer.toHexString(i) + "]");
            }
        }
    }

    // Converts a code point (possibly supplementary) to a printable String.
    public static String toChar(int codePoint) {
        return new String(Character.toChars(codePoint));
    }
}

As the output shows, this approach seems to work for all characters except two:

( matches ) (codepoints u+28 and u+29)
[ matches ] (codepoints u+5b and u+5d)
{ matches } (codepoints u+7b and u+7d)
༺ matches ༻ (codepoints u+f3a and u+f3b)
༼ matches ༽ (codepoints u+f3c and u+f3d)
᚛ matches ᚜ (codepoints u+169b and u+169c)
NOT FOUND for ‚ [position u+201a]
NOT FOUND for „ [position u+201e]
⁅ matches ⁆ (codepoints u+2045 and u+2046)
⁽ matches ⁾ (codepoints u+207d and u+207e)
₍ matches ₎ (codepoints u+208d and u+208e)
〈 matches 〉 (codepoints u+2329 and u+232a)
❨ matches ❩ (codepoints u+2768 and u+2769)
❪ matches ❫ (codepoints u+276a and u+276b)
❬ matches ❭ (codepoints u+276c and u+276d)
❮ matches ❯ (codepoints u+276e and u+276f)
❰ matches ❱ (codepoints u+2770 and u+2771)
❲ matches ❳ (codepoints u+2772 and u+2773)
❴ matches ❵ (codepoints u+2774 and u+2775)
⟅ matches ⟆ (codepoints u+27c5 and u+27c6)
⟦ matches ⟧ (codepoints u+27e6 and u+27e7)
⟨ matches ⟩ (codepoints u+27e8 and u+27e9)
⟪ matches ⟫ (codepoints u+27ea and u+27eb)
⟬ matches ⟭ (codepoints u+27ec and u+27ed)
⟮ matches ⟯ (codepoints u+27ee and u+27ef)
⦃ matches ⦄ (codepoints u+2983 and u+2984)
⦅ matches ⦆ (codepoints u+2985 and u+2986)
⦇ matches ⦈ (codepoints u+2987 and u+2988)
⦉ matches ⦊ (codepoints u+2989 and u+298a)
⦋ matches ⦌ (codepoints u+298b and u+298c)
⦍ matches ⦎ (codepoints u+298d and u+298e)
⦏ matches ⦐ (codepoints u+298f and u+2990)
⦑ matches ⦒ (codepoints u+2991 and u+2992)
⦓ matches ⦔ (codepoints u+2993 and u+2994)
⦕ matches ⦖ (codepoints u+2995 and u+2996)
⦗ matches ⦘ (codepoints u+2997 and u+2998)
⧘ matches ⧙ (codepoints u+29d8 and u+29d9)
⧚ matches ⧛ (codepoints u+29da and u+29db)
⧼ matches ⧽ (codepoints u+29fc and u+29fd)
⸢ matches ⸣ (codepoints u+2e22 and u+2e23)
⸤ matches ⸥ (codepoints u+2e24 and u+2e25)
⸦ matches ⸧ (codepoints u+2e26 and u+2e27)
⸨ matches ⸩ (codepoints u+2e28 and u+2e29)
〈 matches 〉 (codepoints u+3008 and u+3009)
《 matches 》 (codepoints u+300a and u+300b)
「 matches 」 (codepoints u+300c and u+300d)
『 matches 』 (codepoints u+300e and u+300f)
【 matches 】 (codepoints u+3010 and u+3011)
〔 matches 〕 (codepoints u+3014 and u+3015)
〖 matches 〗 (codepoints u+3016 and u+3017)
〘 matches 〙 (codepoints u+3018 and u+3019)
〚 matches 〛 (codepoints u+301a and u+301b)
〝 matches 〞 (codepoints u+301d and u+301e)
﴾ matches ﴿ (codepoints u+fd3e and u+fd3f)
︗ matches ︘ (codepoints u+fe17 and u+fe18)
︵ matches ︶ (codepoints u+fe35 and u+fe36)
︷ matches ︸ (codepoints u+fe37 and u+fe38)
︹ matches ︺ (codepoints u+fe39 and u+fe3a)
︻ matches ︼ (codepoints u+fe3b and u+fe3c)
︽ matches ︾ (codepoints u+fe3d and u+fe3e)
︿ matches ﹀ (codepoints u+fe3f and u+fe40)
﹁ matches ﹂ (codepoints u+fe41 and u+fe42)
﹃ matches ﹄ (codepoints u+fe43 and u+fe44)
﹇ matches ﹈ (codepoints u+fe47 and u+fe48)
﹙ matches ﹚ (codepoints u+fe59 and u+fe5a)
﹛ matches ﹜ (codepoints u+fe5b and u+fe5c)
﹝ matches ﹞ (codepoints u+fe5d and u+fe5e)
（ matches ） (codepoints u+ff08 and u+ff09)
［ matches ］ (codepoints u+ff3b and u+ff3d)
｛ matches ｝ (codepoints u+ff5b and u+ff5d)
｟ matches ｠ (codepoints u+ff5f and u+ff60)
｢ matches ｣ (codepoints u+ff62 and u+ff63)

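U+201A is SINGLE LOW-9 QUOTATION MARK and U+201E is DOUBLE LOW-9 QUOTATION MARK. Neither has a matching closing character. For all other characters the approach seems to work, and it appears to find the partner of every character that has one. There is probably no guarantee of this, however.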
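To answer the original question for a single character, the same heuristic can be wrapped in a lookup method. This is a minimal sketch, not a guaranteed algorithm: the class name `ClosingLookup`, the method `findClosing`, and the lookahead of three code points are illustrative assumptions carried over from the listing above.

```java
import java.util.Objects;

class ClosingLookup {
    // Assumption from the answer: the partner lies 1 to 3 code points ahead.
    private static final int LOOKAHEAD = 3;

    /** Returns the nearby END_PUNCTUATION code point for {@code open}, or -1. */
    static int findClosing(int open) {
        if (Character.getType(open) != Character.START_PUNCTUATION) {
            return -1;
        }
        Character.UnicodeBlock block = Character.UnicodeBlock.of(open);
        int last = Math.min(Character.MAX_CODE_POINT, open + LOOKAHEAD);
        for (int cp = open + 1; cp <= last; cp++) {
            // Never cross into a different Unicode block.
            if (!Objects.equals(block, Character.UnicodeBlock.of(cp))) {
                break;
            }
            if (Character.getType(cp) == Character.END_PUNCTUATION) {
                return cp;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        for (int open : new int[] {'[', '(', 0x300C, 0x201A}) {
            int close = findClosing(open);
            System.out.printf("U+%04X -> %s%n", open,
                    close < 0 ? "none" : String.format("U+%04X", close));
        }
    }
}
```

Returning -1 instead of throwing keeps the two low-9 quotation marks, which have no partner, easy to handle at the call site.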


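Yes, this works, but looking ahead an arbitrary number of code points (10) is unreliable. A better idea is to keep looking until you reach a character whose block (as returned by Character.UnicodeBlock.of) differs from the block of the start punctuation. - VGR
@VGR But that would give incorrect results, because it would claim that u+2046 is the closing character for all of u+201a, u+201e and u+2045. Although an arbitrary lookahead limit is unreliable, it does seem to produce correct results. Only by accident, of course. - eis
I have modified the sample code to take the block into account as an additional check, with a lookahead limit of 3. - eis
UNICODE_MAX can be replaced with Character.MAX_CODE_POINT, which has been available since Java 1.5. - A.H.
Nice work. I took the liberty of encoding the output as UTF-8, so that people with suitable fonts can see that the characters really do correspond to each other. - Joni
According to the Unicode database, 201a and 201e are not mirrored characters. I guess that is because the mirror image of a quotation mark differs between languages. - VGR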
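The objection to the block-bounded scan in the comments above can be reproduced with a few lines. The sketch below (the class `BlockScanDemo` and the method `firstCloserInBlock` are hypothetical names) scans to the end of the opener's block with no lookahead limit, and shows that three different openers in the General Punctuation block all land on the same closer.

```java
class BlockScanDemo {
    /** First END_PUNCTUATION after {@code open} inside the same Unicode block, or -1. */
    static int firstCloserInBlock(int open) {
        Character.UnicodeBlock block = Character.UnicodeBlock.of(open);
        if (block == null) {
            return -1;
        }
        for (int cp = open + 1;
             cp <= Character.MAX_CODE_POINT && block.equals(Character.UnicodeBlock.of(cp));
             cp++) {
            if (Character.getType(cp) == Character.END_PUNCTUATION) {
                return cp;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        // All three openers resolve to U+2046, which is only correct for U+2045.
        for (int open : new int[] {0x201A, 0x201E, 0x2045}) {
            System.out.printf("U+%04X -> U+%04X%n", open, firstCloserInBlock(open));
        }
    }
}
```

This is why the answer keeps a short lookahead limit in addition to the block check.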

1
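There is a character property called "Bidi Mirroring Glyph" that gives you the mirror image of a character, if one exists. The property is needed to lay out bidirectional text correctly: in right-to-left languages an opening parenthesis must open to the left, so the text layout engine has to use the glyph of the closing parenthesis ")" rather than the glyph of the character that appears in the source text.
Unfortunately, the standard Java API offers no way to read the mirroring glyph property, but the ICU4J library does, through the UCharacter.getMirror method.
A "mostly correct" alternative is to start from the given opening character, check whether one of the next few characters is closing punctuation, and assume that it is the correct mirror. Reading the mirroring data, you can see that mirrors are adjacent in most cases, with very few exceptions (one example: U+2298 CIRCLED DIVISION SLASH is the mirror of U+29B8 CIRCLED REVERSE SOLIDUS, although these characters are not in a punctuation category).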
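The "mostly correct" alternative can be tightened slightly with the standard library alone: although the JDK does not return the mirror glyph, `Character.isMirrored` does report whether a code point has the Bidi_Mirrored property. The sketch below requires that property on both ends of the pair. The class name `MirrorHeuristic`, the method `closingFor`, and the lookahead of three code points are illustrative assumptions, and the result is still a heuristic rather than a lookup in the mirroring data.

```java
class MirrorHeuristic {
    private static final int LOOKAHEAD = 3; // assumed search distance

    /** Returns a nearby mirrored END_PUNCTUATION code point, or -1 if none is found. */
    static int closingFor(int open) {
        if (Character.getType(open) != Character.START_PUNCTUATION
                || !Character.isMirrored(open)) {
            return -1; // e.g. U+201A and U+201E are not mirrored
        }
        int last = Math.min(Character.MAX_CODE_POINT, open + LOOKAHEAD);
        for (int cp = open + 1; cp <= last; cp++) {
            if (Character.getType(cp) == Character.END_PUNCTUATION
                    && Character.isMirrored(cp)) {
                return cp;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int close = closingFor('(');
        System.out.println(close < 0 ? "no match" : new String(Character.toChars(close)));
    }
}
```

The mirrored check rejects the two low-9 quotation marks before any scanning takes place.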

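START_PUNCTUATION corresponds to the Unicode general category "Ps", so if you pick the next instance of the END_PUNCTUATION category ("Pe"), why wouldn't that work? - eis
@eis For a start, there seems to be one fewer Pe character than Ps characters. Are they even specified to pair up like that? - millimoose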
2
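@Joni Note that there does seem to be a "Bidi Mirroring Glyph" property: http://www.unicode.org/Public/UNIDATA/BidiMirroring.txt That makes your answer either plainly wrong or at least misleading. (It is not clear whether this is the relationship the OP wants, but it seems to do the job, and it is not exposed by any Java API I could find.) - millimoose
Thanks for the comment. I did not know about the mirroring glyph property before, and I have now revised the answer accordingly. - Joni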
