Identifying whether text is Simplified or Traditional Chinese

Question


php unicode cjk language-detection

6

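Given a piece of text that is known to be Chinese and is encoded in UTF-8, how can I determine whether it is Simplified or Traditional?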

- philfreo

2 Answers

2

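Because big5 and gb2312 omit many variant characters that are in common use in Unicode, code that relies on an exact match between the //TRANSLIT and //IGNORE modes will fail in a lot of normal use cases. For example, it would fail to identify 説話 as Traditional Chinese, even though 説 is a common Hong Kong variant of 說, and only the latter is in big5.

A simple workaround is to make the check fuzzy: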
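To see which characters trip up the exact-match approach, each character can be probed on its own. The sketch below is only an illustration: `unconvertibleChars` is a hypothetical helper name that does not appear in the answers, it assumes PHP 7.4+ for `mb_str_split`, and the result depends on the charset tables of the local iconv implementation.

```php
<?php
// Hypothetical helper (not from the answers): lists the characters of $text
// that iconv cannot convert from UTF-8 to $charset.
function unconvertibleChars(string $text, string $charset): array
{
    $missing = [];
    // Split the UTF-8 string into individual characters.
    foreach (mb_str_split($text, 1, 'UTF-8') as $char) {
        // Without //IGNORE or //TRANSLIT, iconv returns false (and raises a
        // notice, suppressed here with @) when a character has no mapping.
        if (@iconv('UTF-8', $charset, $char) === false) {
            $missing[] = $char;
        }
    }
    return $missing;
}

// Per the answer, 説 would be expected to show up for big5;
// what is actually printed depends on the platform's iconv.
print_r(unconvertibleChars('説話', 'big5'));
print_r(unconvertibleChars('説話', 'gb2312'));
```

Listing the offending characters this way makes it easier to judge whether a text failed the exact match because of one variant character or because it really belongs to the other script.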

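function detectChineseVariant(string $text): string
{
    // //IGNORE drops characters that the target charset cannot represent.
    $test1 = iconv("UTF-8", "big5//IGNORE", $text);
    $test2 = iconv("UTF-8", "gb2312//IGNORE", $text);
    // Count characters in each result's own encoding, not as UTF-8.
    $len1 = mb_strlen($test1, "BIG-5");
    $len2 = mb_strlen($test2, "EUC-CN"); // mbstring's name for GB2312
    $len0 = mb_strlen($text, "UTF-8") * 0.8; // threshold
    if ($len1 > $len2 && $len1 > $len0) {
        return 'Likely Traditional';
    }
    if ($len2 > $len1 && $len2 > $len0) {
        return 'Likely Simplified';
    }
    return 'Could not identify';
}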
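The same fuzzy idea can also be tried with mbstring in place of iconv, which avoids depending on how the local iconv handles //IGNORE. This is a sketch under assumptions, not part of the original answer: `charsetCoverage` and `guessChineseVariant` are hypothetical names, `BIG-5` and `EUC-CN` are mbstring's encoding names (the latter standing in for gb2312), and mbstring's tables may differ from iconv's.

```php
<?php
// Hypothetical helper: fraction of the characters in $text that survive
// conversion from UTF-8 to $charset (an mbstring encoding name).
function charsetCoverage(string $text, string $charset): float
{
    $total = mb_strlen($text, 'UTF-8');
    if ($total === 0) {
        return 0.0; // avoid division by zero on empty input
    }
    // Ask mbstring to drop unconvertible characters instead of writing '?'.
    $previous = mb_substitute_character();
    mb_substitute_character('none');
    $converted = mb_convert_encoding($text, $charset, 'UTF-8');
    mb_substitute_character($previous); // restore the earlier setting
    // Measure the result in its own encoding, not as UTF-8.
    return mb_strlen($converted, $charset) / $total;
}

// Same decision rule as the answer: the better-covered charset wins,
// provided it keeps more than $threshold of the characters.
function guessChineseVariant(string $text, float $threshold = 0.8): string
{
    $big5 = charsetCoverage($text, 'BIG-5');
    $gb   = charsetCoverage($text, 'EUC-CN');
    if ($big5 > $gb && $big5 > $threshold) {
        return 'Likely Traditional';
    }
    if ($gb > $big5 && $gb > $threshold) {
        return 'Likely Simplified';
    }
    return 'Could not identify';
}

echo guessChineseVariant('説話'), "\n";
```

Text made up only of characters shared by both charsets gives equal coverage on both sides, so it falls through to 'Could not identify', just as in the answer's version.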

- Henry


- Mark Baker · Accepted Answer

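I'm not sure whether this will work, but I'd suggest using iconv to try converting the text into each character set. Run the same conversion once with //TRANSLIT and once with //IGNORE, then compare the results. If the two results match, the conversion did not hit any character it could not convert, so the text should fit that character set.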

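// If //TRANSLIT and //IGNORE give identical output, nothing was substituted or dropped.
$test1 = iconv("UTF-8", "big5//TRANSLIT", $text);
$test2 = iconv("UTF-8", "big5//IGNORE", $text);
// iconv returns false on failure; do not let two failures count as a match.
if ($test1 !== false && $test1 === $test2) {
    echo 'traditional';
} else {
    $test3 = iconv("UTF-8", "gb2312//TRANSLIT", $text);
    $test4 = iconv("UTF-8", "gb2312//IGNORE", $text);
    if ($test3 !== false && $test3 === $test4) {
        echo 'simplified';
    } else {
        echo 'Failed to match either traditional or simplified';
    }
}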
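The same check can be packaged as a reusable function. A minimal sketch, assuming the behavior described in the answer: `convertsCleanly` and `classifyChinese` are hypothetical names, a false return from iconv is treated as a failed conversion, and results will vary with the local iconv implementation.

```php
<?php
// Hypothetical helper: true when $text converts from UTF-8 to $charset
// without any character being substituted or dropped.
function convertsCleanly(string $text, string $charset): bool
{
    // @ suppresses the notice iconv raises on an unconvertible character.
    $translit = @iconv('UTF-8', $charset . '//TRANSLIT', $text);
    $ignore   = @iconv('UTF-8', $charset . '//IGNORE', $text);
    // Treat an outright failure as "does not convert cleanly".
    if ($translit === false || $ignore === false) {
        return false;
    }
    return $translit === $ignore;
}

// Tries big5 first, then gb2312, in the same order as the answer.
function classifyChinese(string $text): string
{
    if (convertsCleanly($text, 'big5')) {
        return 'traditional';
    }
    if (convertsCleanly($text, 'gb2312')) {
        return 'simplified';
    }
    return 'Failed to match either traditional or simplified';
}

echo classifyChinese('這是繁體中文'), "\n";
echo classifyChinese('这是简体中文'), "\n";
```

Because big5 is tried first, text that fits both character sets is reported as traditional, which is worth keeping in mind for short inputs.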