Check whether a string is UTF-8 encoded

Question

4

function seems_utf8($str) {
 $length = strlen($str);
 for ($i=0; $i < $length; $i++) {
  $c = ord($str[$i]);
  if ($c < 0x80) $n = 0; # 0bbbbbbb
  elseif (($c & 0xE0) == 0xC0) $n=1; # 110bbbbb
  elseif (($c & 0xF0) == 0xE0) $n=2; # 1110bbbb
  elseif (($c & 0xF8) == 0xF0) $n=3; # 11110bbb
  elseif (($c & 0xFC) == 0xF8) $n=4; # 111110bb
  elseif (($c & 0xFE) == 0xFC) $n=5; # 1111110b
  else return false; # Does not match any model
  for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
   if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
    return false;
  }
 }
 return true;
}

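I got this code from WordPress. I don't know much about it, but I would like to understand what actually happens inside this function.

If anyone knows, could you please help me?

I need a clear understanding of the code above. A line-by-line explanation would be most helpful.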

- coderex

4 Answers

8

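The algorithm basically checks whether the byte sequence conforms to the pattern you can see in the Wikipedia article. The for loop iterates over all bytes in $str. ord returns the numeric value of the current byte. That number is then tested for certain properties.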

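If the number is less than 128 (0x80), it is a single-byte character. If it is 128 or greater, the length of the multi-byte character is determined from the first byte of the sequence: if the first byte starts with 110xxxxx, it is a two-byte character; with 1110xxxx, a three-byte character; and so on.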
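The length rule described above can be isolated into a small helper. This is only a sketch: continuation_count() is a hypothetical name that does not appear in WordPress, and it simply mirrors the if/elseif chain from the question.

```php
<?php
// Hypothetical helper (not part of WordPress): return how many continuation
// bytes the lead byte $c announces, or -1 if $c cannot start a character.
function continuation_count(int $c): int {
    if ($c < 0x80) return 0;            // 0bbbbbbb: single-byte (ASCII)
    if (($c & 0xE0) == 0xC0) return 1;  // 110bbbbb
    if (($c & 0xF0) == 0xE0) return 2;  // 1110bbbb
    if (($c & 0xF8) == 0xF0) return 3;  // 11110bbb
    if (($c & 0xFC) == 0xF8) return 4;  // 111110bb (legacy 5-byte form)
    if (($c & 0xFE) == 0xFC) return 5;  // 1111110b (legacy 6-byte form)
    return -1;                          // e.g. 10bbbbbb or 0xFE/0xFF
}

echo continuation_count(ord("A")), "\n";             // 0
echo continuation_count(ord("\xC3\xA9")), "\n";      // 1 (lead byte 0xC3 of U+00E9)
echo continuation_count(ord("\xE2\x82\xAC")), "\n";  // 2 (lead byte 0xE2 of U+20AC)
echo continuation_count(0x80), "\n";                 // -1
```

Note that ord() only looks at the first byte of its argument, which is exactly the lead byte here.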

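I think the most mysterious part is expressions such as ($c & 0xE0) == 0xC0. They check whether the number, written in binary, has a specific pattern. I'll try to explain how that works using this same example.

Because every number we test against these patterns is at least 0x80, its first bit is always 1, so the pattern is already constrained to 1xxxxxxx. If we AND it bitwise with 11100000 (0xE0), we get the following result: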

  1xxxxxxx
& 11100000
= 1xx00000

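So, counting from the right (indices start at 0), the bits at positions 5 and 6 depend on the current number. For the result to equal 11000000, bit 5 must be 0 and bit 6 must be 1: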

  1xxxxxxx
& 11100000
≟ 11000000
   ↓↓
→ 110xxxxx

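This means the remaining bits of our number can be arbitrary: 110xxxxx. That is exactly the pattern the Wikipedia article gives for the first byte of a two-byte character.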
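To see the masking on a concrete byte, here is a short sketch. The value 0xC3 is chosen as an assumption for illustration (it is the first byte of U+00E9 in UTF-8), and is_two_byte_lead() is a hypothetical name.

```php
<?php
// Hypothetical helper: does $c have the form 110xxxxx?
function is_two_byte_lead(int $c): bool {
    return ($c & 0xE0) == 0xC0; // keep the top three bits, compare with 110
}

$c = 0xC3;                          // first byte of U+00E9 in UTF-8
printf("%08b\n", $c);               // 11000011
printf("%08b\n", 0xE0);             // 11100000
printf("%08b\n", $c & 0xE0);        // 11000000
var_dump(is_two_byte_lead($c));     // bool(true)
var_dump(is_two_byte_lead(0xE2));   // bool(false): 11100010 & 11100000 = 11100000
```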

Finally, the inner for loop checks the bytes that follow the first byte of a multi-byte character. Each of them must start with 10xxxxxx.
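The test applied to each following byte can be sketched on its own. is_continuation_byte() is a hypothetical name; the mask 0xC0 and the expected value 0x80 are taken from the function in the question.

```php
<?php
// Hypothetical helper: does $c have the form 10xxxxxx?
function is_continuation_byte(int $c): bool {
    return ($c & 0xC0) == 0x80; // keep the top two bits, compare with 10
}

// The euro sign U+20AC is encoded as the three bytes E2 82 AC.
var_dump(is_continuation_byte(0x82)); // bool(true)
var_dump(is_continuation_byte(0xAC)); // bool(true)
var_dump(is_continuation_byte(0xE2)); // bool(false): this is a lead byte
var_dump(is_continuation_byte(0x41)); // bool(false): plain ASCII
```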

- Gumbo

8

If you know a bit about UTF-8, the implementation is quite simple.

function seems_utf8($str) {
 # get length, for utf8 this means bytes and not characters
 $length = strlen($str);  

 # we need to check each byte in the string
 for ($i=0; $i < $length; $i++) {

  # get the byte code 0-255 of the i-th byte
  $c = ord($str[$i]);

  # The original UTF-8 design allowed 1-6 bytes per character
  # (RFC 3629 later limited it to 4). How many exactly is
  # encoded in the first byte if its value is >= 128 (highest
  # bit set). For all values <= 127, ASCII and UTF-8 are the same.
  # The number of bytes per character is stored in
  # the highest bits of the first byte of the UTF-8
  # character. The bit patterns that must be matched
  # for the different lengths are shown as comments.
  #
  # So $n will hold the number of additional bytes

  if ($c < 0x80) $n = 0; # 0bbbbbbb
  elseif (($c & 0xE0) == 0xC0) $n=1; # 110bbbbb
  elseif (($c & 0xF0) == 0xE0) $n=2; # 1110bbbb
  elseif (($c & 0xF8) == 0xF0) $n=3; # 11110bbb
  elseif (($c & 0xFC) == 0xF8) $n=4; # 111110bb
  elseif (($c & 0xFE) == 0xFC) $n=5; # 1111110b
  else return false; # Does not match any model

  # The code now checks the following additional bytes.
  # The first expression in the if checks that the byte is really inside
  # the string and that we are not running past the string end.
  # The second expression checks that the highest two bits of every
  # additional byte are 1 and 0 (0x80 after masking with 0xC0),
  # which is a requirement for all additional UTF-8 bytes.

  for ($j=0; $j<$n; $j++) { # n bytes matching 10bbbbbb follow ?
   if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
    return false;
  }
 }
 return true;
}
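To make the behaviour concrete, the sketch below runs the function on a few hand-picked inputs. The sample byte strings are assumptions chosen for illustration; the function body is repeated only so that the snippet runs on its own. The last sample shows why the name says "seems": only the bit patterns are checked, so an overlong form that RFC 3629 forbids still passes.

```php
<?php
// Copy of the function under discussion, so the sketch is self-contained.
function seems_utf8($str) {
    $length = strlen($str);
    for ($i = 0; $i < $length; $i++) {
        $c = ord($str[$i]);
        if ($c < 0x80) $n = 0;
        elseif (($c & 0xE0) == 0xC0) $n = 1;
        elseif (($c & 0xF0) == 0xE0) $n = 2;
        elseif (($c & 0xF8) == 0xF0) $n = 3;
        elseif (($c & 0xFC) == 0xF8) $n = 4;
        elseif (($c & 0xFE) == 0xFC) $n = 5;
        else return false;
        for ($j = 0; $j < $n; $j++) {
            if ((++$i == $length) || ((ord($str[$i]) & 0xC0) != 0x80))
                return false;
        }
    }
    return true;
}

// Hypothetical sample inputs, written as byte escapes.
$samples = array(
    'ascii'      => "abc",           // true
    'euro sign'  => "\xE2\x82\xAC",  // true: 1110bbbb followed by two 10bbbbbb
    'truncated'  => "\xE2\x82",      // false: one continuation byte is missing
    'stray tail' => "\x80",          // false: 10bbbbbb cannot start a character
    'overlong'   => "\xC0\x80",      // true, although RFC 3629 forbids this form
);
foreach ($samples as $label => $bytes) {
    printf("%-10s %s\n", $label, seems_utf8($bytes) ? 'true' : 'false');
}
```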

By the way, in PHP I think this is 50-100 times slower than a C function, so you should not use it on long strings or on production systems.
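One way to get a check implemented in C is the mbstring extension. The following is a minimal sketch, assuming mbstring is loaded; is_valid_utf8() is a hypothetical wrapper name. Unlike seems_utf8, mb_check_encoding validates against the actual encoding rules, so on current PHP versions it can be expected to be stricter than a pure bit-pattern test.

```php
<?php
// Hypothetical wrapper around the built-in check.
// Assumption: the mbstring extension is available.
function is_valid_utf8(string $s): bool {
    return mb_check_encoding($s, 'UTF-8');
}

var_dump(is_valid_utf8("abc"));        // bool(true)
var_dump(is_valid_utf8("\xC3\xA9"));   // bool(true): U+00E9
var_dump(is_valid_utf8("\xC3"));       // bool(false): lead byte without its continuation byte
```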

- Lothar

0

I ran into a similar problem while reading this post: mb_detect_encoding reported utf-8, but mb_check_encoding returned false...

For me, the solution was:

 $cur_encoding = mb_detect_encoding($in_str);
 if ($cur_encoding == "UTF-8" && mb_check_encoding($in_str, "UTF-8"))
     return $in_str;
 else
     return utf8_encode($in_str);

Taken from here:

http://board.phpbuilder.com/showthread.php?10368156-mb_check_encoding%28-in_str-quot-UTF-8-quot-%29-return-different-results

Sorry, I couldn't post the link properly....
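The snippet above is a fragment of a function body. A self-contained variant might look like the sketch below. ensure_utf8() is a hypothetical name, and the sketch assumes that any input which is not valid UTF-8 is ISO-8859-1, which is the same assumption utf8_encode() makes. Since utf8_encode() is deprecated as of PHP 8.2, mb_convert_encoding() is swapped in here.

```php
<?php
// Hypothetical helper. Assumptions: mbstring is loaded, and any input
// that is not valid UTF-8 is ISO-8859-1 (what utf8_encode() assumes too).
function ensure_utf8(string $in_str): string {
    if (mb_check_encoding($in_str, 'UTF-8')) {
        return $in_str; // already valid, leave untouched
    }
    return mb_convert_encoding($in_str, 'UTF-8', 'ISO-8859-1');
}

echo bin2hex(ensure_utf8("\xE9")), "\n";      // c3a9: Latin-1 byte E9 becomes U+00E9
echo bin2hex(ensure_utf8("\xC3\xA9")), "\n";  // c3a9: valid UTF-8 is returned as is
```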

- womd

Content provided by Stack Overflow.

- bisko · Accepted Answer

I have two ways to check whether a string is UTF-8 encoded (which one I use depends on the situation):

mb_internal_encoding('UTF-8'); // always needed before mb_ functions, check note below
if (mb_strlen($string) != strlen($string)) {
 /// not single byte
}

-- OR --

if (preg_match('!\S!u', $string)) {
 // utf8
}

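About mb_internal_encoding: because of some PHP bug that I don't know the details of (versions below 5.3; not tested on 5.3), passing the encoding as a parameter to the mb_ functions has no effect, so the internal encoding needs to be set before using any mb_ function.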
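The two checks answer slightly different questions, which the sketch below tries to show. The helper names are hypothetical. The first check only reports whether multi-byte characters are present, so a pure ASCII string (which is also valid UTF-8) does not trigger it. The second relies on preg_match() returning false when the subject of a /u pattern is not valid UTF-8; with the pattern '!\S!u' a string containing only whitespace yields 0, so an empty pattern is used here instead.

```php
<?php
// Hypothetical helper: true if $s contains at least one multi-byte character.
// Assumption: mbstring is loaded; the encoding is passed explicitly.
function has_multibyte(string $s): bool {
    return mb_strlen($s, 'UTF-8') != strlen($s);
}

// Hypothetical helper: true if PCRE accepts $s as valid UTF-8.
function pcre_valid_utf8(string $s): bool {
    return preg_match('//u', $s) === 1; // false is returned for invalid UTF-8
}

var_dump(has_multibyte("abc"));          // bool(false): ASCII only
var_dump(has_multibyte("\xC3\xA9"));     // bool(true): 1 character, 2 bytes
var_dump(pcre_valid_utf8("abc"));        // bool(true)
var_dump(pcre_valid_utf8("\xC3"));       // bool(false): truncated sequence
var_dump(preg_match('!\S!u', "   "));    // int(0): valid UTF-8, but no non-space character
```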