Get the length of a multibyte UTF-8 sequence

Question

c utf-8

3

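I am parsing some UTF-8 text, but I am only interested in characters in the ASCII range, which means I can skip over multibyte sequences.

I can easily detect the start of a sequence because the sign bit is set, so the char value is < 0. But how can I tell how many bytes the sequence contains, so that I can skip it?

I don't need to perform any validation, i.e. I can assume the input is valid UTF-8.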

- CodeClown42

Keep in mind that char may be signed or unsigned depending on the compiler. If the high bit of a character ch is set, that can mean either ch < 0 or ch >= 128. - Adrian McCarthy
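
A minimal sketch of a test that sidesteps the signedness issue raised in this comment. The helper name `is_non_ascii` is hypothetical and not from the original thread; converting to unsigned char first makes the comparison behave the same whether plain char is signed or unsigned.

```c
#include <stdbool.h>

/* Hypothetical helper (not from the original thread): true when the byte
   has its high bit set, whether plain char is signed or unsigned. */
static bool is_non_ascii(char ch) {
    return (unsigned char)ch >= 0x80;
}
```

With this, `ch < 0` in the question can be replaced by `is_non_ascii(ch)` without depending on the compiler's choice for char.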

2 Answers

5

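Although Deduplicator's answer is better suited to the specific purpose of skipping multibyte sequences, if you need the length of each such character, pass the first byte to this function: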

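/* Expects the lead byte of a multibyte sequence (0xC0..0xF7);
   ASCII and continuation bytes give meaningless results. */
int getUTF8SequenceLength (unsigned char firstPoint) {
    firstPoint >>= 4;                /* keep the high nibble */
    firstPoint &= 7;                 /* drop the top bit, always set here */
    if (firstPoint == 4) return 2;   /* 1100: two-byte sequence */
    return firstPoint - 3;           /* 5 -> 2 (1101), 6 -> 3, 7 -> 4 */
}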

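This returns the total length of the sequence, including the first byte. I use an unsigned char value for the firstPoint parameter here for clarity, but note that the function works exactly the same way if the parameter is a signed char.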
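
As a rough sketch of how the returned length might be used for the skipping described in the question: the helper `count_ascii` below is hypothetical, and it relies on the question's stated assumption that the input is valid, NUL-terminated UTF-8. The answer's function is repeated so the block compiles on its own.

```c
#include <stddef.h>

/* Same logic as the answer's function, repeated so this sketch is
   self-contained. Only meaningful for lead bytes of multibyte sequences. */
static int getUTF8SequenceLength(unsigned char firstPoint) {
    firstPoint >>= 4;
    firstPoint &= 7;
    if (firstPoint == 4) return 2;
    return firstPoint - 3;
}

/* Hypothetical helper: counts the ASCII characters in a valid UTF-8
   string, stepping over each multibyte sequence in one jump. */
static size_t count_ascii(const char *s) {
    size_t n = 0;
    while (*s != '\0') {
        unsigned char c = (unsigned char)*s;
        if (c < 0x80) {
            n++;                              /* ASCII: count it */
            s++;
        } else {
            s += getUTF8SequenceLength(c);    /* skip the whole sequence */
        }
    }
    return n;
}
```

On invalid input (for example a truncated sequence at the end of the string) the jump could run past the terminator, so this is only appropriate when validity is guaranteed.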

Explanation:

UTF-8 uses bits 5, 6, and 7 of the first byte of a sequence (counting from 1 at the least significant bit; bit 8 is always set in such a byte) to indicate the remaining length. If all three are set, the sequence has 3 additional bytes. If only the first two from the left are set, the sequence has 2 additional bytes. If the first from the left (the 7th bit) is set and the second is clear, the sequence has 1 additional byte; in that case the third bit already belongs to the character data and may be 0 or 1. Hence, we want to examine these three bits (the value here is just an example):
```
 11110111
  ^^^
```
The value is shifted down by 4 and then ANDed with 7. This leaves the 1st, 2nd, and 3rd bits from the right as the only ones that can be set. The values of these bits are 1, 2, and 4 respectively.
```
00000111
     ^^^ 
```
If the value is now 4, only the first bit from the left (of the three we are considering) is set, and we can return 2.
Otherwise the value is 7, meaning all three bits are set, so the sequence is 4 bytes in total; or 6, meaning the first two from the left are set, so the sequence is 3 bytes in total; or 5, which is a 2-byte lead whose lowest examined bit happens to be a set data bit. Subtracting 3 gives the right total in all three cases.

This covers the range of valid Unicode characters representable in UTF-8.
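
For comparison, the same mapping can be spelled out pattern by pattern. This is only an illustrative sketch with a hypothetical function name, not the answer's method; unlike the shift-and-mask version, it also reports ASCII bytes as length 1 and returns 0 for bytes that cannot start a sequence.

```c
/* Hypothetical alternative: total sequence length from the lead byte,
   matching each UTF-8 lead-byte pattern explicitly. */
static int utf8_length_by_pattern(unsigned char lead) {
    if ((lead & 0x80) == 0x00) return 1;  /* 0xxxxxxx: ASCII */
    if ((lead & 0xE0) == 0xC0) return 2;  /* 110xxxxx */
    if ((lead & 0xF0) == 0xE0) return 3;  /* 1110xxxx */
    if ((lead & 0xF8) == 0xF0) return 4;  /* 11110xxx */
    return 0;                             /* continuation byte or invalid lead */
}
```

The masks make the roles of the bits visible: the number of leading 1 bits before the first 0 is the total length of the sequence.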

- CodeClown42

Content provided by Stack Overflow.

- Deduplicator · Accepted Answer

5

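Just strip out every byte that is not a valid ASCII character, and do not try to interpret bytes greater than 127 in any way. This works as long as you don't have any combining sequences whose base character is in the ASCII range. For those, you would have to interpret the code points themselves.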
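
A minimal sketch of this approach, assuming a NUL-terminated, modifiable buffer; the helper name `strip_non_ascii` is hypothetical. Every byte of a multibyte UTF-8 sequence has its high bit set, so keeping only bytes below 0x80 removes whole sequences without computing their lengths.

```c
#include <stddef.h>

/* Hypothetical helper: removes every byte >= 0x80 from s in place and
   returns the new length. No UTF-8 decoding is attempted. */
static size_t strip_non_ascii(char *s) {
    char *out = s;
    for (const char *in = s; *in != '\0'; in++) {
        if ((unsigned char)*in < 0x80) {
            *out++ = *in;   /* keep ASCII bytes only */
        }
    }
    *out = '\0';
    return (size_t)(out - s);
}
```

As the answer notes, an ASCII base character followed by a combining mark would keep its base letter and lose the mark.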

- Deduplicator

That is not a very good idea. In this case you might end up including characters from inside a UTF-8 sequence. - nothrow

1

@Yossarian: Please give an example. As far as I know, UTF-8 explicitly makes your scenario impossible. - Deduplicator

1

@Yossarian Every UTF-8 sequence made up of more than 1 byte (2, 3, 4) consists only of bytes with the most significant bit set. - chux - Reinstate Monica

@chux: With NFD, or with characters carrying multiple diacritics, it is possible. It is rarely used in real life, though. - Deduplicator

1

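@Deduplicator I see your concern about Unicode equivalence: an 'é' can be either the single Unicode code point U+00E9 or an 'e' (U+0065) followed by '◌́' (U+0301). So if we only care about non-combining code points, your solution works well. I think it also meets the OP's goal. Still like your solution best. - chux - Reinstate Monica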
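
To make the NFD case concrete: decomposed 'é' is the bytes 0x65 0xCC 0x81, so an ASCII 'e' is followed by a two-byte combining mark. The sketch below is a hypothetical check (the function name is an assumption) limited to the Combining Diacritical Marks block, U+0300 to U+036F; other combining characters exist and are not covered.

```c
#include <stdbool.h>

/* Hypothetical helper: true if the two bytes at p encode a code point in
   U+0300..U+036F (Combining Diacritical Marks). p must point into a
   NUL-terminated buffer. */
static bool is_combining_diacritic(const unsigned char *p) {
    if (p[0] == 0xCC) return p[1] >= 0x80 && p[1] <= 0xBF;  /* U+0300..U+033F */
    if (p[0] == 0xCD) return p[1] >= 0x80 && p[1] <= 0xAF;  /* U+0340..U+036F */
    return false;
}
```

A parser that wants to treat "ASCII letter plus combining mark" as non-ASCII could call this on the bytes following each ASCII character before accepting it.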

1

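@goldilocks and Yossarian: I need to amend my assertion in light of the Unicode equivalence of UTF-8 sequences. A non-combined UTF-8 sequence made up of multiple bytes (2, 3, 4) contains only bytes with the most significant bit (MSBit) set. - chux - Reinstate Monica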