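I am parsing some UTF-8 text but am only interested in characters in the ASCII range, which means I can simply skip multibyte sequences.
I can easily detect the start of a sequence because the sign bit is set, so the char value is < 0. But how can I tell how many bytes are in the sequence so that I can skip over it?
I do not need to perform any validation, i.e. I can assume the input is valid UTF-8.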
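Just strip out every byte that is not a valid ASCII character, and do not try to interpret bytes greater than 127 in any way. This works as long as you do not have any combining sequences whose base character is in the ASCII range. For those, you would need to interpret the code points themselves.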
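As a rough sketch of that approach, assuming a NUL-terminated input and a destination buffer at least as large as the source: in valid UTF-8 every byte of a multibyte sequence has its high bit set, so keeping only bytes below 0x80 drops whole sequences without ever needing their length. The name keepAsciiOnly is hypothetical, not something from the question.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: copies only the ASCII bytes (0x00..0x7F) of the
   NUL-terminated string src into dst and returns the number of bytes kept.
   dst must have room for at least strlen(src) + 1 bytes. */
size_t keepAsciiOnly(const char *src, char *dst) {
    size_t kept = 0;
    assert(src != NULL && dst != NULL);
    for (; *src != '\0'; ++src) {
        /* Cast first so the comparison does not depend on char signedness. */
        if ((unsigned char)*src < 0x80) {
            dst[kept++] = *src;
        }
    }
    dst[kept] = '\0';
    return kept;
}
```

Lead bytes and continuation bytes are treated identically here, which is why no length calculation is required.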
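Although Deduplicator's answer is better suited to the specific purpose of skipping multibyte sequences, if you need the length of each such character, pass the first byte to this function: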
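/* Returns the total length in bytes of a UTF-8 multibyte sequence, given
   its first byte. Only meaningful for lead bytes (0xC0 and above). */
int getUTF8SequenceLength (unsigned char firstPoint) {
    firstPoint >>= 4;               /* move bits 4..6 down to bits 0..2 */
    firstPoint &= 7;                /* keep only those three bits */
    if (firstPoint == 4) return 2;  /* 1100xxxx: two-byte sequence */
    return firstPoint - 3;          /* 5 -> 2, 6 -> 3, 7 -> 4 */
}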
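This returns the total length of the sequence, including the first byte. I am using an unsigned char value as the firstPoint parameter here for clarity, but note that in practice the function works the same way if the parameter is a signed char, because the mask with 7 discards everything except the three bits of interest. The function assumes firstPoint is the lead byte of a multibyte sequence; it does not return a meaningful length for an ASCII byte or a continuation byte.
Explanation: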
UTF-8 uses bits 5, 6, and 7 of the first byte of a sequence (counting from 1 at the least significant bit) to indicate the remaining length. If all three are set, the sequence has 3 additional bytes. If the 7th bit is set and the 6th is clear, the sequence has 1 additional byte; in that case the 5th bit is already part of the character data and may be either 0 or 1. If the 7th and 6th bits are set and the 5th is clear, the sequence has 2 additional bytes. Hence, we want to examine these three bits (the value here is just an example):
    11110111
     ^^^
The value is shifted down by 4 and then ANDed with 7. This leaves the 1st, 2nd, and 3rd bits from the right as the only ones that can be set. The values of these bits are 1, 2, and 4 respectively.
    00000111
         ^^^
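If the value is now 4, only the first bit from the left (of the three we are considering) is set, and we can return 2. This case needs its own branch because 4 - 3 would give 1.
After this, the value is 5, 6, or 7, and subtracting 3 gives the total length. A value of 5 is the other two-byte case (the lowest of the three bits is character data that happens to be 1) and yields 2. A value of 6 means the first two bits from the left are set, so the sequence is 3 bytes in total. A value of 7 means all three bits are set, so the sequence is 4 bytes in total.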
This covers the range of valid Unicode characters representable in UTF-8.
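For illustration, here is one way the function might be used to step over multibyte sequences while scanning. This is only a sketch under the question's assumption that the input is valid, NUL-terminated UTF-8; countAsciiChars is a hypothetical name, and the length function is repeated so the snippet compiles on its own.

```c
#include <assert.h>
#include <stddef.h>

/* The function from the answer, repeated so the sketch is self-contained. */
int getUTF8SequenceLength(unsigned char firstPoint) {
    firstPoint >>= 4;
    firstPoint &= 7;
    if (firstPoint == 4) return 2;
    return firstPoint - 3;
}

/* Hypothetical helper: counts the ASCII characters in a NUL-terminated,
   valid UTF-8 string, stepping over each multibyte sequence as a unit.
   On invalid or truncated input this could read past the terminator. */
size_t countAsciiChars(const char *s) {
    size_t count = 0;
    assert(s != NULL);
    while (*s != '\0') {
        unsigned char b = (unsigned char)*s;
        if (b < 0x80) {
            ++count;                        /* single-byte (ASCII) character */
            ++s;
        } else {
            s += getUTF8SequenceLength(b);  /* b is a lead byte in valid input */
        }
    }
    return count;
}
```

Because the scan only ever lands on ASCII bytes or lead bytes, the length function is never called with a continuation byte, which is the case it does not handle.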
char may be signed or unsigned, depending on the compiler. If a character ch has its high bit set, that may mean ch < 0 or it may mean ch >= 128. - Adrian McCarthy
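A minimal sketch of a test that sidesteps the signedness issue: converting to unsigned char before inspecting the high bit gives the same answer whether plain char is signed or unsigned. The names hasHighBit and checkHighBitTest are hypothetical.

```c
#include <assert.h>

/* Hypothetical helper: returns 1 if the high bit of ch is set (the byte
   belongs to a multibyte UTF-8 sequence), 0 if ch is plain ASCII. */
int hasHighBit(char ch) {
    return ((unsigned char)ch & 0x80) != 0;
}

/* Small self-check: the result is the same under either char signedness. */
void checkHighBitTest(void) {
    assert(!hasHighBit('A'));
    assert(hasHighBit('\xC3'));
}
```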