Invalid UTF-8 bytes

Question

5

Not all sequences of bytes are valid UTF-8. A UTF-8 decoder should be prepared for:

1. the red invalid bytes in the above table
2. an unexpected continuation byte
3. a start byte not followed by enough continuation bytes
4. an Overlong Encoding as described above
5. A 4-byte sequence (starting with 0xF4) that decodes to a value greater than U+10FFFF
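
To make the five cases above concrete, here is a minimal sketch of a strict decoder in Python. The function name `decode_utf8` is made up for illustration and is not a library API. It covers only the five conditions in the list, so it does not reject encoded surrogates (U+D800 to U+DFFF), which a complete validator would also refuse.

```python
def decode_utf8(data: bytes) -> list[int]:
    """Decode bytes into a list of code points, raising ValueError on bad input.

    Hypothetical helper for illustration; checks only the five listed cases.
    """
    code_points = []
    i = 0
    while i < len(data):
        b = data[i]
        if b < 0x80:
            # Plain ASCII: one byte, one code point.
            code_points.append(b)
            i += 1
            continue
        if b < 0xC0:
            # Case 2: 0x80..0xBF is a continuation byte with no start byte.
            raise ValueError(f"unexpected continuation byte 0x{b:02X} at {i}")
        if b in (0xC0, 0xC1) or b > 0xF4:
            # Case 1: bytes that can never appear in valid UTF-8.
            raise ValueError(f"invalid byte 0x{b:02X} at {i}")
        if b < 0xE0:
            length, cp, minimum = 2, b & 0x1F, 0x80
        elif b < 0xF0:
            length, cp, minimum = 3, b & 0x0F, 0x800
        else:
            length, cp, minimum = 4, b & 0x07, 0x10000
        tail = data[i + 1 : i + length]
        if len(tail) < length - 1 or any((t & 0xC0) != 0x80 for t in tail):
            # Case 3: the start byte is not followed by enough continuation bytes.
            raise ValueError(f"start byte 0x{b:02X} at {i} lacks continuation bytes")
        for t in tail:
            cp = (cp << 6) | (t & 0x3F)
        if cp < minimum:
            # Case 4: the value would have fit in a shorter sequence.
            raise ValueError(f"overlong encoding at {i}")
        if cp > 0x10FFFF:
            # Case 5: only reachable with a 0xF4 start byte.
            raise ValueError(f"code point above U+10FFFF at {i}")
        code_points.append(cp)
        i += length
    return code_points
```

With this sketch, `decode_utf8(b"\xc3\x80")` returns `[0xC0]`, while `decode_utf8(b"\xc0\x80")` raises `ValueError`.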

According to the code page layout, 0xC0 and 0xC1 are invalid and must never appear in a valid UTF-8 sequence. But here is what I get for code points 0xC0 and 0xC1:

Byte 2   Byte 1      Num   Char
11000011 10000000    192   À
11000011 10000001    193   Á

These byte sequences map to characters, although they should not. Am I doing something wrong?

- Hamid Sarfraz

2

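You are confusing code points with code units. - nwellnhof

The two lines contain {xc3+x80} -> xc0 -> 192 and {xc3+x81} -> xc1 -> 193 (you appear to have swapped byte2 and byte1). - wildplasser

@wildplasser, you can work that out from the bit sequences. - Hamid Sarfraz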

1 Answer

- deceze · Accepted Answer

You are simply mixing up terms:

The code point U+00C0 is the character "À", and U+00C1 is "Á".
In UTF-8 they are encoded as the byte sequences C3 80 and C3 81, respectively.
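
As a rough illustration of where C3 80 and C3 81 come from, the sketch below applies the two-byte pattern `110xxxxx 10xxxxxx` by hand and compares the result with Python's built-in encoder. The helper name `encode_two_byte` is made up for this example.

```python
def encode_two_byte(code_point: int) -> bytes:
    """Encode a code point in U+0080..U+07FF as 110xxxxx 10xxxxxx."""
    if not 0x80 <= code_point <= 0x7FF:
        raise ValueError("outside the two-byte range")
    lead = 0xC0 | (code_point >> 6)            # top 5 bits go into the start byte
    continuation = 0x80 | (code_point & 0x3F)  # low 6 bits go into the continuation byte
    return bytes([lead, continuation])


for cp in (0xC0, 0xC1):
    encoded = encode_two_byte(cp)
    # The hand-rolled result matches the standard library encoder.
    assert encoded == chr(cp).encode("utf-8")
    print(f"U+{cp:04X} {chr(cp)} -> {encoded.hex(' ').upper()}")
# prints:
# U+00C0 À -> C3 80
# U+00C1 Á -> C3 81
```

The value 0xC0 here is a code point (a number identifying a character), and the byte that ends up in the output is C3, not C0.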

The bytes C0 and C1 must never appear in UTF-8 encoded data.
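
The reason is that a start byte of C0 or C1 could only begin a two-byte sequence whose value is below U+0080, which would be an overlong encoding of an ASCII character. A quick way to check this, assuming Python's built-in strict UTF-8 codec (the helper name `is_valid_utf8` is made up here):

```python
def is_valid_utf8(raw: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# C0 and C1 are rejected whether or not a continuation byte follows.
for raw in (b"\xc0", b"\xc1", b"\xc0\x80", b"\xc1\xbf"):
    print(raw.hex(" ").upper(), "valid" if is_valid_utf8(raw) else "rejected")

# The byte sequences for the code points U+00C0 and U+00C1 are fine.
for raw in (b"\xc3\x80", b"\xc3\x81"):
    print(raw.hex(" ").upper(), "valid" if is_valid_utf8(raw) else "rejected")
```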

Code points identify characters independently of bytes. Bytes are just bytes.