Correctly splitting a Unicode string by byte count

Question

3

I want to truncate a Unicode string so that its UTF-8 encoding is at most 255 bytes, and get the result back as Unicode:

# s = arbitrary-length-unicode-string
s.encode('utf-8')[:255].decode('utf-8')

The problem with this snippet is that if the 255th byte is the first half of a 2-byte Unicode character, I get an error:

UnicodeDecodeError: 'utf8' codec can't decode byte 0xd0 in position 254: unexpected end of data

Even if I handle the error, I end up with unwanted garbage at the end of the string.
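The failure is easy to reproduce. A minimal Python 3 sketch, assuming a test string of 254 ASCII letters followed by one Cyrillic letter (the input is an illustrative assumption, chosen so that the cut lands on byte 0xd0 as in the traceback above):

```python
# 254 one-byte characters plus U+0430 (Cyrillic "a"), which encodes as D0 B0.
s = "a" * 254 + "\u0430"
data = s.encode("utf-8")   # 256 bytes in total
cut = data[:255]           # the slice ends on the lone start byte 0xd0

try:
    cut.decode("utf-8")
    failed = False
except UnicodeDecodeError as exc:
    failed = True
    err_start = exc.start  # index of the byte the decoder choked on
```

Here the slice keeps the start byte of the last character but drops its continuation byte, so decoding cannot succeed.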

How can I solve this more elegantly?

- theta

2

I have seen this exact question answered before; let me find the duplicate. - Martijn Pieters

1

You are right. Here: https://dev59.com/Zm025IYBdhLWcg3wX070 - theta

@theta: That makes it even easier. :-P - Martijn Pieters

1 Answer

- Mark Ransom · Accepted Answer

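One very nice property of UTF-8 is that trailing (continuation) bytes are easy to tell apart from start bytes: a continuation byte always has the bit pattern 10xxxxxx. Just work backwards from the cut point until you have removed a start byte.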
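To illustrate that property, here is a small Python 3 sketch that classifies each byte of an encoded string. The sample string and the helper name `is_continuation` are illustrative assumptions, not part of the answer.

```python
def is_continuation(byte):
    """True if byte (an int in 0..255) matches the UTF-8 continuation pattern 10xxxxxx."""
    return (byte & 0xC0) == 0x80

# 'a' is 1 byte (61), U+00E9 is 2 bytes (C3 A9), U+20AC is 3 bytes (E2 82 AC).
data = "a\u00e9\u20ac".encode("utf-8")
flags = [is_continuation(b) for b in data]
print(flags)  # [False, False, True, False, True, True]
```

Every `False` marks the first byte of a character, so a cut is safe exactly when the first dropped byte is flagged `False`.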

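# Keep one byte past the limit so we can tell whether the cut splits a character.
# bytearray indexes as ints on both Python 2 and Python 3, so no ord() is needed.
trunc_s = bytearray(s.encode('utf-8')[:256])
if len(trunc_s) > 255:
    final = -1
    # Skip back over continuation bytes (10xxxxxx) to reach a start byte.
    while trunc_s[final] & 0xc0 == 0x80:
        final -= 1
    # Drop that start byte and everything after it.
    trunc_s = trunc_s[:final]
trunc_s = trunc_s.decode('utf-8')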
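The same idea can be wrapped in a reusable function. This is a minimal Python 3 sketch, not the answerer's code: the name `truncate_utf8` and the `max_bytes` parameter are hypothetical, introduced here for illustration.

```python
def truncate_utf8(text, max_bytes=255):
    """Return the longest prefix of text whose UTF-8 encoding fits in max_bytes."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text
    cut = max_bytes
    # data[cut] is the first byte that would be dropped. If it is a
    # continuation byte (10xxxxxx), the cut splits a character, so move
    # back until the cut sits on that character's start byte.
    while cut > 0 and (data[cut] & 0xC0) == 0x80:
        cut -= 1
    return data[:cut].decode("utf-8")


# Usage: 254 ASCII letters plus one 2-byte character is 256 bytes,
# so the last character is dropped as a whole.
short = truncate_utf8("a" * 254 + "\u0430")
```

Looking at the first dropped byte, rather than the last kept one, means a character that ends exactly at the limit is preserved.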

Edit: also take a look at the answers to the question marked as a duplicate.
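A different technique that is sometimes used for the same task is to let the decoder discard the incomplete tail via the standard `errors="ignore"` handler. This Python 3 sketch is an assumption about what may be useful here, not a summary of the duplicate question; the test string is illustrative.

```python
s = "a" * 254 + "\u0430"  # 256 bytes when encoded as UTF-8

# The slice ends on a lone start byte; errors="ignore" drops bytes that
# cannot be decoded instead of raising UnicodeDecodeError.
truncated = s.encode("utf-8")[:255].decode("utf-8", errors="ignore")
```

Because the bytes come straight from `encode`, the only undecodable part is the truncated final character, so nothing else is lost. On arbitrary input, `errors="ignore"` would also silently drop any other malformed bytes.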