How to convert UTF-8 byte offsets to UTF-8 character offsets

Question


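I need to process the output of a legacy tool that reports UTF-8 byte offsets instead of UTF-8 character offsets. For example, for the 5 characters in the seven-byte UTF-8 string 'aβgδe', it reports [0, 1, 3, 4, 6] instead of [0, 1, 2, 3, 4], because the Greek letters 'β' and 'δ' are each encoded as a two-byte sequence. (The actual text may also contain 3-byte and 4-byte UTF-8 sequences.)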

Is there a built-in Python function that converts UTF-8 byte offsets to UTF-8 character offsets?

- Nemo XXX

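I'm not sure I understand correctly. Why not use b-strings? Then you have byte offsets. On the Python side you just use a str (so you have "character offsets", but the string is not UTF-8). You can then decode/encode whenever you need the right index (if the extra CPU time is not a problem). Otherwise, you can build an offset table by encoding one character at a time and checking its length: a simple list comprehension, which works well as long as the strings are not huge (such as books or large files). - Giacomo Catenazzi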
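The decode-on-demand idea from this comment could be sketched roughly as follows. The helper name `byte_to_char_offset` is made up for illustration, and the sketch assumes the text is available as valid UTF-8 bytes and that every reported offset falls on a character boundary.

```python
def byte_to_char_offset(encoded: bytes, byte_offset: int) -> int:
    """Count the characters encoded in the first `byte_offset` bytes."""
    # Decoding the prefix raises UnicodeDecodeError if the offset
    # splits a multi-byte sequence.
    return len(encoded[:byte_offset].decode('utf-8'))


encoded = 'aβgδe'.encode('utf-8')
print([byte_to_char_offset(encoded, o) for o in [0, 1, 3, 4, 6]])
# prints [0, 1, 2, 3, 4]
```

Each lookup decodes the whole prefix again, so this is probably only reasonable for short strings or a handful of offsets.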

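When I had a similar problem, I found no other way than encoding the string to UTF-8 and then building a byte-to-character offset table; it is implemented here (https://github.com/lfurrer/bconv/blob/f7418a8fdb772ca1b086c52e6db57a2758b82c44/bconv/fmt/bioc.py#L581-L586). - lenz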

@lenz Apparently there is no way around building a byte-to-character offset table. Could you add your code as an answer so that I can upvote it? - Nemo XXX

1 Answer

- lenz · Accepted Answer

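I don't think there is a built-in or standard-library utility for this, but you can write your own small function that builds a mapping from byte offsets to code point offsets.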

Naive approach

import typing as t

def map_byte_to_codepoint_offset(text: str) -> t.Dict[int, int]:
    mapping = {}
    byte_offset = 0
    for codepoint_offset, character in enumerate(text):
        mapping[byte_offset] = codepoint_offset
        byte_offset += len(character.encode('utf8'))
    return mapping

Let's test it with your example:

>>> text = 'aβgδe'
>>> byte_offsets = [0, 1, 3, 4, 6]
>>> mapping = map_byte_to_codepoint_offset(text)
>>> mapping
{0: 0, 1: 1, 3: 2, 4: 3, 6: 4}
>>> [mapping[o] for o in byte_offsets]
[0, 1, 2, 3, 4]

Optimisation

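I haven't benchmarked this, but calling .encode() on every character separately is probably not very efficient. Besides, we are only interested in the byte length of the encoded character, which can take only one of four values, each corresponding to a contiguous range of code points. To get these ranges, you can study the UTF-8 encoding specification, look them up on the internet, or run a quick calculation in the Python REPL: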

>>> import sys
>>> bins = {i: [] for i in (1, 2, 3, 4)}
>>> for codepoint in range(sys.maxunicode+1):
...     # 'surrogatepass' required to allow encoding surrogates in UTF-8
...     length = len(chr(codepoint).encode('utf8', errors='surrogatepass'))
...     bins[length].append(codepoint)
...
>>> for l, cps in bins.items():
...     print(f'{l}: {hex(min(cps))}..{hex(max(cps))}')
...
1: 0x0..0x7f
2: 0x80..0x7ff
3: 0x800..0xffff
4: 0x10000..0x10ffff
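As a minimal sketch, the four ranges printed above can be turned into a plain comparison chain that avoids calling .encode() altogether. The function name `utf8_length` is hypothetical and not part of any library.

```python
def utf8_length(codepoint: int) -> int:
    """Return the number of bytes UTF-8 uses for the given code point."""
    if codepoint < 0x80:
        return 1
    if codepoint < 0x800:
        return 2
    if codepoint < 0x10000:
        return 3
    return 4


print([utf8_length(ord(c)) for c in 'aβgδe'])
# prints [1, 2, 1, 2, 1]
```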

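Furthermore, the mapping returned by the naive approach has gaps: if we look up an offset that falls in the middle of a multi-byte character, we get a KeyError (for instance, there is no key 2 in the example above). To avoid this, we can fill the gaps by repeating the code point offset. Since the resulting indices are consecutive integers starting from 0, we can use a list instead of a dict for the mapping.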

# Smallest code points whose UTF-8 encoding needs 2, 3 and 4 bytes.
TWOBYTES = 0x80
THREEBYTES = 0x800
FOURBYTES = 0x10000

def map_byte_to_codepoint_offset(text: str) -> t.List[int]:
    mapping = []
    for codepoint_offset, character in enumerate(text):
        mapping.append(codepoint_offset)
        codepoint = ord(character)
        # Each threshold reached means one more encoded byte, so repeat the offset.
        for cue in (TWOBYTES, THREEBYTES, FOURBYTES):
            if codepoint >= cue:
                mapping.append(codepoint_offset)
            else:
                break
    return mapping

Using the example from before:

>>> mapping = map_byte_to_codepoint_offset(text)
>>> mapping
[0, 1, 1, 2, 3, 3, 4]
>>> [mapping[o] for o in byte_offsets]
[0, 1, 2, 3, 4]
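One limitation of the list-based mapping is that it has exactly one entry per byte, so a byte offset equal to the total byte length (for example an exclusive span end) raises an IndexError. Assuming the legacy tool reports such end offsets, which the question does not state, a variant with a trailing sentinel entry might look like the sketch below. The name `build_offset_table` and the sample spans are made up for illustration.

```python
from typing import List


def build_offset_table(text: str) -> List[int]:
    """Map every UTF-8 byte offset, including the end offset, to a character offset."""
    table: List[int] = []
    for char_offset, character in enumerate(text):
        # One entry per encoded byte, so offsets inside a character still resolve.
        table.extend([char_offset] * len(character.encode('utf-8')))
    # Sentinel: the offset just past the last byte maps to len(text).
    table.append(len(text))
    return table


table = build_offset_table('aβgδe')
byte_spans = [(1, 3), (4, 7)]  # hypothetical (start, end) spans with exclusive ends
char_spans = [(table[start], table[end]) for start, end in byte_spans]
print(char_spans)
# prints [(1, 2), (3, 5)]
```

This sketch uses .encode() per character for brevity; the threshold comparison from the answer could be substituted without changing the result.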