Processing decimal escapes in a string

Question

python python-3.x escaping

3

I have a file with one string per line, in which non-ASCII characters have been escaped using decimal code points. An example is:

mj\\195\\164ger

(The double backslashes are in the file exactly as printed.)

I would like to process this string to produce

mjäger

Normally Python uses hexadecimal escapes rather than decimal ones (for example, the string above would be written as mj\xc3\xa4ger), and Python can decode that form:

>>> by=b'mj\xc3\xa4ger'
>>> by.decode('utf-8')
'mjäger'

However, Python does not recognize decimal escapes out of the box.

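I wrote a method that correctly rewrites the string to use hexadecimal escapes, but those escapes are themselves escaped. How can I get Python to process these hexadecimal escapes to create the final string?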

import re

hexconst=["0","1","2","3","4","5","6","7","8","9","a","b","c","d","e","f"]
escapes=re.compile(r"\\[0-9]{3}")
def dec2hex(matchobj):
    dec=matchobj.group(0)
    dec=int(dec[1:])
    digit1=dec//16 #integer division
    digit2=dec%16 
    hex="\\x" + hexconst[digit1] + hexconst[digit2]
    return hex

line=r'mj\195\164ger'
print(escapes.sub(dec2hex,line)) #Outputs mj\xc3\xa4ger

What final step do I need to convert the output above from mj\xc3\xa4ger to mjäger? Thanks!

- computermacgyver

The output of print(escapes.sub(dec2hex,line)) is mj\xc3\xa4ger, but in memory it is stored as mj\\xc3\\xa4ger. I will delete my answer, since it is similar to Tim's. - WKPlus
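For the literal question of turning the escaped text mj\xc3\xa4ger into mjäger, one possible sketch is shown here. It assumes Python 3, and it assumes the line contains only ASCII characters apart from the escapes. The variable names are illustrative and do not come from the question.

```python
# The text exactly as produced by the question's dec2hex step:
# a literal backslash, 'x', and two hex digits for each byte.
escaped = r'mj\xc3\xa4ger'

# 'unicode_escape' interprets the \xNN sequences and yields one character
# per byte value (U+00C3 and U+00A4 here), which is not yet the final text.
# Assumption: the input is pure ASCII, so encoding it as ASCII is lossless.
as_chars = escaped.encode('ascii').decode('unicode_escape')

# Latin-1 maps code points 0-255 straight back to bytes 0-255,
# and those bytes can then be decoded as UTF-8.
result = as_chars.encode('latin-1').decode('utf-8')
print(result)
```

The round trip through Latin-1 is needed because the escapes describe UTF-8 bytes, not Unicode code points.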

1 Answer

- Tim Pietzcker · Answer 1

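This is much simpler. re.sub() can take a callback function as its argument instead of a replacement string. The session below is Python 2, where chr() returns a byte string: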

>>> import re
>>> line=r'mj\195\164ger'
>>> def replace(match):
...     return chr(int(match.group(1)))
...
>>> regex = re.compile(r"\\(\d{1,3})")
>>> new = regex.sub(replace, line)
>>> new
'mj\xc3\xa4ger'
>>> print new
mjäger
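The session above uses `print new`, which is Python 2 syntax, and relies on chr() producing a raw byte. As a rough sketch of the same callback idea for a line held as a Python 3 str, the characters returned by chr() can be mapped back to bytes through Latin-1 and then decoded as UTF-8. The names ESCAPE, intermediate and text are illustrative only.

```python
import re

# A literal backslash followed by one to three decimal digits.
ESCAPE = re.compile(r"\\(\d{1,3})")

def replace(match):
    # In Python 3, chr() returns a one-character str, not a byte.
    return chr(int(match.group(1)))

line = r'mj\195\164ger'

# Each escape becomes the character with that code point (U+00C3, U+00A4).
intermediate = ESCAPE.sub(replace, line)

# Latin-1 turns code points 0-255 back into the same byte values,
# which are then interpreted as UTF-8. This assumes every escaped
# value is at most 255 and the rest of the line fits in Latin-1.
text = intermediate.encode('latin-1').decode('utf-8')
print(text)
```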

In Python 3, strings are Unicode strings, so if you are working with encoded input (such as UTF-8 encoded content), you need to use the proper type, which is bytes:

>>> line = rb'mj\195\164ger'
>>> regex = re.compile(rb"\\(\d{1,3})")
>>> def replace(match):
...     return int(match.group(1)).to_bytes(1, byteorder="big")
...
>>> new = regex.sub(replace, line)
>>> new
b'mj\xc3\xa4ger'
>>> print(new.decode("utf-8"))
mjäger
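The bytes approach above could be wrapped in a small helper, sketched here under a few assumptions: the helper name decode_decimal_escapes is hypothetical, the data is UTF-8, and, because the sample line in the question shows doubled backslashes, the pattern accepts either one or two backslashes before each number.

```python
import re

# One or two literal backslashes followed by one to three decimal digits.
# Accepting two backslashes is an assumption based on the sample line.
DECIMAL_ESCAPE = re.compile(rb"\\{1,2}(\d{1,3})")

def _to_byte(match):
    value = int(match.group(1))
    if value > 255:
        # Not a valid byte value, so leave the matched text untouched.
        return match.group(0)
    return bytes([value])

def decode_decimal_escapes(raw_line):
    """Replace decimal byte escapes in a bytes line and decode it as UTF-8."""
    return DECIMAL_ESCAPE.sub(_to_byte, raw_line).decode("utf-8")

print(decode_decimal_escapes(rb'mj\195\164ger'))
print(decode_decimal_escapes(rb'mj\\195\\164ger'))
```

To process the whole file this way, the file would be opened in binary mode ("rb") so that each line arrives as bytes.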