How to convert Unicode to Unicode-escaped text

Question


3

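I'm loading a file that contains many Unicode characters (e.g. \xe9\x87\x8b). I want to convert these characters to their escaped Unicode form (\u91cb) in Python. I found several similar questions on StackOverflow, including this one, Evaluate UTF-8 literal escape sequences in a string in Python3, which does almost exactly what I want, but I can't figure out how to save the data.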

For example, the input file:

\xe9\x87\x8b

Python script:

file = open("input.txt", "r")
text = file.read()
file.close()
encoded = text.encode().decode('unicode-escape').encode('latin1').decode('utf-8')
file = open("output.txt", "w")
file.write(encoded) # fails with a unicode exception
file.close()

Desired output file:

\u91cb

- fallaciousreasoning

What does print(open('input.txt', 'rb').read()) show? Is it b'\xe9\x87\x8b' or b'\\xe9\\x87\\x8b'? - jfs

3 Answers

3

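\xe9\x87\x8b is not a Unicode character. It looks like the representation of a bytestring that encodes the Unicode character 釋 using the utf-8 character encoding. \u91cb is a representation of the 釋 character in Python source code (or in JSON format). Don't confuse the text representation with the character itself: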

>>> b"\xe9\x87\x8b".decode('utf-8')
u'\u91cb' # repr()
>>> print(b"\xe9\x87\x8b".decode('utf-8'))
釋
>>> import unicodedata
>>> unicodedata.name(b"\xe9\x87\x8b".decode('utf-8'))
'CJK UNIFIED IDEOGRAPH-91CB'
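To make the distinction concrete, here is a minimal sketch that looks at the same character in three forms: as UTF-8 bytes, as a str, and as ASCII escape text. The variable names are illustrative only.

```python
# The same character in three forms.
utf8_bytes = b"\xe9\x87\x8b"                 # 3 bytes: the UTF-8 encoding
char = utf8_bytes.decode("utf-8")            # 1 code point: U+91CB
# unicode_escape yields ASCII-only bytes, so decoding them as ASCII is safe.
escaped = char.encode("unicode_escape").decode("ascii")

print(len(utf8_bytes))   # 3
print(len(char))         # 1
print(escaped)           # \u91cb
print(len(escaped))      # 6 (backslash, 'u', and four hex digits)
```

Only the middle form is the character itself; the other two are ways of writing it down.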

To read text encoded as utf-8 from a file, specify the character encoding explicitly:

with open('input.txt', encoding='utf-8') as file:
    unicode_text = file.read()

Saving Unicode text to a file works the same way:

with open('output.txt', 'w', encoding='utf-8') as file:
    file.write(unicode_text)

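If you omit the explicit encoding parameter, locale.getpreferredencoding(False) is used, which may produce mojibake if it does not correspond to the actual character encoding used to save the file.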
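A rough way to check whether the default encoding on a given machine could even store the character is sketched below. The helper can_encode is a hypothetical name introduced for illustration, not part of any library.

```python
import locale


def can_encode(text: str, encoding: str) -> bool:
    """Return True if `text` can be encoded with `encoding` without errors."""
    try:
        text.encode(encoding)
    except UnicodeEncodeError:
        return False
    return True


# This is what open() falls back to when no encoding argument is given.
default_encoding = locale.getpreferredencoding(False)
print("default encoding:", default_encoding)
print("can store U+91CB:", can_encode("\u91cb", default_encoding))
```

If the second line prints False, a plain open("output.txt", "w") would likely raise UnicodeEncodeError on write, which may be what the script in the question ran into.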

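If your input file literally contains \xe9 (4 characters), then you should fix whatever software generates it. If you find yourself needing 'unicode-escape', something upstream is broken.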

- jfs

1

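It looks like your input file is UTF-8 encoded, so specify UTF-8 encoding when you open the file (Python 3 is assumed, based on the question you referenced):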

with open("input.txt", "r", encoding='utf8') as f:
    text = f.read()

text will contain the content of the file as a str (i.e. a Unicode string). Now you can write it to a file in Unicode-escaped form directly by specifying encoding='unicode-escape':

with open('output.txt', 'w', encoding='unicode-escape') as f:
    f.write(text)

Your file will now contain the Unicode-escaped literal:

$ cat output.txt
\u91cb
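One caveat worth knowing: the unicode-escape codec escapes more than CJK characters. Newlines and other control characters are escaped too, and characters in the U+0080..U+00FF range come out as \xNN rather than \uNNNN. The sketch below writes to a temporary file (the sample text and file name are made up for illustration) and reads the raw bytes back.

```python
import os
import tempfile

text = "\u91cb\n\u00e9"  # 釋, a newline, then é

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "output.txt")
    # newline="\n" turns off platform newline translation so the bytes
    # are the same on every OS.
    with open(path, "w", encoding="unicode-escape", newline="\n") as f:
        f.write(text)
    with open(path, "rb") as f:
        raw = f.read()

print(raw)  # b'\\u91cb\\n\\xe9'

# Decoding with the same codec recovers the original text.
print(raw.decode("unicode_escape") == text)  # True
```

So a multi-line input ends up on a single line, with each line break written as the two characters \n.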

- mhawke


- falsetru · Accepted Answer

You need to encode it again using the unicode-escape encoding.

>>> br'\xe9\x87\x8b'.decode('unicode-escape').encode('latin1').decode('utf-8')
'釋'
>>> _.encode('unicode-escape')
b'\\u91cb'

Modified code (using binary mode to avoid unnecessary encoding and decoding):

with open("input.txt", "rb") as f:
    text = f.read().rstrip()  # rstrip to remove trailing spaces
decoded = text.decode('unicode-escape').encode('latin1').decode('utf-8')
with open("output.txt", "wb") as f:
    f.write(decoded.encode('unicode-escape'))
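The chained calls can be hard to follow, so here is the same pipeline split into commented steps. This is only a sketch: the function name is hypothetical, and it assumes the input consists solely of ASCII text with literal \xNN escapes that spell out valid UTF-8.

```python
def escaped_utf8_to_unicode_escape(raw: bytes) -> bytes:
    """Convert literal b'\\xe9\\x87\\x8b' style text into b'\\u91cb' style text."""
    # Step 1: interpret each literal "\xNN" escape; every escape becomes
    # one code point in the range U+0000..U+00FF.
    code_points = raw.decode("unicode_escape")
    # Step 2: latin-1 maps code points 0..255 to the identical byte values,
    # which recovers the real UTF-8 bytes.
    utf8_bytes = code_points.encode("latin-1")
    # Step 3: decode the UTF-8 bytes into the actual characters.
    text = utf8_bytes.decode("utf-8")
    # Step 4: write the characters back out as ASCII-only escapes.
    return text.encode("unicode_escape")


print(escaped_utf8_to_unicode_escape(rb"\xe9\x87\x8b"))  # b'\\u91cb'
```

If the escapes do not form valid UTF-8, step 3 raises UnicodeDecodeError, so callers may want to handle that case.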

http://asciinema.org/a/797ruy4u5gd1vsv8pplzlb6kq