Python: convert 4-byte characters to avoid the MySQL error "Incorrect string value:"

Question

python mysql utf-8 character-encoding python-unicode

7

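I need to convert (in Python) a 4-byte character into some other character. This is so that I can insert it into my utf-8 MySQL database without getting an error such as: "Incorrect string value: '\xF0\x9F\x94\x8E' for column 'line' at row 1". The suggested way to do this is: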

>>> import re
>>> highpoints = re.compile(u'[\U00010000-\U0010ffff]')
>>> example = u'Some example text with a sleepy face: \U0001f62a'
>>> highpoints.sub(u'', example)
u'Some example text with a sleepy face: '

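However, I get the same error as the user in the comments: "...bad character range..". Apparently this is because my Python is a UCS-2 (not UCS-4) build. But then I am not clear on what I should do instead.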

- user984003

Does the problem persist if you use the utf8mb4 character set in MySQL? - Janne Karila

Not sure. Unfortunately, I can't change the character set of the database. - user984003

1 Answer

- Martijn Pieters · Accepted Answer

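On a UCS-2 build, Python internally uses 2 code units for each Unicode character above the \U0000ffff code point. Regular expressions have to work with those code units, so you need the following regular expression to match such characters: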

highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')

This regular expression matches any code point encoded with a UTF-16 surrogate pair (see UTF-16 Code points U+10000 to U+10FFFF).
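To illustrate why the two character classes line up with the ranges U+D800..U+DBFF and U+DC00..U+DFFF, here is a minimal sketch of the standard UTF-16 surrogate-pair arithmetic. The helper name `to_surrogate_pair` is made up for this example and is not part of any library.

```python
def to_surrogate_pair(code_point):
    """Split a code point above U+FFFF into its UTF-16 (high, low) surrogates."""
    if not 0x10000 <= code_point <= 0x10FFFF:
        raise ValueError("code point is not in the supplementary planes")
    offset = code_point - 0x10000       # leaves a 20-bit value
    high = 0xD800 + (offset >> 10)      # top 10 bits land in U+D800..U+DBFF
    low = 0xDC00 + (offset & 0x3FF)     # bottom 10 bits land in U+DC00..U+DFFF
    return high, low


# The sleepy face from the question, U+1F62A
high, low = to_surrogate_pair(0x1F62A)
print(hex(high), hex(low))              # 0xd83d 0xde2a
```

On a UCS-2 build the string u'\U0001f62a' is stored as exactly these two code units, which is why the pattern has to match a pair rather than a single character.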

To make this compatible with both UCS-2 and UCS-4 builds of Python, you could use a try:/except to pick one or the other:

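import re

try:
    # UCS-4 (wide) build: the full range compiles as a single character class
    highpoints = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # UCS-2 (narrow) build: match each surrogate pair instead
    highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')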
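Since the question asks about converting these characters to something else before the insert, the pattern could be wrapped in a small helper along these lines. This is only a sketch: the names `replace_4byte_chars` and `_HIGHPOINTS` are hypothetical, and the choice of U+FFFD as a visible placeholder is an assumption, not something the question requires.

```python
import re

try:
    # Wide (UCS-4) builds, and Python 3.3+ where the narrow/wide split is gone
    _HIGHPOINTS = re.compile(u'[\U00010000-\U0010ffff]')
except re.error:
    # Narrow (UCS-2) builds: fall back to matching surrogate pairs
    _HIGHPOINTS = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')


def replace_4byte_chars(text, replacement=u''):
    """Return text with every character above U+FFFF replaced (removed by default)."""
    return _HIGHPOINTS.sub(replacement, text)


# Drop the character entirely, or leave a marker where it used to be
cleaned = replace_4byte_chars(u'magnifier: \U0001f50e')
marked = replace_4byte_chars(u'magnifier: \U0001f50e', u'\ufffd')
```

Passing the cleaned string to the database driver instead of the raw one should then avoid the "Incorrect string value" error for these characters.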

Demonstration on a UCS-2 build of Python:

>>> import re
>>> highpoints = re.compile(u'[\uD800-\uDBFF][\uDC00-\uDFFF]')
>>> example = u'Some example text with a sleepy face: \U0001f62a'
>>> highpoints.sub(u'', example)
u'Some example text with a sleepy face: '
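As a sanity check, one could confirm that the cleaned text only contains characters that fit in 3 UTF-8 bytes, which is the most that MySQL's legacy `utf8` (utf8mb3) character set stores per character. The sketch below assumes Python 3, where the single-range pattern always compiles; `max_utf8_width` is a hypothetical helper name.

```python
import re

# Assumes Python 3, so the full range is a valid character class
highpoints = re.compile('[\U00010000-\U0010ffff]')


def max_utf8_width(text):
    """Return the largest number of UTF-8 bytes any single character needs."""
    return max((len(ch.encode('utf-8')) for ch in text), default=0)


example = 'Some example text with a sleepy face: \U0001f62a'
cleaned = highpoints.sub('', example)

print(max_utf8_width(example))  # 4
print(max_utf8_width(cleaned))  # 1
```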