How do I replace all non-UTF-8 characters in a string in Python?

Question

python mysql encoding utf-8

5

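Update: The real problem is that MySQL's utf8 character set does not support four-byte UTF-8 characters.

There are several questions on this topic, but none of them seems to be exactly the same as mine, except perhaps this one, where the accepted answer did not work for me.

I am writing Python code with the MySQLdb module and want to put some text into a MySQL database. The database is configured for UTF-8, but the text occasionally contains ~~non-UTF-8~~ four-byte UTF-8 characters.

The Python code that modifies the database looks like this: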

connection = MySQLdb.connect(
    'localhost',
    'root',
    '',
    'mydatabase',
    charset='utf8',
    use_unicode=True)
cursor = connection.cursor()
cursor.execute(
    'update mytable set entryContent=%s where entryName=%s',
    (entryContent, entryName))
connection.commit()

Currently it produces the following warnings:

./myapp.py:233: Warning: Invalid utf8 character string: 'F09286'
  (entry, word))
./myapp.py:233: Warning: Incorrect string value: '\xF0\x92\x86\xB7\xF0\x92...' for column 'entry' at row 1
  (entryname, entrycontent))

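When I use the mysql command-line client to look at what actually made it into the database, I find that the content is truncated at the first ~~non-UTF-8~~ four-byte UTF-8 character.

I do not care about preserving the ~~non-UTF-8~~ four-byte UTF-8 characters, so all I want to do is replace every ~~non-UTF-8~~ four-byte UTF-8 character with some other valid UTF-8 character, so that I can put the text into the database.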
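
For reference, the bytes quoted in the warning can be checked directly. The following is a minimal sketch in Python 3 (the code in this thread is Python 2, and the variable names here are only illustrative); it suggests the bytes are valid UTF-8 that simply needs four bytes:

```python
# Bytes copied from the MySQL warning: '\xF0\x92\x86\xB7'
raw = b'\xf0\x92\x86\xb7'

# Decoding succeeds, so the data is valid UTF-8.
char = raw.decode('utf-8')

# The code point lies above U+FFFF (outside the Basic Multilingual Plane),
# which is why UTF-8 needs four bytes to encode it.
code_point = ord(char)
encoded_length = len(char.encode('utf-8'))

print(hex(code_point), encoded_length)
```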

- davidrmcharles

entry.decode().encode('ascii', 'replace') - Peter Wood

@Peter Wood: 'Cognates include Hittite ‎(lāman)'.decode().encode('ascii', 'replace') raises UnicodeDecodeError: 'ascii' codec can't decode byte 0xf0 in position 25: ordinal not in range(128). - davidrmcharles

Sorry, 'Cognates include Hittite ‎(lāman)'.decode('utf-8').encode('ascii', 'replace') gives 'Cognates include Hittite ???????? ?(l?man)'. - Peter Wood

3 Answers

3

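It turned out that the problem was not that I was feeding MySQL non-UTF-8 characters, but that I was feeding it four-byte UTF-8 characters, whereas its utf8 character set supports only characters of three bytes or fewer (according to this documentation).

This solution keeps every supported UTF-8 character and converts the unsupported ones to '?':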

>>> print ''.join([c if len(c.encode('utf-8')) < 4 else '?' for c in u'Cognates include Hittite  ‎(lāman)'])
Cognates include Hittite ???? ‎(lāman)

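Note that 'ā' has been preserved.
Note that the four-byte characters became '????'.

I can put this string into MySQL without the warnings (and the unwanted truncation) described above.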
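
The same idea can be sketched for Python 3, where strings are already Unicode. This is only an illustration under that assumption: the function name `replace_four_byte_chars` and the sample text are made up here, and the regular expression relies on the fact that characters needing four bytes in UTF-8 are exactly those above U+FFFF.

```python
import re

# Matches any code point outside the Basic Multilingual Plane; these are
# the characters that UTF-8 encodes with four bytes.
_FOUR_BYTE_CHARS = re.compile('[\U00010000-\U0010FFFF]')


def replace_four_byte_chars(text, replacement='?'):
    """Return text with every four-byte UTF-8 character replaced."""
    return _FOUR_BYTE_CHARS.sub(replacement, text)


# Hypothetical sample: one cuneiform sign (U+121B7) plus a two-byte 'ā'.
sample = 'Hittite \U000121b7 (l\u0101man)'
print(replace_four_byte_chars(sample))
```

Characters of three bytes or fewer, such as 'ā', pass through unchanged, so the result should be storable in a MySQL utf8 column.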

- davidrmcharles

Yes, the real problem was that I did not know about utf8mb4. - davidrmcharles

2

Could you use a regular expression to remove the non-ASCII characters? Using the example you gave in the comments:

>>> entry = 'Cognates include Hittite  ‎(lāman)'
>>> entry = ''.join([char if ord(char) < 128 else '' for char in entry])
>>> print entry
Cognates include Hittite  (lman)

This is a slight variation on this answer to a different question.
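
The snippet above uses a comprehension, not a regular expression. An actual regex version might look like the following Python 3 sketch (the helper name `strip_non_ascii` is hypothetical). Like the comprehension, it also drops characters such as 'ā'.

```python
import re


def strip_non_ascii(text):
    """Remove every character whose code point is 128 or above."""
    return re.sub(r'[^\x00-\x7F]', '', text)


print(strip_non_ascii('(l\u0101man)'))
```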

- cyril

I want to keep the (three-byte) UTF-8 characters, but this answer removes a lot of UTF-8 characters. - davidrmcharles

- Alastair McCormack · Accepted Answer

You need to set the table's encoding to utf8mb4 to support four-byte UTF-8 characters - https://dev.mysql.com/doc/refman/5.5/en/charset-unicode-utf8mb4.html
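
As a rough sketch of what that could involve, assuming the table and database names from the question: the table is converted with an `ALTER TABLE ... CONVERT TO CHARACTER SET` statement, and the connection is opened with `charset='utf8mb4'`. The `utf8mb4_unicode_ci` collation and the helper name `convert_table` are assumptions made for illustration, and converting a real table is worth trying on a copy first.

```python
# Statement that converts the existing table and its text columns.
# The collation chosen here is an assumption; pick one that suits the data.
ALTER_SQL = (
    "ALTER TABLE mytable "
    "CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci"
)

# Connection settings from the question, with only the charset changed.
CONNECT_KWARGS = dict(
    host='localhost',
    user='root',
    passwd='',
    db='mydatabase',
    charset='utf8mb4',
    use_unicode=True,
)


def convert_table():
    """Run the conversion once; needs MySQLdb and a reachable server."""
    import MySQLdb  # imported lazily so the constants can be used without it

    connection = MySQLdb.connect(**CONNECT_KWARGS)
    try:
        cursor = connection.cursor()
        cursor.execute(ALTER_SQL)
    finally:
        connection.close()
```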

In addition, the MySQL driver supports Unicode strings, so you should pass Unicode to it and keep encoding-specific details out of your code.

For example:

cursor.execute(u'update mytable set entryContent=%s where entryName=%s',
(entryContent.decode("utf-8"), entryName.decode("utf-8")))

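Ideally, entryContent and entryName should already have been decoded to Unicode at the point where your code first receives them, for example when opening a file or reading from the network.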
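
A minimal Python 3 sketch of decoding at the boundary, with a hypothetical helper name `to_text`: passing `errors='replace'` also covers the case in the question title, since any byte sequence that is not valid UTF-8 is turned into U+FFFD instead of raising an error.

```python
def to_text(raw_bytes):
    """Decode bytes received from a file or socket into a Unicode string.

    Byte sequences that are not valid UTF-8 become U+FFFD, the Unicode
    replacement character, instead of raising UnicodeDecodeError.
    """
    return raw_bytes.decode('utf-8', errors='replace')


print(to_text(b'l\xc4\x81man'))        # valid UTF-8 is decoded normally
print(repr(to_text(b'abc\xff')))       # the stray 0xff byte is replaced
```

When reading text files in Python 3, `open(path, encoding='utf-8', errors='replace')` performs the same decoding as the file is read.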