Usage of unicode() and encode() functions in Python

Question

85

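I have a problem with the encoding of the path variable and inserting it into an SQLite database. I tried to solve it with the encode("utf-8") function, which didn't help. Then I used the unicode() function, which gives me the type unicode.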

print type(path)                  # <type 'unicode'>
path = path.replace("one", "two") # <type 'str'>
path = path.encode("utf-8")       # <type 'str'> strange
path = unicode(path)              # <type 'unicode'>

In the end I got the unicode type, but I still get the same error that occurred when the type of the path variable was str.

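sqlite3.ProgrammingError: You must not use 8-bit bytestrings unless you use a text_factory that can interpret 8-bit bytestrings (like text_factory = str). It is highly recommended that you instead just switch your application to Unicode strings.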

Could you help me solve this error and explain the correct usage of the encode("utf-8") and unicode() functions? I often struggle with it.

This execute() statement raised the error:

cur.execute("update docs set path = :fullFilePath where path = :path", locals())

I forgot to change the encoding of the fullFilePath variable, which has the same problem, but now I'm quite confused. Should I use only unicode(), or encode("utf-8"), or both?

I can't use

fullFilePath = unicode(fullFilePath.encode("utf-8"))

because it raises this error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xc5 in position 32: ordinal not in range(128)

The Python version is 2.7.2.

- xralf

2

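Your question already has an exact answer here: https://dev59.com/neo6XIcBkEYKwwoYTzAw - garnertb

Have you converted both of the variables you are using to unicode? - newtover

2

Learning how Python 3 handles text and data really helped me understand everything. After that, it was easy to apply the same knowledge to Python 2. - Oleh Prypin

Here are the slides of a great talk about Unicode in Python -- link - bachr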

3 Answers

88

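You are using encode("utf-8") incorrectly. Python byte strings (the str type) have an encoding; Unicode strings do not. You can convert a Unicode string to a Python byte string with uni.encode(encoding), and you can convert a byte string to a Unicode string with s.decode(encoding) (or, equivalently, unicode(s, encoding)).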
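
To make the direction of each call concrete, here is a minimal sketch. It is written for Python 3, where the byte-string type is named bytes and the text type is named str; the Python 2 names are noted in the comments. The file name is made up for illustration and is not taken from the question.

```python
# Python 3 sketch: bytes plays the role of Python 2's str,
# and str plays the role of Python 2's unicode.
text = u"\u0161\u0165.txt"          # hypothetical file name with non-ASCII characters

raw = text.encode("utf-8")          # text -> bytes
round_trip = raw.decode("utf-8")    # bytes -> text, using the same encoding

# Decoding with the wrong codec fails. Python 2's unicode(raw), called
# without an encoding argument, uses the default encoding (normally ASCII)
# and fails in the same way.
try:
    raw.decode("ascii")
    ascii_error = None
except UnicodeDecodeError as exc:
    ascii_error = exc

print(type(raw).__name__, type(round_trip).__name__)
print(ascii_error)
```

The point of the sketch is that encode() only makes sense on text and decode() only makes sense on bytes, and that the codec must match the bytes actually present.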

If fullFilePath and path are currently of type str, you should figure out how they are encoded. For example, if the current encoding is utf-8, you would use:

path = path.decode('utf-8')
fullFilePath = fullFilePath.decode('utf-8')

If this does not fix it, the actual issue may be that you are not using a Unicode string in your execute() call. Try changing it to the following:

cur.execute(u"update docs set path = :fullFilePath where path = :path", locals())
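
As a rough end-to-end sketch of this advice, the following uses an in-memory SQLite database. The table layout, the sample path, and the explicit params dictionary are assumptions made for the example; the question itself passes locals() and does not show its schema.

```python
import sqlite3

# In-memory database with a made-up schema that mirrors the query in the question.
conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.execute(u"create table docs (path text)")

old_raw = b"/tmp/one/\xc5\xa1.txt"   # hypothetical UTF-8 bytes, e.g. read from disk
path = old_raw.decode("utf-8")       # decode once, at the boundary
cur.execute(u"insert into docs (path) values (:path)", {"path": path})

fullFilePath = path.replace(u"one", u"two")   # still a text string
params = {"fullFilePath": fullFilePath, "path": path}
cur.execute(u"update docs set path = :fullFilePath where path = :path", params)
conn.commit()

stored = cur.execute(u"select path from docs").fetchone()[0]
print(repr(stored))
conn.close()
```

Passing an explicit dictionary instead of locals() is a design choice of this sketch: it makes it easier to see that every bound value is text rather than bytes.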

- Andrew Clark

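The statement fullFilePath = fullFilePath.decode("utf-8") still raises the error UnicodeEncodeError: 'ascii' codec can't encode characters in position 32-34: ordinal not in range(128). fullFilePath is a combination of a string of type str and a string taken from a text column of a database table, which should be utf-8 encoded. - xralf

According to this link, it can be UTF-8, UTF-16BE or UTF-16LE. Can I find out which one it is somehow? - xralf

@xralf, if you are combining different str objects you may be mixing encodings. Can you show the result of print repr(fullFilePath)? - Andrew Clark

I can only show it before the call to decode(). The problematic characters are \u0161 and \u0165. - xralf

@xralf - So it is already unicode? Try changing the execute call to use a Unicode statement: cur.execute(u"update docs set path = :fullFilePath where path = :path", locals()) - Andrew Clark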
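
The UnicodeEncodeError reported in this thread is consistent with calling decode() on a value that is already unicode: in Python 2 that first triggers an implicit ASCII encode, which fails on non-ASCII characters. One way to guard against this, sketched below, is to decode only when the value really is a byte string. The helper name to_text is hypothetical and does not come from the thread.

```python
def to_text(value, encoding="utf-8"):
    """Return value as a text string, decoding only if it is a byte string."""
    # bytes is the byte-string type in Python 3 and an alias of str in Python 2.
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


already_text = u"dir/\u0161\u0165"            # made-up sample value
as_bytes = already_text.encode("utf-8")

print(to_text(already_text) == already_text)  # text is left untouched
print(to_text(as_bytes) == already_text)      # bytes are decoded exactly once
```

Applying such a helper to both fullFilePath and path before the execute() call would normalise mixed inputs, assuming the byte strings really are UTF-8.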

1

Make sure you have set your locale correctly before running the script from the shell, e.g.

$ locale -a | grep "^en_.\+UTF-8"
en_GB.UTF-8
en_US.UTF-8
$ export LC_ALL=en_GB.UTF-8
$ export LANG=en_GB.UTF-8

Docs: man locale, man setlocale.
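
To see which encodings Python has actually picked up from the environment, a small check like the one below can be run from the same shell. This is only a sketch: the function name encoding_report is made up, and the values it prints depend entirely on the machine and the locale settings in effect.

```python
import locale
import sys


def encoding_report():
    """Collect the encodings Python derived from the environment."""
    return {
        "preferred": locale.getpreferredencoding(),
        # sys.stdout may have no usable encoding, e.g. when it is replaced.
        "stdout": getattr(sys.stdout, "encoding", None),
        "filesystem": sys.getfilesystemencoding(),
    }


report = encoding_report()
for name in sorted(report):
    print("%s: %s" % (name, report[name]))
```

If the reported values are not a UTF-8 variant after exporting LC_ALL and LANG, the locale may not be installed or the variables may not have reached the Python process.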

- kenorb

- newtover · Accepted Answer

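str is a representation of text in bytes; unicode is a representation of text in characters.

You decode text from bytes to unicode, and you encode unicode into bytes using some encoding.

That is: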

>>> 'abc'.decode('utf-8')  # str to unicode
u'abc'
>>> u'abc'.encode('utf-8') # unicode to str
'abc'

Update, September 2020: This answer was written when Python 2 was mostly used. In Python 3, str was renamed to bytes, and unicode was renamed to str.

>>> b'abc'.decode('utf-8') # bytes to str
'abc'
>>> 'abc'.encode('utf-8')  # str to bytes
b'abc'