Unicode decoding problem

Question

This is interesting. I am trying to read geolocation lookup data from OpenStreetMap. The code that performs the query looks like this:

# Python 2; full_address is a sequence of address parts defined elsewhere
import json
import time
import urllib

params = urllib.urlencode({'q': ",".join(full_address), 'format': "json", "addressdetails": "1"})
query = "http://nominatim.openstreetmap.org/search?%s" % params
print query
time.sleep(5)
response = json.loads(unicode(urllib.urlopen(query).read(), "UTF-8"), encoding="UTF-8")
print response

The query for Zürich is correctly URL-encoded from the UTF-8 data. Nothing strange here.

http://nominatim.openstreetmap.org/search?q=Z%C3%BCrich%2CSWITZERLAND&addressdetails=1&format=json
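As a quick check of that claim, here is a minimal sketch of the same encoding step. It assumes Python 3 (where urlencode lives in urllib.parse and encodes text as UTF-8 by default) rather than the Python 2 call used above, and the parameter values are only illustrative:

```python
# Python 3 sketch; the snippet above uses Python 2's urllib.urlencode.
from urllib.parse import urlencode, parse_qs

# urlencode percent-encodes the UTF-8 bytes of each value.
params = urlencode({"q": "Zürich,SWITZERLAND", "format": "json"})
print(params)

# Decoding the query string gives the original text back.
decoded = parse_qs(params)
print(decoded["q"][0])
```

The ü becomes %C3%BC, which is its two-byte UTF-8 form, matching the URL shown above.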

When I print the response, the u with umlaut is encoded as Latin-1 (0xFC).

[{u'display_name': u'Z\xfcrich, Bezirk Z\xfcrich, Z\xfcrich, Schweiz, Europe', u'place_id': 588094, u'lon': 8.540443

But this makes no sense, because OpenStreetMap returns the JSON data as UTF-8.

Connecting to nominatim.openstreetmap.org (nominatim.openstreetmap.org)|128.40.168.106|:80... connected.
HTTP request sent, awaiting response... 
  HTTP/1.1 200 OK
  Date: Wed, 26 Jan 2011 13:48:33 GMT
  Server: Apache/2.2.14 (Ubuntu)
  Content-Location: search.php
  Vary: negotiate
  TCN: choice
  X-Powered-By: PHP/5.3.2-1ubuntu4.7
  Access-Control-Allow-Origin: *
  Content-Length: 3342
  Keep-Alive: timeout=15, max=100
  Connection: Keep-Alive
  Content-Type: application/json; charset=UTF-8
Length: 3342 (3.3K) [application/json]

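The file contents confirm this as well, and on top of that I explicitly specify UTF-8 both when reading the response and when parsing the JSON.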

What is going on here?

Edit: apparently it is json.loads that goes wrong.

- Stefano Borini

2 Answers

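The output is fine. When you print data to the console, Python encodes Unicode only if the object being printed is itself a string. If you print a list of Unicode strings, each string in it is shown the way repr() would show it: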

>>> a=u'á'
>>> a
u'\xe1'
>>> print a
á
>>> [a]
[u'\xe1']
>>> print [a]
[u'\xe1']
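For readers on Python 3, where repr() no longer escapes printable non-ASCII characters, the same distinction can be reproduced with the built-in ascii(). This is only a sketch under that assumption, and the variable names are illustrative:

```python
# Python 3 sketch: ascii() yields an escaped form comparable to Python 2's repr().
a = u"\xe1"                  # the single character á (U+00E1)
items = [a]

as_list = ascii(items)       # escaped representation of the whole list
as_text = ", ".join(items)   # the actual characters, ready for printing

print(as_list)
print(as_text)
print(len(a))                # one character, however it is displayed
```

Either way the list holds the same one-character string; only the display differs.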

- vz0

- etarion · Accepted Answer

"When I print the response, the u with umlaut is encoded as Latin-1 (0xFC)."

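You are simply misreading the output. It is a Unicode string (you can tell from the u prefix) and has no encoding "attached" to it: \xfc means the code point numbered 0xFC, which happens to be the u with umlaut, ü (see http://www.fileformat.info/info/unicode/char/fc/index.htm). It looks like Latin-1 only because the first 256 Unicode code points have the same numbers as the corresponding Latin-1 bytes.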
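The difference between a code point and its encoded bytes can be sketched as follows. This assumes Python 3, and the JSON payload is a made-up one-field stand-in for the real Nominatim response:

```python
import json

s = u"Z\xfcrich"                 # a Unicode string, no encoding attached

code_point = ord(s[1])           # 0xFC, the code point of ü
latin1 = s.encode("latin-1")     # b'Z\xfcrich'     (one byte for ü)
utf8 = s.encode("utf-8")         # b'Z\xc3\xbcrich' (two bytes for ü)

# The first 256 code points map to the identical byte value in Latin-1.
same = all(chr(i).encode("latin-1") == bytes([i]) for i in range(256))

# Parsing a UTF-8 encoded JSON payload (hypothetical) gives the same string.
payload = b'{"display_name": "Z\xc3\xbcrich"}'
parsed = json.loads(payload.decode("utf-8"))["display_name"]
```

Here parsed equals s, so the \xfc in the printed result reflects how the string is displayed, not a Latin-1 decoding error.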

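In short, you did everything right: you have a Unicode object with the correct content (independent of any encoding). When you output that content somewhere, you can choose whichever encoding you want, either with unicodestr.encode("utf-8") or by using codecs; see http://docs.python.org/howto/unicode.html#reading-and-writing-unicode-data.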
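A minimal sketch of both output options follows. It uses io.open in place of the codecs module mentioned above, since io.open accepts an encoding argument in both Python 2 and Python 3; the file name and the sample text are assumptions for illustration:

```python
import io
import os
import tempfile

text = u"Z\xfcrich, Schweiz"     # Unicode content, independent of any encoding

# Option 1: encode explicitly at the point of output.
data = text.encode("utf-8")

# Option 2: let the file object encode on write (io.open swapped in for codecs).
path = os.path.join(tempfile.mkdtemp(), "place.txt")  # hypothetical file name
with io.open(path, "w", encoding="utf-8") as f:
    f.write(text)

# Reading the file back as raw bytes shows the UTF-8 form on disk.
with io.open(path, "rb") as f:
    raw = f.read()
```

Both routes produce the same bytes; the choice of encoding is made only when the text leaves the program.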