ElementTree Unicode encode/decode error

Question

I'm working on a project in which I need to enrich some XML and store it in a file. The problem is that I keep getting the following error:

Traceback (most recent call last):
  File "C:\Python27\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Python27\lib\multiprocessing\process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Bart\Dropbox\Studie\2013-2014\BSc-KI\cite_parser\parser.py", line 193, in parse_references
    outputXML = ET.tostring(root, encoding='utf8', method='xml')
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1126, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
 ECLI:NL:RVS:2012:BY1564
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 937, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1073, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 80: ordinal not in range(128)

The error is caused by this line:

outputXML = ET.tostring(root, encoding='utf8', method='xml')

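While looking for a solution, I found several suggestions to append .decode('utf-8') to the call, but that leads to an encode error in the write function (because the output has been decoded first), so that doesn't work...

The encode error: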

Traceback (most recent call last):
  File "C:\Python27\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Python27\lib\multiprocessing\process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Bart\Dropbox\Studie\2013-2014\BSc-KI\cite_parser\parser.py", line 197, in parse_references
    myfile.write(outputXML)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xeb' in position 13559: ordinal not in range(128)

It is produced by the following code:

outputXML = ET.tostring(root, encoding='utf8', method='xml').decode('utf-8')

The source code (or at least the relevant part):

# URL encodes the parameters
encoded_parameters = urllib.urlencode({'id':ecli})

# Opens XML file
feed = urllib2.urlopen("http://data.rechtspraak.nl/uitspraken/content?"+encoded_parameters, timeout = 3)

# Parses the XML
ecliFile = ET.parse(feed)

# Fetches root element of current tree
root = ecliFile.getroot()

# Write the XML to a file without any extra indents or newlines
outputXML = ET.tostring(root, encoding='utf8', method='xml')

# Write the XML to the file
with open(file, "w") as myfile:
    myfile.write(outputXML)

Last but not least, here is a link to a sample of the XML: http://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:RVS:2012:BY1542

- B8vrede

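What is the full traceback of the exception? I bet it isn't ElementTree itself that triggers it. - Martijn Pieters

I just added the full tracebacks for both exceptions :) - B8vrede

I cannot reproduce the problem, at least not on Python 2.7.6. - Martijn Pieters

The UnicodeDecodeError should not occur; it means there is byte string data in the tree rather than the expected Unicode. Did you manipulate the tree and add elements? If so, make sure you add Unicode strings, not byte strings. - Martijn Pieters

Thanks Martijn, I think I found the problem; it was indeed caused by adding non-Unicode elements to the tree. Should I delete this question? - B8vrede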

I've posted it as an answer; it may be helpful for others who run into this exception. - Martijn Pieters

1 Answer

- Martijn Pieters · Accepted Answer

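The exception is caused by a byte string value.

In the traceback, text is supposed to be a Unicode value, but if it is a plain byte string, Python 2 will implicitly decode it to Unicode first (using the ASCII codec) so that it can then be encoded again.

It is that decoding step that fails.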
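To make the implicit step visible, here is a minimal sketch that performs the ASCII decode explicitly. The sample string u'Zo\xeb' is just an illustrative value, not something taken from your data:

```python
# A UTF-8 encoded byte string; 0xc3 0xab is the UTF-8 encoding of u'\xeb'.
data = u'Zo\xeb'.encode('utf-8')

message = None
try:
    # Python 2 performs this decode implicitly when .encode() is called
    # on a byte string; here it is spelled out explicitly.
    data.decode('ascii')
except UnicodeDecodeError as exc:
    message = str(exc)

print(message)
```

The message names the same 'ascii' codec and the same 0xc3 lead byte as the traceback in the question, which suggests UTF-8 encoded data ended up in the tree.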

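Because you didn't actually show what you insert into the XML tree, it is hard to tell you what needs fixing, other than to make sure you always use Unicode values when inserting text.

Demo: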
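If it isn't obvious where the byte string comes from, something along these lines could help locate it. This is only a sketch: find_byte_strings and is_non_ascii_bytes are hypothetical helper names, and the small tree built at the end stands in for your real one:

```python
import xml.etree.ElementTree as ET

def is_non_ascii_bytes(value):
    """Return True if value is a byte string holding non-ASCII bytes."""
    if not isinstance(value, bytes):
        return False
    try:
        value.decode('ascii')
    except UnicodeDecodeError:
        return True
    return False

def find_byte_strings(root):
    """List (tag, location) pairs for risky byte-string values in a tree."""
    found = []
    for elem in root.iter():
        if is_non_ascii_bytes(elem.text):
            found.append((elem.tag, 'text'))
        if is_non_ascii_bytes(elem.tail):
            found.append((elem.tag, 'tail'))
        for name, value in elem.attrib.items():
            if is_non_ascii_bytes(value):
                found.append((elem.tag, '@' + name))
    return found

# Hypothetical tree: one Unicode value and one UTF-8 encoded byte string.
root = ET.Element('root')
child = ET.SubElement(root, 'child')
child.text = u'Zo\xeb'
child.set('note', u'Zo\xeb'.encode('utf-8'))

print(find_byte_strings(root))
```

Only byte strings with non-ASCII bytes are reported, since ASCII-only byte strings survive the implicit decode in Python 2 and do not trigger this exception.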

>>> root.attrib['oops'] = u'Data with non-ASCII codepoints \u2014 (em dash)'.encode('utf8')
>>> ET.tostring(root, encoding='utf8', method='xml')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 932, in _serialize_xml
    v = _escape_attrib(v, encoding)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 1090, in _escape_attrib
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 31: ordinal not in range(128)
>>> root.attrib['oops'] = u'Data with non-ASCII codepoints \u2014 (em dash)'
>>> ET.tostring(root, encoding='utf8', method='xml')
'<?xml version=\'1.0\' encoding=\'utf8\'?> ...'

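Setting an attribute to a bytestring that contains bytes outside the ASCII range triggers the exception; using a Unicode value instead ensures the result can be produced.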
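As a rough sketch of the whole round trip, assuming your extra data arrives as UTF-8 encoded byte strings: decode it before inserting, then write the serialized bytes to a file opened in binary mode. The to_text helper, the sample value, and the temporary output path are all made up for illustration:

```python
import os
import tempfile
import xml.etree.ElementTree as ET

def to_text(value, encoding='utf-8'):
    """Decode byte strings to Unicode; return other values unchanged."""
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value

root = ET.Element('root')
# Hypothetical input: a UTF-8 byte string from some external source.
raw = u'Zo\xeb'.encode('utf-8')
root.text = to_text(raw)

# With an explicit encoding, tostring() returns encoded bytes.
output_xml = ET.tostring(root, encoding='utf8', method='xml')

# Write the bytes as-is in binary mode; no .decode() call is needed,
# so no implicit ASCII encode happens in write().
path = os.path.join(tempfile.mkdtemp(), 'out.xml')
with open(path, 'wb') as handle:
    handle.write(output_xml)
```

Since the result of tostring() is already encoded, adding .decode('utf-8') is what led to the UnicodeEncodeError in write() shown in the question.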