ElementTree Unicode Encode/Decode Error

I'm working on a project in which I need to enhance some XML and store it in a file. The problem is that I keep getting the following error:

Traceback (most recent call last):
  File "C:\Python27\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Python27\lib\multiprocessing\process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Bart\Dropbox\Studie\2013-2014\BSc-KI\cite_parser\parser.py", line 193, in parse_references
    outputXML = ET.tostring(root, encoding='utf8', method='xml')
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1126, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 939, in _serialize_xml
    _serialize_xml(write, e, encoding, qnames, None)
 ECLI:NL:RVS:2012:BY1564
 File "C:\Python27\lib\xml\etree\ElementTree.py", line 937, in _serialize_xml
    write(_escape_cdata(text, encoding))
  File "C:\Python27\lib\xml\etree\ElementTree.py", line 1073, in _escape_cdata
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 80: ordinal not in range(128)

The error is caused by the following line:

outputXML = ET.tostring(root, encoding='utf8', method='xml')

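While looking for a solution, I found several suggestions to add .decode('utf-8') to the call, but that makes the write function raise an encoding error instead (because the string was decoded first), so that doesn't work either...
The encoding error: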
Traceback (most recent call last):
  File "C:\Python27\lib\multiprocessing\process.py", line 258, in _bootstrap
    self.run()
  File "C:\Python27\lib\multiprocessing\process.py", line 114, in run
    self._target(*self._args, **self._kwargs)
  File "C:\Users\Bart\Dropbox\Studie\2013-2014\BSc-KI\cite_parser\parser.py", line 197, in parse_references
    myfile.write(outputXML)
UnicodeEncodeError: 'ascii' codec can't encode character u'\xeb' in position 13559: ordinal not in range(128)

This was generated by the following code:
outputXML = ET.tostring(root, encoding='utf8', method='xml').decode('utf-8')

The source code (or at least the relevant part):


# URL encodes the parameters
encoded_parameters = urllib.urlencode({'id':ecli})

# Opens XML file
feed = urllib2.urlopen("http://data.rechtspraak.nl/uitspraken/content?"+encoded_parameters, timeout = 3)

# Parses the XML
ecliFile = ET.parse(feed)

# Fetches root element of current tree
root = ecliFile.getroot()

# Write the XML to a file without any extra indents or newlines
outputXML = ET.tostring(root, encoding='utf8', method='xml')

# Write the XML to the file
with open(file, "w") as myfile:
    myfile.write(outputXML)

Last but not least, here is a link to a sample of the XML: http://data.rechtspraak.nl/uitspraken/content?id=ECLI:NL:RVS:2012:BY1542


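What is the full traceback of the exception? I bet it is not ElementTree itself that triggers it. - Martijn Pieters
I just added the full tracebacks for both exceptions :) - B8vrede
I cannot reproduce the problem, at least not on Python 2.7.6. - Martijn Pieters
The UnicodeDecodeError should not occur; it means there is byte string data in the tree rather than the expected Unicode. Did you manipulate the tree and add elements? If so, make sure you add Unicode strings, not byte strings. - Martijn Pieters
Thanks Martijn, I think I found the problem; it was indeed caused by adding non-Unicode elements to the tree. Should I delete this question? - B8vrede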
I've posted it as an answer; it may be helpful for others who run into this exception. - Martijn Pieters
1 Answer

The exception is caused by a byte string value.

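text in the traceback is supposed to be a Unicode value, but if it is a plain byte string, Python 2 will first implicitly decode it (using the ASCII codec) to Unicode so that it can then be encoded again.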

It is that decoding step that fails.
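To illustrate that implicit step, here is a minimal sketch that performs the ASCII decode explicitly. The helper name encode_like_python2 is made up for this illustration, and it only approximates what Python 2 does when .encode() is called on a byte string.

```python
# A byte string holding the UTF-8 encoding of u'\xeb' (e with diaeresis).
# Its first byte is 0xc3, the same byte reported in the question's traceback.
byte_value = u'\xeb'.encode('utf-8')


def encode_like_python2(value, encoding='utf-8'):
    """Hypothetical helper: decode as ASCII first, then encode.

    This spells out the implicit decode that Python 2 performs when
    .encode() is called on a byte string instead of a Unicode string.
    """
    return value.decode('ascii').encode(encoding)


try:
    encode_like_python2(byte_value)
    error = None
except UnicodeDecodeError as exc:
    # The decode step fails before any encoding takes place.
    error = exc
```

Pure ASCII byte strings pass through this round trip unchanged, which is why the problem tends to show up only once non-ASCII data reaches the tree.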

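Because you didn't actually show what you insert into the XML tree, it is hard to tell you what to fix, other than to make sure you always use Unicode values when inserting text.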
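One way to follow that advice is to normalise every value before it goes into the tree. The sketch below is only an illustration: ensure_text is a hypothetical helper, the element and attribute names are invented, and it assumes any incoming byte strings are UTF-8 encoded, which may not hold for your data.

```python
import xml.etree.ElementTree as ET


def ensure_text(value, encoding='utf-8'):
    """Return value as Unicode text, decoding byte strings if needed.

    Hypothetical helper; the default encoding is an assumption.
    """
    if isinstance(value, bytes):
        return value.decode(encoding)
    return value


root = ET.Element('root')
child = ET.SubElement(root, 'note')

# Hypothetical byte-string input containing a non-ASCII character.
raw = u'Ge\xebrfd'.encode('utf-8')

# Decode before inserting, so the tree only ever holds Unicode values.
child.text = ensure_text(raw)
child.set('source', ensure_text(b'feed'))

output = ET.tostring(root, encoding='utf-8', method='xml')
```

Decoding at the point where data enters the tree keeps the fix in one place, rather than patching the serialised output afterwards.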

Demo:

>>> root.attrib['oops'] = u'Data with non-ASCII codepoints \u2014 (em dash)'.encode('utf8')
>>> ET.tostring(root, encoding='utf8', method='xml')
Traceback (most recent call last):
  File "<console>", line 1, in <module>
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 1126, in tostring
    ElementTree(element).write(file, encoding, method=method)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 820, in write
    serialize(write, self._root, encoding, qnames, namespaces)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 932, in _serialize_xml
    v = _escape_attrib(v, encoding)
  File "/Users/mj/Development/Library/buildout.python/parts/opt/lib/python2.7/xml/etree/ElementTree.py", line 1090, in _escape_attrib
    return text.encode(encoding, "xmlcharrefreplace")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xe2 in position 31: ordinal not in range(128)
>>> root.attrib['oops'] = u'Data with non-ASCII codepoints \u2014 (em dash)'
>>> ET.tostring(root, encoding='utf8', method='xml')
'<?xml version=\'1.0\' encoding=\'utf8\'?> ...'

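Setting a bytestring attribute that contains bytes outside the ASCII range triggers the exception; using a Unicode value instead ensures the result can be produced.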
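The UnicodeEncodeError in the question comes from the opposite direction: calling .decode('utf-8') on the serialised output produces a Unicode string, which a Python 2 file opened in text mode then tries to encode as ASCII. A minimal sketch of one way around that, assuming the tree already holds only Unicode values, is to keep the bytes returned by ET.tostring() and write them in binary mode. The element content and the temporary file path are placeholders for this example.

```python
import io
import os
import tempfile
import xml.etree.ElementTree as ET

root = ET.Element('root')
root.text = u'Ge\xebrfd'  # placeholder Unicode text with a non-ASCII character

# With an explicit encoding, tostring() returns encoded bytes.
output_xml = ET.tostring(root, encoding='utf-8', method='xml')

# Placeholder output location for this sketch.
path = os.path.join(tempfile.mkdtemp(), 'output.xml')

# Binary mode writes the bytes unchanged; no implicit encode or decode occurs.
with io.open(path, 'wb') as handle:
    handle.write(output_xml)

# Read the file back to confirm the bytes arrived untouched.
with io.open(path, 'rb') as handle:
    written = handle.read()
```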

