Convert illegal XML characters to UTF-8 - Python

Question

5

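Here is a list of XML and HTML character entity references: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references. However, some things that are used in old HTML scripts are not defined in that list. While working with the Senseval-2 format (with fixes) dataset from http://www.d.umn.edu/~tpederse/data.html, I ran into the following words, which broke my script because I was trying to parse the data with xml.etree.ElementTree. What are the Unicode equivalents of these words?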

&and.
&and.A
&and.B
&and.D
&and.L's
&backquote.alim)
&backquote.ulema
&dash
&dash.
&dash."
&dashq.
&degree.
&degree.C
&ellip
&ellip.
&ellip.0
&ellip.1
&ellip.11
&ellip.2
&ellip.23
&ellip.28
&ellip.38
&ellip.4
&ellip.6
&ellip.64
&ellip.?"
&ellip.two
&times.

My script:

import xml.etree.ElementTree as et
s1 = 'train-fix.xml' # from http://www.d.umn.edu/~tpederse/Data/Sval1to2.fix.tar.gz
tree = et.parse(s1)
root = tree.getroot()

It fails with the following traceback:

Traceback (most recent call last):
  File "senseval.py", line 4, in <module>
    tree = et.parse(s1)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1182, in parse
    tree.parse(source, parser)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
    parser.feed(data)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
    self._raiseerror(v)
  File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
    raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 41, column 113

- alvas

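Those aren't XML entities; entities should end with ; rather than . (entity reference: http://www.w3.org/TR/xml-entity-names/) - mata

Do you know what they are? - alvas

Not really. dash might be an HTML5 character entity, but ellip is not a valid entity anywhere I could find, and neither is degree. - mata

The DTD files linked from that page contain a list of entities, but no actual character definitions. As for the error, etree is right: without the trailing ;, this is not XML. - bobince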
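
The point made in the comments above can be checked with a minimal sketch. It assumes Python 3 and uses only the standard library; the helper name parses and the sample strings are made up for illustration and are not taken from the dataset.

```python
import xml.etree.ElementTree as ET


def parses(xml_text):
    """Return True if ElementTree accepts xml_text, False on a ParseError."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False


# A predefined entity with its trailing semicolon is accepted.
print(parses("<s>a &amp; b</s>"))
# A reference ending in '.' instead of ';' is not well-formed.
print(parses("<s>a &dash. b</s>"))
# Adding the semicolon alone is not enough: 'dash' is not declared anywhere.
print(parses("<s>a &dash; b</s>"))
```

The last case suggests that simply swapping . for ; would not be sufficient on its own, since the parser would still need a definition for each name.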

4 Answers

3

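I found an answer that lets you parse your XML with Python's lxml package:

Getting data with Python and lxml

Install the lxml package from here: http://lxml.de/

Then use the following code: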

import lxml.html
root = lxml.html.parse('train-fix.xml').getroot()

Hope it works for you.

- wilfo

+1 for the lxml parser, but it doesn't answer the question of what those characters are =( - alvas

3

The basic but disappointing answer is: they are typos (using . instead of ;).

Here are most of them:

times → http://www.fileformat.info/info/unicode/char/d7/index.htm
degree → http://www.fileformat.info/info/unicode/char/b0/index.htm
dash → http://www.fileformat.info/info/unicode/char/search.htm?q=dash&preview=entity
ellip → http://www.fileformat.info/info/unicode/char/2026/index.htm

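...and so on. For some of them you will have to look at the context to judge whether the author of the original text meant something specific or simply made a typo (dashq‽).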
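
One way to look at that context is sketched below, assuming Python 3. The regular expression, the function name entity_contexts and the sample string are illustrative assumptions, not part of the dataset's tooling; the pattern matches an ampersand followed by letters that is not terminated by a semicolon.

```python
import re
from collections import Counter

# '&' + letters, not followed by another letter or ';', plus an optional '.'
PSEUDO_ENTITY = re.compile(r"&([A-Za-z]+)(?![A-Za-z;])\.?")


def entity_contexts(text, width=30):
    """Yield (name, snippet) for every malformed reference found in text."""
    for match in PSEUDO_ENTITY.finditer(text):
        start = max(match.start() - width, 0)
        end = min(match.end() + width, len(text))
        yield match.group(1), text[start:end]


# Made-up sample line; a real run would read the contents of train-fix.xml.
sample = "Baden-Wu&umlaut.rttemberg at 20&degree.C &dash. or so &dashq. he said"
counts = Counter(name for name, _ in entity_contexts(sample))

for name, snippet in entity_contexts(sample, width=10):
    print(name, repr(snippet))
```

Counting the names first shows which ones are frequent enough to deserve a mapping and which (such as dashq) are probably one-off typos.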

Your best course of action is to fix the mess with a series of simple string replace calls before parsing.
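
A minimal sketch of that clean-up, assuming Python 3. It uses a single regular expression rather than a chain of str.replace calls, so that variants such as &dash and &dash. are handled together. Only times, degree and ellip come from the list above; the choices for dash and and are guesses, and the names REPLACEMENTS and fix_pseudo_entities are made up for this example.

```python
import re
import xml.etree.ElementTree as ET

REPLACEMENTS = {
    "times": "\u00d7",   # multiplication sign
    "degree": "\u00b0",  # degree sign
    "ellip": "\u2026",   # horizontal ellipsis
    "dash": "\u2014",    # assumption: em dash
    "and": "&amp;",      # assumption: a literal ampersand, kept escaped for XML
}

# '&' + letters, not followed by another letter or ';', plus an optional '.'
PSEUDO_ENTITY = re.compile(r"&([A-Za-z]+)(?![A-Za-z;])\.?")


def fix_pseudo_entities(text):
    """Replace known pseudo-entities; escape the '&' of unknown ones."""
    def substitute(match):
        name = match.group(1)
        if name in REPLACEMENTS:
            return REPLACEMENTS[name]
        # Unknown name: keep the text but make the ampersand legal XML.
        return "&amp;" + match.group(0)[1:]
    return PSEUDO_ENTITY.sub(substitute, text)


# Made-up sample; a real run would read train-fix.xml as text first.
fixed = fix_pseudo_entities("<s>A &and. B at 20&degree.C&ellip.</s>")
print(ET.fromstring(fixed).text)
```

Well-formed references such as &amp; are left untouched because the pattern refuses to match when a semicolon follows the name. This only addresses the entity references; other well-formedness problems in the file would still need separate fixes.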

- jhermann

2

If you have Linux available, use xmllint to find the errors and fix them.

xmllint --recover ~/tmp/test-fix.xml --output ~/tmp/test-fix-fixed.xml 
/home/luis/tmp/test-fix.xml:179: parser error : EntityRef: expecting ';'
inate, Hesse and the Saarland; North Rhine-Westphalia, Baden-Wu&umlaut.rttemberg
                                                                           ^
/home/luis/tmp/test-fix.xml:179: parser error : EntityRef: expecting ';'
Bavaria would remain untouched, and the planned five East German La&umlaut.nder
...
/home/luis/tmp/test-fix.xml:3832: parser error : EntityRef: expecting ';'
Charlie Watts today) we should be ready to hit the road together as Lyndon &and.
                                                                           ^
/home/luis/tmp/test-fix.xml:3841: parser error : Opening and ending tag mismatch: corpus line 1 and lexelt
</lexelt>
     ^
/home/luis/tmp/test-fix.xml:3842: parser error : Extra content at the end of the document
<lexelt item="behaviour-n">


                                                                           ^

- LMC

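- mzjn · Accepted Answer

The "words" look like malformed entity references. A valid entity reference ends with a semicolon. I looked at test-fix.xml (in Sval1to2.fix.tar.gz), and it seems very likely that &dash (or &dash.) stands for some kind of dash or hyphen. The file has an .xml extension, and with the bad entity references fixed it would be very close to well-formed XML.

The page you linked to (http://www.d.umn.edu/~tpederse/data.html) says: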

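"Please note that our converted data will not "parse" as true xml text. This is because characters in the original markup text that need special handling in xml have not been escaped, etc. We are looking into making this data "real" xml and would greatly appreciate any feedback on how best to accomplish that."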

So even though the document looks very much like XML, it is not XML, and the people who published it are well aware of that.