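Here is a list of XML and HTML character entity references: https://en.wikipedia.org/wiki/List_of_XML_and_HTML_character_entity_references. However, some names that appear in old HTML scripts are not defined in that list. While working with the
Senseval-2 format (with fixes)
dataset from http://www.d.umn.edu/~tpederse/data.html, I ran into the following words, which break my script because I am trying to parse the data with xml.etree.ElementTree.
What are the Unicode equivalents of these words?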
&and.
&and.A
&and.B
&and.D
&and.L's
&backquote.alim)
&backquote.ulema
&dash
&dash.
&dash."
&dashq.
&degree.
&degree.C
&ellip
&ellip.
&ellip.0
&ellip.1
&ellip.11
&ellip.2
&ellip.23
&ellip.28
&ellip.38
&ellip.4
&ellip.6
&ellip.64
&ellip.?"
&ellip.two
&times.
My script:
import xml.etree.ElementTree as et
s1 = 'train-fix.xml' # from http://www.d.umn.edu/~tpederse/Data/Sval1to2.fix.tar.gz
tree = et.parse(s1)
root = tree.getroot()
It fails with the following traceback:
Traceback (most recent call last):
File "senseval.py", line 4, in <module>
tree = et.parse(s1)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1182, in parse
tree.parse(source, parser)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 656, in parse
parser.feed(data)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1642, in feed
self._raiseerror(v)
File "/usr/lib/python2.7/xml/etree/ElementTree.py", line 1506, in _raiseerror
raise err
xml.etree.ElementTree.ParseError: not well-formed (invalid token): line 41, column 113
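To see which piece of text the parser is rejecting, the line and column from the ParseError can be used to pull out the offending line. The following is only a sketch: show_parse_error is a hypothetical helper name, and it assumes Python 3 with the XML already read into a str.

```python
import xml.etree.ElementTree as et


def show_parse_error(xml_text):
    """Return (line, column, text_of_that_line) for the first parse error.

    Returns None if the text parses cleanly. The line and column are the
    values reported by the parser via ParseError.position.
    """
    try:
        et.fromstring(xml_text)
    except et.ParseError as err:
        line, column = err.position
        # Parser line numbers start at 1, list indices at 0.
        return line, column, xml_text.splitlines()[line - 1]
    return None


sample = "<a>\n<b>x &dash. y</b>\n</a>"
print(show_parse_error(sample))
```

Running this over the contents of train-fix.xml should point at the same line 41 that the traceback mentions.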
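Entities end with ; rather than with a period. Entity reference: http://www.w3.org/TR/xml-entity-names/ - mata
dash may be an HTML5 character entity, but ellip is not a valid entity in anything I could find, and neither is degree. - mata
Without the terminating ;, this is not XML. - bobince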
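Since the file is not well-formed XML, one possible workaround is to rewrite these tokens before handing the text to ElementTree. The sketch below is not an authoritative mapping: the GUESSED table is my own reading of the names (for example, dash could just as well be an en dash), and treating the trailing period as the terminator is also an assumption. Names with no guess, such as dashq, are kept as literal text with the ampersand escaped so the parser accepts them. The names GUESSED, TOKEN, and fix_entities are hypothetical, and the code assumes Python 3.

```python
import re
import xml.etree.ElementTree as et

# Guessed meanings only; the dataset does not document these names here.
GUESSED = {
    "and": "&amp;",         # must stay escaped to remain valid XML
    "backquote": "`",
    "dash": "\u2014",       # assumed em dash
    "degree": "\u00b0",     # degree sign
    "ellip": "\u2026",      # horizontal ellipsis
    "times": "\u00d7",      # multiplication sign
}

# "&name" that is NOT followed by ";" (so real entities such as &amp; are
# left alone), plus an optional "." that seems to act as the terminator.
TOKEN = re.compile(r"&([A-Za-z]+)(?![A-Za-z;])\.?")


def fix_entities(text):
    """Replace pseudo-entities like '&dash.' so the text can be parsed."""
    def repl(match):
        name = match.group(1)
        if name in GUESSED:
            return GUESSED[name]
        # Unknown name: keep the original characters, escape the ampersand.
        return "&amp;" + match.group(0)[1:]
    return TOKEN.sub(repl, text)


sample = "<a>10&degree.C &and.B &ellip.23 &dashq. &amp; x&dash</a>"
root = et.fromstring(fix_entities(sample))
print(repr(root.text))
```

Whether the real file parses after this step depends on what else is in it; this only addresses the tokens listed in the question.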