Actually, I can already do this for some files. That is, for many of my XML files the process works fine and I get the output I want. The code that does this is:
import os, re, csv, string, operator
import xml.etree.cElementTree as ET
import codecs

def parseEO(doc):
    # getting the basic structure
    tree = ET.ElementTree(file=doc)
    root = tree.getroot()
    agencycodes = []
    rins = []
    titles = []
    elements = [agencycodes, rins, titles]
    # pulling in the text from the fields
    for elem in tree.iter():
        if elem.tag == "AGENCY_CODE":
            agencycodes.append(int(elem.text))
        elif elem.tag == "RIN":
            rins.append(elem.text)
        elif elem.tag == "TITLE":
            titles.append(elem.text)
    # writing one CSV row per (agency code, RIN, title) triple
    with open('parsetest.csv', 'w') as f:
        writer = csv.writer(f)
        writer.writerows(zip(*elements))

parseEO('EO_file.xml')
However, with some versions of the input file, I hit the infamous error:
'ascii' codec can't encode character u'\x97' in position 32: ordinal not in range(128)
The full traceback is:
---------------------------------------------------------------------------
UnicodeEncodeError Traceback (most recent call last)
<ipython-input-15-28d095d44f02> in <module>()
----> 1 execfile(r'/parsingtest.py') # PYTHON-MODE
/Users/ian/Desktop/parsingtest.py in <module>()
91 writer.writerows(zip(*elements))
92
---> 93 parseEO('/EO_file.xml')
94
95
/parsingtest.py in parseEO(doc)
89 with open('parsetest.csv', 'w') as f:
90 writer = csv.writer(f)
---> 91 writer.writerows(zip(*elements))
92
93 parseEO('/EO_file.xml')
UnicodeEncodeError: 'ascii' codec can't encode character u'\x97' in position 32: ordinal not in range(128)
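From reading other threads, I'm fairly confident that the problem lies with the codec being used (and, you know, the error message is pretty clear about that too). However, the solutions I've read haven't helped me (emphasis on "me", since I understand that I am the source of the problem, not the way people have answered it before).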
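To make the error message concrete, here is a minimal sketch that reproduces the same failure outside of ElementTree and csv. The sample string is hypothetical (it is not taken from the actual XML files); it simply contains the character u'\x97' named in the traceback. The sketch is written so that it should run on both Python 2 and Python 3.

```python
# Hypothetical sample value containing the character from the traceback.
text = u"Title \x97 subtitle"

# Encoding to ASCII fails, because ASCII only covers code points 0-127.
try:
    text.encode("ascii")
    ascii_ok = True
except UnicodeEncodeError as exc:
    ascii_ok = False
    failing_position = exc.start  # index of the first character that cannot be encoded

# Encoding the same text to UTF-8 succeeds.
utf8_bytes = text.encode("utf-8")
```

This suggests the failure comes from an encode step (text being turned into bytes), not from a decode step (bytes being read into text).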
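Some responses (such as this one, this one, and this one) don't deal with ElementTree directly, and I'm not sure how to translate those solutions into what I'm doing.
Other solutions that do deal with ElementTree (such as this one and this one) either use a short string (the first link here) or use the .tostring/.fromstring methods in ElementTree, which I do not use. (Though, of course, perhaps I should.)
Things I have tried that have not worked: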
I have attempted to bring in the file via UTF-8 encoding:
infile = codecs.open('/EO_file.xml', encoding="utf-8")
parseEO(infile)
but I think ElementTree already understands the file to be UTF-8 (this is declared in the first line of all the XML files I have), so this is not only incorrect but also redundant.
I attempted to specify the encoding explicitly inside the function, replacing:
tree = ET.ElementTree(file=doc)
with
parser = ET.XMLParser(encoding="utf-8")
tree = ET.parse(doc, parser=parser)
in the function above (the version that does work for some files). This didn't work for me either: the same files that worked before still worked, and the same files that produced the error still produced it.
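Both attempts above change how the file is read, whereas the traceback points at writer.writerows, that is, at the writing step. A minimal sketch of handling the encoding on the output side instead, assuming the goal is a UTF-8 encoded CSV file: the helper name write_rows_utf8, the sample row, and the temporary output path are all hypothetical. On Python 2 the csv module does not support Unicode input, so each unicode cell is encoded to UTF-8 bytes before writing; on Python 3 the file is opened with an explicit encoding instead.

```python
import csv
import io
import os
import sys
import tempfile

def write_rows_utf8(path, rows):
    """Write rows to a CSV file encoded as UTF-8 (hypothetical helper)."""
    if sys.version_info[0] >= 3:
        # Python 3: csv works with text; the file object does the encoding.
        with io.open(path, "w", encoding="utf-8", newline="") as f:
            csv.writer(f).writerows(rows)
    else:
        # Python 2: csv works with byte strings, so encode unicode cells first.
        with open(path, "wb") as f:
            writer = csv.writer(f)
            for row in rows:
                writer.writerow([
                    cell.encode("utf-8") if isinstance(cell, unicode) else cell
                    for cell in row
                ])

# Hypothetical row shaped like (agency code, RIN, title).
rows = [(1234, u"0000-AA00", u"Title \x97 subtitle")]
demo_path = os.path.join(tempfile.mkdtemp(), "parsetest.csv")
write_rows_utf8(demo_path, rows)

# Read the raw bytes back to see what was actually written.
with open(demo_path, "rb") as f:
    written = f.read()
```

In parseEO, the equivalent change would be to pass zip(*elements) to a helper like this rather than to csv.writer directly. Whether UTF-8 is the right output encoding depends on what will read the CSV afterwards.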
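So, while I think the code I've written is both inefficient and offensive to good programming style, it does do what I want for several files. I'm trying to understand whether I'm simply missing an argument that I don't know about, whether I should pre-process the files somehow (I haven't identified which character is causing the trouble, but I do know that u'\x97' translates to some sort of control character), or whether there is some other option.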
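For the pre-processing question, a small sketch of how the offending characters could be located before deciding what to do with them. The function name find_non_ascii and the sample titles are hypothetical. U+0097 is a C1 control character (Unicode category Cc). The byte 0x97 is an em dash in Windows-1252, so one possible explanation (a guess, not something the traceback confirms) is that the text was originally authored in that encoding.

```python
import unicodedata

def find_non_ascii(values):
    """Return (value_index, char_index, code_point, category) for each non-ASCII character."""
    hits = []
    for i, value in enumerate(values):
        if value is None:  # elem.text is None for empty elements
            continue
        for j, ch in enumerate(value):
            if ord(ch) > 127:
                hits.append((i, j, "U+%04X" % ord(ch), unicodedata.category(ch)))
    return hits

# Hypothetical values, standing in for the titles list built in parseEO.
titles = [u"Plain title", u"Title \x97 subtitle"]
hits = find_non_ascii(titles)

# What the byte 0x97 would mean if the source text had been Windows-1252.
cp1252_guess = b"\x97".decode("cp1252")
```

Running a check like this over rins and titles would show which records trigger the error and whether the same character is responsible each time.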
Rather than type(s) == unicode, use isinstance(s, unicode). - Martijn Pieters