ElementTree's iter() method skips random elements

3
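I am trying to parse an XML file using the iterparse() and iter() functions of Python's ElementTree. Here is a link to the file on Google Drive: https://drive.google.com/file/d/0B_S2Z7quow3TMl9yUk51ZzZ5UW8/view?usp=sharing
The XML file summarizes data about court cases; it is divided into a series of elements with the tag "n-document", each containing sub-elements with data about a particular court case. I am trying to extract all of the docket descriptions. Here is a simplified version of the code: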
import numpy as np
import pandas as pd
import xml.etree.ElementTree as etree
import re
import csv

for event, elem in etree.iterparse("***fileName***", events=("start", "end")):
    if event == "start":
        if elem.tag == "docket.entry":
            for element in elem.iter():
                print(element.tag)
                if element.text is not None:
                    print(element.text)
                if element.tail is not None:
                    print(element.tail)
                    print("from tail")
    elem.clear()

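The problem is that in the first case (1613 HARVARD LIMITED PARTNERSHIP V. DISTRICT OF COLUMBIA ET AL), the docket description numbered 25 (they are numbered in descending order) is missing the text and tail of the element with the tag "gateway.image.link". Specifically, this is the output I get. I cancelled the build after about a second and scrolled to the top of the console.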
docket.entry
number.block
number
28
image.block
image.gateway.link
gateway.image.link
date
07/19/2007
docket.description
ORDER GRANTING DEFENDANTS' MOTION TO DISMISS AND DENYING PLAINTIFF'S MOTION FOR LEAVE TO FILE A SECOND AMENDED COMPLAINT. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1, ) (ENTERED: 07/19/2007)
docket.entry
number.block
number
27
image.block
image.gateway.link
gateway.image.link
date
07/19/2007
docket.description
MEMORANDUM OPINION. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1) MODIFIED ON 7/19/2007 (LCRWR1, ). (ENTERED: 07/19/2007)
docket.entry
number.block
number
26
image.block
image.gateway.link
gateway.image.link
date
03/31/2007
docket.description
MEMORANDUM ORDER GRANTING DEFENDANTS' MOTION
image.gateway.link
21
gateway.image.link
21
 TO STAY DISCOVERY PENDING RESOLUTION OF DEFENDANTS' DISPOSITIVE MOTION FILED BY PATRICK J. CANAVAN, PAUL E. WATERS. SIGNED BY JUDGE RICHARD W. ROBERTS ON 3/31/07. (LCRWR1) ADDITIONAL ATTACHMENT(S) ADDED ON 4/3/2007 (LCRWR1, ). (ENTERED: 04/02/2007)
from tail
docket.entry
number.block
number
25
image.block
image.gateway.link
gateway.image.link
date
11/15/2005
docket.description
RESPONSE TO DEFENDANTS' NOTICE OF COURT RULING IN RELATED CASE FILED BY 1613 HARVARD LIMITED PARTNERSHIP. (ATTACHMENTS: #
image.gateway.link
docket.entry
number.block
number
24
image.block
image.gateway.link
gateway.image.link
date
11/14/2005
docket.description
NOTIFICATION OF SUPPLEMENTAL AUTHORITY BY DISTRICT OF COLUMBIA, PATRICK J. CANAVAN, PAUL E. WATERS (ATTACHMENTS: #
image.gateway.link
1
gateway.image.link
1
)(MULLEN, MARTHA) (ENTERED: 11/14/2005)
from tail

In the second-to-last entry of the output, the one numbered 25, it reads:
25
image.block
image.gateway.link
gateway.image.link
date
11/15/2005
docket.description
RESPONSE TO DEFENDANTS' NOTICE OF COURT RULING IN RELATED CASE FILED BY 1613 HARVARD LIMITED PARTNERSHIP. (ATTACHMENTS: #
image.gateway.link

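The problem is that if you look at the XML file itself, you will see that right after "image.gateway.link" there is an element with the tag "gateway.image.link" that has both text and tail content, but for some reason iter() does not pick it up. The odd thing is that most of the other docket descriptions also have an "image.gateway.link" element immediately followed by a "gateway.image.link" element, as you can see in entry 24 (and every other entry), and iter() recognizes those but not this one. Here is an excerpt of the XML from the Google Drive document linked above: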
<?xml version="1.0" encoding="UTF-8" ?><n-extract-response>
<docket.entries.block><label>Entry #:</label><label>Date:</label><label>Description:</label><docket.entry><number.block><number>28</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|0450912204;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A1-280450912204" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|0450912204;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>07/19/2007</date><docket.description>ORDER GRANTING DEFENDANTS&apos; MOTION TO DISMISS AND DENYING PLAINTIFF&apos;S MOTION FOR LEAVE TO FILE A SECOND AMENDED COMPLAINT. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1, ) (ENTERED: 07/19/2007)</docket.description></docket.entry><docket.entry><number.block><number>27</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501909813;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A2-2704501909813" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501909813;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>07/19/2007</date><docket.description>MEMORANDUM OPINION. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1) MODIFIED ON 7/19/2007 (LCRWR1, ). 
(ENTERED: 07/19/2007)</docket.description></docket.entry><docket.entry><number.block><number>26</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501672579;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A4-2604501672579" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501672579;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>03/31/2007</date><docket.description>MEMORANDUM ORDER GRANTING DEFENDANTS&apos; MOTION<image.gateway.link casenumber="1:05CV00726" court="DCDCT-DW" image.id="godls|0450561212;court=DCDCT-DW;casenumber=1:05CV00726" item.type="ATTACHMENT" platform="ECF">21</image.gateway.link><gateway.image.link ID="B3-21-0450561212" casenumber="1:05CV00726" court="DCDCT-DW" item.type="ATTACHMENT" key="godls|0450561212;court=DCDCT-DW;casenumber=1:05CV00726" tlr-class="gateway-image-link" ttype="ECF">21</gateway.image.link> TO STAY DISCOVERY PENDING RESOLUTION OF DEFENDANTS&apos; DISPOSITIVE MOTION FILED BY PATRICK J. CANAVAN, PAUL E. WATERS. SIGNED BY JUDGE RICHARD W. ROBERTS ON 3/31/07. (LCRWR1) ADDITIONAL ATTACHMENT(S) ADDED ON 4/3/2007 (LCRWR1, ). 
(ENTERED: 04/02/2007)</docket.description></docket.entry><docket.entry><number.block><number>25</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501577842;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A6-2504501577842" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501577842;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>11/15/2005</date><docket.description>RESPONSE TO DEFENDANTS&apos; NOTICE OF COURT RULING IN RELATED CASE FILED BY 1613 HARVARD LIMITED PARTNERSHIP. (ATTACHMENTS: #<image.gateway.link casenumber="1:05CV00726" court="DCDCT-DW" image.id="godls|04511581037;court=DCDCT-DW;casenumber=1:05CV00726" item.type="ATTACHMENT" platform="ECF">1</image.gateway.link><gateway.image.link ID="B5-1-04511581037" casenumber="1:05CV00726" court="DCDCT-DW" item.type="ATTACHMENT" key="godls|04511581037;court=DCDCT-DW;casenumber=1:05CV00726" tlr-class="gateway-image-link" ttype="ECF">1</gateway.image.link> EXHIBIT 1 - NOTICE OF APPEAL)(WISE, RICHARD) (ENTERED: 11/15/2005)</docket.description></docket.entry><docket.entry><number.block><number>24</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501579104;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A8-2404501579104" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501579104;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>11/14/2005</date><docket.description>NOTIFICATION OF SUPPLEMENTAL AUTHORITY BY DISTRICT OF COLUMBIA, PATRICK J. CANAVAN, PAUL E. 
WATERS (ATTACHMENTS: #<image.gateway.link casenumber="1:05CV00726" court="DCDCT-DW" image.id="godls|04511577643;court=DCDCT-DW;casenumber=1:05CV00726" item.type="ATTACHMENT" platform="ECF">1</image.gateway.link><gateway.image.link ID="B7-1-04511577643" casenumber="1:05CV00726" court="DCDCT-DW" item.type="ATTACHMENT" key="godls|04511577643;court=DCDCT-DW;casenumber=1:05CV00726" tlr-class="gateway-image-link" ttype="ECF">1</gateway.image.link>)(MULLEN, MARTHA) (ENTERED: 11/14/2005)</docket.description></docket.entry></docket.entries.block>
</n-extract-response>

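When I run my Python script on exactly the excerpt pasted above, it does pick up the missing element. But when I run the script on the whole XML file, it does not, as shown earlier. Obviously the excerpt is missing many elements, but I do not see why that would affect iter(), since I have not split up any "docket.entry" element or its sub-elements, and that is what the for loop in my code processes each time (I think). The problem is not limited to entry 25: a few other extracted docket descriptions are also missing a sub-element, but I cannot identify any pattern, and I cannot even spot a difference between entries 25 and 24 that would cause it. Can anyone help?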

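Could you post the relevant part of the XML in the question? - Anand S Kumar
Mata figured out what I was doing wrong. Thanks for being willing to take a look, though! - blu
3 Answers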

1
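You are trying to process the element's children on the "start" event, but the way iterparse works does not guarantee that they have been read by then. The documentation has a note about this:

Note: iterparse() only guarantees that it has seen the ">" character of a starting tag when it emits a "start" event, so the attributes are defined, but the contents of the text and tail attributes are undefined at that point. The same applies to the element children; they may or may not be present. If you need a fully populated element, look for "end" events instead.

If you want to process an element's children, you need to do it on the "end" event; otherwise there is no guarantee that the element's content is available. The reason you get any content at all is explained here:

Note: The tree builder and the event generator are not necessarily synchronized; the latter usually lags behind a bit. This means that when you get a "start" event for an element, the builder may already have filled that element with content. You cannot rely on this, though: a "start" event should only be used to inspect the attributes, not the element contents. For more details, see this message.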
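
As a rough illustration of the advice above, here is a minimal sketch (Python 3 syntax) that walks a docket.entry only once its "end" event arrives. The SAMPLE bytes and the collect_entries helper are made up for this example; they only imitate the shape of the real file.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical stand-in for the real file: one docket.entry with mixed content.
SAMPLE = (
    b"<root><docket.entry><number>25</number>"
    b"<docket.description>RESPONSE (ATTACHMENTS: #"
    b"<image.gateway.link>1</image.gateway.link>"
    b"<gateway.image.link>1</gateway.image.link>"
    b" EXHIBIT 1)</docket.description></docket.entry></root>"
)


def collect_entries(source):
    """Return, for each docket.entry, a list of (tag, text, tail) tuples."""
    entries = []
    # Ask only for "end" events: by then the element and all of its
    # descendants, including their text and tail, are fully populated.
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "docket.entry":
            entries.append([(e.tag, e.text, e.tail) for e in elem.iter()])
    return entries


entries = collect_entries(io.BytesIO(SAMPLE))
for tag, text, tail in entries[0]:
    print(tag, repr(text), repr(tail))
```

With this arrangement the trailing text after the second link element shows up as that element's tail, which is the piece that went missing in the question.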

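Doesn't iter() handle the end events? Also, when I change 'if event == "start"' in the code above to 'if event == "end"', the output just shows "docket.entry" over and over, with an occasional "date" and "docket.description" tag. Is that because once the end tag of "docket.entry" has been reached, it no longer has any children to iterate over? - blu
That is probably because you call elem.clear() unconditionally, for every element. You should only do that once you no longer need the element and its contents; otherwise you clear the children before the end event of their parent is reached. iter() does not generate or consume events; it iterates over the element tree as it has been built in memory up to that point. - mata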
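
One possible way to apply this, sketched in Python 3 under the assumption that only whole docket.entry elements are of interest: clear an element only after its own "end" event has been handled. The SAMPLE data, the "link" tag, and the descriptions helper are invented for the illustration.

```python
import io
import xml.etree.ElementTree as ET

# Hypothetical sample with two entries; the first has a nested child element.
SAMPLE = (
    b"<root>"
    b"<docket.entry><number>25</number>"
    b"<docket.description>A<link>1</link> B</docket.description></docket.entry>"
    b"<docket.entry><number>24</number>"
    b"<docket.description>C</docket.description></docket.entry>"
    b"</root>"
)


def descriptions(source):
    """Yield (number, full description text) for each docket.entry."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag != "docket.entry":
            continue  # leave children intact until their parent entry ends
        number = elem.findtext(".//number")
        desc = elem.find("docket.description")
        # itertext() walks text and tail of every descendant in document order.
        text = "".join(desc.itertext()) if desc is not None else ""
        yield number, text
        elem.clear()  # safe now: this entry has been fully processed


result = list(descriptions(io.BytesIO(SAMPLE)))
print(result)
```

Clearing only at the entry level keeps memory use low on a large file while leaving each entry's children in place until they have been read.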

0

getchildren has been deprecated since version 2.7: use list(elem) or iteration instead.
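
A small sketch of the two replacements, in Python 3 (where getchildren() was removed in version 3.9). The XML string is a made-up fragment shaped like a docket.description from the question.

```python
import xml.etree.ElementTree as ET

# Hypothetical fragment with two child elements inside mixed content.
desc = ET.fromstring(
    "<docket.description>A<image.gateway.link>1</image.gateway.link>"
    "<gateway.image.link>1</gateway.image.link> B</docket.description>"
)

# list(elem) gives the direct children, like the old getchildren() did.
children = list(desc)

# Iterating the element directly visits the same direct children.
tags = [child.tag for child in desc]

print(len(children), tags)
```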


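This does not provide an answer to the question. To critique or request clarification from an author, leave a comment below their post - you can always comment on your own posts, and once you have sufficient reputation you will be able to comment on any post. - From Review - SidOfc
I don't have enough "reputation" to comment on Kaneg's post. That's why I added an answer... - Nado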

-1
Perhaps you could parse the XML file following its logical structure instead, so that you have precise control over each element. For example:
import xml.etree.ElementTree as ET

tree = ET.parse(r'<xml file name>')
root = tree.getroot()
docket_entries = root.findall('.//docket.entry')
for entry in docket_entries:
    number = entry.find('.//number')
    print(number.text)
    description = entry.find('docket.description')
    print(description.text)
    # getchildren() only exists before Python 3.9; list(description) is equivalent
    for child in description.getchildren():
        print(child)

iterparse is meant for parsing large XML documents iteratively, without having to keep the whole tree in memory as your solution does. - mata
