I am trying to parse an XML file using the iterparse() and iter() functions of ElementTree in Python. Here is a link to the file on Google Drive: https://drive.google.com/file/d/0B_S2Z7quow3TMl9yUk51ZzZ5UW8/view?usp=sharing
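The XML file aggregates data about court cases. It is divided into a series of elements tagged "n-document", each of which contains child elements with data about one particular court case. I am trying to extract all the docket descriptions. Here is a simplified version of the code: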
import numpy as np
import pandas as pd
import xml.etree.ElementTree as etree
import re
import csv

for event, elem in etree.iterparse("***fileName***", events=("start", "end")):
    if event == "start":
        if elem.tag == "docket.entry":
            for element in elem.iter():
                print(element.tag)
                if element.text is not None:
                    print(element.text)
                if element.tail is not None:
                    print(element.tail)
                    print("from tail")
    elem.clear()
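The problem is that in the first case (1613 HARVARD LIMITED PARTNERSHIP V. DISTRICT OF COLUMBIA ET AL), the docket description for entry 25 (entries are numbered in descending order) is missing the text and tail of the element tagged "gateway.image.link". Specifically, this is the output I get. I cancelled the build after about a second and scrolled to the top of the console.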
docket.entry
number.block
number
28
image.block
image.gateway.link
gateway.image.link
date
07/19/2007
docket.description
ORDER GRANTING DEFENDANTS' MOTION TO DISMISS AND DENYING PLAINTIFF'S MOTION FOR LEAVE TO FILE A SECOND AMENDED COMPLAINT. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1, ) (ENTERED: 07/19/2007)
docket.entry
number.block
number
27
image.block
image.gateway.link
gateway.image.link
date
07/19/2007
docket.description
MEMORANDUM OPINION. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1) MODIFIED ON 7/19/2007 (LCRWR1, ). (ENTERED: 07/19/2007)
docket.entry
number.block
number
26
image.block
image.gateway.link
gateway.image.link
date
03/31/2007
docket.description
MEMORANDUM ORDER GRANTING DEFENDANTS' MOTION
image.gateway.link
21
gateway.image.link
21
TO STAY DISCOVERY PENDING RESOLUTION OF DEFENDANTS' DISPOSITIVE MOTION FILED BY PATRICK J. CANAVAN, PAUL E. WATERS. SIGNED BY JUDGE RICHARD W. ROBERTS ON 3/31/07. (LCRWR1) ADDITIONAL ATTACHMENT(S) ADDED ON 4/3/2007 (LCRWR1, ). (ENTERED: 04/02/2007)
from tail
docket.entry
number.block
number
25
image.block
image.gateway.link
gateway.image.link
date
11/15/2005
docket.description
RESPONSE TO DEFENDANTS' NOTICE OF COURT RULING IN RELATED CASE FILED BY 1613 HARVARD LIMITED PARTNERSHIP. (ATTACHMENTS: #
image.gateway.link
docket.entry
number.block
number
24
image.block
image.gateway.link
gateway.image.link
date
11/14/2005
docket.description
NOTIFICATION OF SUPPLEMENTAL AUTHORITY BY DISTRICT OF COLUMBIA, PATRICK J. CANAVAN, PAUL E. WATERS (ATTACHMENTS: #
image.gateway.link
1
gateway.image.link
1
)(MULLEN, MARTHA) (ENTERED: 11/14/2005)
from tail
In the second-to-last entry of the output, number 25, it reads:
25
image.block
image.gateway.link
gateway.image.link
date
11/15/2005
docket.description
RESPONSE TO DEFENDANTS' NOTICE OF COURT RULING IN RELATED CASE FILED BY 1613 HARVARD LIMITED PARTNERSHIP. (ATTACHMENTS: #
image.gateway.link
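However, if you look at the XML file itself, you will see that the element tagged "image.gateway.link" is immediately followed by an element tagged "gateway.image.link" that has both text and a tail, yet for some reason iter() does not pick it up. Strangely, most other docket descriptions also have an "image.gateway.link" element immediately followed by a "gateway.image.link" element, as you can see in entry 24 (and all the others), and iter() picks those up but not this one. Here is the relevant excerpt of XML from the Google Drive document linked above: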
<?xml version="1.0" encoding="UTF-8" ?><n-extract-response>
<docket.entries.block><label>Entry #:</label><label>Date:</label><label>Description:</label><docket.entry><number.block><number>28</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|0450912204;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A1-280450912204" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|0450912204;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>07/19/2007</date><docket.description>ORDER GRANTING DEFENDANTS' MOTION TO DISMISS AND DENYING PLAINTIFF'S MOTION FOR LEAVE TO FILE A SECOND AMENDED COMPLAINT. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1, ) (ENTERED: 07/19/2007)</docket.description></docket.entry><docket.entry><number.block><number>27</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501909813;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A2-2704501909813" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501909813;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>07/19/2007</date><docket.description>MEMORANDUM OPINION. SIGNED BY JUDGE RICHARD W. ROBERTS ON 7/19/07. (LCRWR1) MODIFIED ON 7/19/2007 (LCRWR1, ). 
(ENTERED: 07/19/2007)</docket.description></docket.entry><docket.entry><number.block><number>26</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501672579;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A4-2604501672579" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501672579;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>03/31/2007</date><docket.description>MEMORANDUM ORDER GRANTING DEFENDANTS' MOTION<image.gateway.link casenumber="1:05CV00726" court="DCDCT-DW" image.id="godls|0450561212;court=DCDCT-DW;casenumber=1:05CV00726" item.type="ATTACHMENT" platform="ECF">21</image.gateway.link><gateway.image.link ID="B3-21-0450561212" casenumber="1:05CV00726" court="DCDCT-DW" item.type="ATTACHMENT" key="godls|0450561212;court=DCDCT-DW;casenumber=1:05CV00726" tlr-class="gateway-image-link" ttype="ECF">21</gateway.image.link> TO STAY DISCOVERY PENDING RESOLUTION OF DEFENDANTS' DISPOSITIVE MOTION FILED BY PATRICK J. CANAVAN, PAUL E. WATERS. SIGNED BY JUDGE RICHARD W. ROBERTS ON 3/31/07. (LCRWR1) ADDITIONAL ATTACHMENT(S) ADDED ON 4/3/2007 (LCRWR1, ). (ENTERED: 04/02/2007)</docket.description></docket.entry><docket.entry><number.block><number>25</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501577842;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A6-2504501577842" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501577842;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>11/15/2005</date><docket.description>RESPONSE TO DEFENDANTS' NOTICE OF COURT RULING IN RELATED CASE FILED BY 1613 HARVARD LIMITED PARTNERSHIP. 
(ATTACHMENTS: #<image.gateway.link casenumber="1:05CV00726" court="DCDCT-DW" image.id="godls|04511581037;court=DCDCT-DW;casenumber=1:05CV00726" item.type="ATTACHMENT" platform="ECF">1</image.gateway.link><gateway.image.link ID="B5-1-04511581037" casenumber="1:05CV00726" court="DCDCT-DW" item.type="ATTACHMENT" key="godls|04511581037;court=DCDCT-DW;casenumber=1:05CV00726" tlr-class="gateway-image-link" ttype="ECF">1</gateway.image.link> EXHIBIT 1 - NOTICE OF APPEAL)(WISE, RICHARD) (ENTERED: 11/15/2005)</docket.description></docket.entry><docket.entry><number.block><number>24</number><image.block><image.gateway.link casenumber="1:05cv00726" court="DCDCT-DW" image.ID="godls|04501579104;court=DCDCT-DW;casenumber=1:05cv00726" item.type="main" platform="ecf"></image.gateway.link><gateway.image.link ID="A8-2404501579104" casenumber="1:05cv00726" court="DCDCT-DW" item.type="main" key="godls|04501579104;court=DCDCT-DW;casenumber=1:05cv00726" tlr-class="gateway-image-link" ttype="ecf"></gateway.image.link></image.block></number.block><date>11/14/2005</date><docket.description>NOTIFICATION OF SUPPLEMENTAL AUTHORITY BY DISTRICT OF COLUMBIA, PATRICK J. CANAVAN, PAUL E. WATERS (ATTACHMENTS: #<image.gateway.link casenumber="1:05CV00726" court="DCDCT-DW" image.id="godls|04511577643;court=DCDCT-DW;casenumber=1:05CV00726" item.type="ATTACHMENT" platform="ECF">1</image.gateway.link><gateway.image.link ID="B7-1-04511577643" casenumber="1:05CV00726" court="DCDCT-DW" item.type="ATTACHMENT" key="godls|04511577643;court=DCDCT-DW;casenumber=1:05CV00726" tlr-class="gateway-image-link" ttype="ECF">1</gateway.image.link>)(MULLEN, MARTHA) (ENTERED: 11/14/2005)</docket.description></docket.entry></docket.entries.block>
</n-extract-response>
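When I run my Python script on exactly the excerpt above, it picks up the missing elements. But when I run it on the whole XML file, it does not, as shown earlier. Obviously the excerpt lacks many of the file's elements, but I do not understand why that would affect iter(), because I have not split up the "docket.entry" element or its children, and that is what the for loop in my code processes on each pass (I think). The problem is not limited to entry 25: some other extracted docket descriptions are also missing a child element, but I cannot identify any pattern. I cannot even tell what difference between entries 25 and 24 causes the problem. Can anyone help?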
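
For comparison, here is a minimal sketch that handles "docket.entry" on "end" events instead of "start". The ElementTree documentation notes that at a "start" event only the attributes are guaranteed, while text, tail, and children may or may not be present yet. That would be consistent with a short excerpt behaving differently from the full file, though it has not been confirmed against the real data. The function name iter_docket_entries and the SAMPLE document are hypothetical, made up for illustration; SAMPLE is not taken from the Google Drive file.

```python
import io
import xml.etree.ElementTree as etree


def iter_docket_entries(source):
    """Yield one list of (tag, text, tail) tuples per docket.entry element.

    `source` is a filename or a binary file object (hypothetical helper).
    """
    for event, elem in etree.iterparse(source, events=("end",)):
        if elem.tag == "docket.entry":
            # On "end" the closing tag has been read, so the subtree is complete.
            yield [(e.tag, e.text, e.tail) for e in elem.iter()]
            # Release the children only after the entry has been processed.
            elem.clear()


# Hypothetical miniature of the docket XML, for illustration only.
SAMPLE = (
    b"<docket.entries.block><docket.entry><number>25</number>"
    b"<docket.description>RESPONSE (ATTACHMENTS: #"
    b"<image.gateway.link>1</image.gateway.link>"
    b"<gateway.image.link>1</gateway.image.link>"
    b" EXHIBIT 1)</docket.description></docket.entry></docket.entries.block>"
)

entries = list(iter_docket_entries(io.BytesIO(SAMPLE)))
for tag, text, tail in entries[0]:
    print(tag, repr(text), repr(tail))
```

In this sketch elem.clear() runs only once an entry has been fully read, so memory use should stay bounded on a large file without discarding children that have not been visited yet. The tail of the "docket.entry" element itself is not reliable at its own "end" event, since the parser has not necessarily read past the closing tag.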