如何在Python中从URL读取XML文件?

8

我希望能够访问子节点中的信息。这是因为文件结构的原因吗?

尝试从文件中单独提取作者子节点信息并运行Python代码。那样可以正常工作。

import urllib
import xml.etree.ElementTree as ET

url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'

print 'Retrieving', url

document = urllib.urlopen (url).read()
print 'Retrieved', len(document), 'characters.'

print document[:50]

tree = ET.fromstring(document)

lst = tree.findall('title')
print lst[:100]

提供的答案有什么进展了吗? - iLuvLogix
3个回答

5

由于命名空间的原因,您找不到标题元素。

以下是查找标题的示例代码:

  • 从“document”标记中获取标题
  • 从内部“component”标记中获取标题
    import xml.etree.ElementTree as ET
    import urllib.request

    url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'
    response = urllib.request.urlopen(url).read()
    tree = ET.fromstring(response)


    for docTitle in tree.findall('{urn:hl7-org:v3}title'):
        print(docTitle.text)

    for compTitle in tree.findall('.//{urn:hl7-org:v3}title'):
        print(compTitle.text)

更新

如果您需要搜索XML节点,则应使用xPath表达式

示例:

NS = '{urn:hl7-org:v3}'
ID = '829076996'    # ID TO BE FOUND

# XPATH TO FIND AUTHORS BY ID (search ID and return related author node)
xPathAuthorById = ''.join([
    ".//",
    NS, "author/",
    NS, "assignedEntity/",
    NS, "representedOrganization/",
    NS, "id[@extension='", ID,
    "']/../../.."
    ])

# XPATH TO FIND AUTHOR NAME ELEMENT
xPathAuthorName = ''.join([
    "./",
    NS, "assignedEntity/",
    NS, "representedOrganization/",
    NS, "name"
    ])

# FOR EACH AUTHOR FOUND, SEARCH ATTRIBUTES (example name)
for author in tree.findall(xPathAuthorById):
    name = author.find(xPathAuthorName)
    print(name.text)

这个例子打印了ID 829076996的作者名称。

更新2

你可以使用findall方法轻松处理所有assignedEntity标签。 对于每个标签,您可以拥有多个产品,因此需要另一个findall方法(请参见下面的示例)。

xPathAssignedEntities = ''.join([
    ".//",
    NS, "author/",
    NS, "assignedEntity/",
    NS, "representedOrganization/",
    NS, "assignedEntity/", 
    NS, "assignedOrganization/", 
    NS, "assignedEntity"
    ])

xPathProdCode = ''.join([
    NS, "actDefinition/",
    NS, "product/",
    NS, "manufacturedProduct/",
    NS, "manufacturedMaterialKind/",
    NS, "code"
    ])


# GET ALL assignedEntity TAGS
for assignedEntity in tree.findall(xPathAssignedEntities):

    # GET ID AND NAME OF assignedEntity
    id = assignedEntity.find(NS + 'assignedOrganization/'+ NS + 'id').get('extension')
    name = assignedEntity.find(NS + 'assignedOrganization/' + NS + 'name').text

    # FOR EACH assignedEntity WE CAN HAVE MULTIPLE <performance> TAGS
    for performance in assignedEntity.findall(NS + 'performance'):
        actCode = performance.find(NS + 'actDefinition/'+ NS + 'code').get('displayName')
        prodCode = performance.find(xPathProdCode).get('code')
        print(id, '\t', name, '\t', actCode, '\t', prodCode)

这是结果:
829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-0050 
829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-4900 
829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-4910 
829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-4940 
829084545    Pfizer Pharmaceuticals LLC      ANALYSIS    0049-4960 
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-0050
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-4900
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-4910
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-4940
829084545    Pfizer Pharmaceuticals LLC      API MANUFACTURE     0049-4960
829084545    Pfizer Pharmaceuticals LLC      MANUFACTURE     0049-4900 
829084545    Pfizer Pharmaceuticals LLC      MANUFACTURE     0049-4910 
829084545    Pfizer Pharmaceuticals LLC      MANUFACTURE     0049-4960 
829084545    Pfizer Pharmaceuticals LLC      PACK    0049-4900 
829084545    Pfizer Pharmaceuticals LLC      PACK    0049-4910 
829084545    Pfizer Pharmaceuticals LLC      PACK    0049-4960 
618054084    Pharmacia and Upjohn Company LLC    ANALYSIS    0049-0050
618054084    Pharmacia and Upjohn Company LLC    ANALYSIS    0049-4940
829084552    Pfizer Pharmaceuticals LLC      PACK    0049-4900 
829084552    Pfizer Pharmaceuticals LLC      PACK    0049-4910 
829084552    Pfizer Pharmaceuticals LLC      PACK    0049-4960

嘿,太棒了。有没有办法查找子节点<author>下的属性,例如<id ="089153071">? 此外,我想访问特定节点中的其他信息,比如只从<author/assignedEntity>子节点中获取<name>。 - PANKAJ KUMAR
嗨,我刚刚更新了代码,并提供了一个简单的示例,演示如何使用xPath搜索XML。希望你会觉得有用。 - manuel_b
谢谢Klaud。我尝试修改xPath以获取更多的子节点,但是我无法检索到该特定子节点(assignedOrganization)的名称。我想要实现的是这种格式的输出: 输出的第一行(618054084,Pharmacia and Upjohn Company LLC,ANALYSIS,0049-4940)......输出的第二行(829084552,Pfizer Pharmaceuticals LLC,PACK,0049-4900)。感谢您的帮助。 - PANKAJ KUMAR
嗨PANKAJ KUMAR,我试图进一步解释xPath和find/findall方法的使用,并提供了一个例子。希望对你有所帮助。最好的祝福! - manuel_b
嘿,Klaud.. 这太棒了。非常感谢你为帮助我解决这个问题所做的一切努力。感激不尽。 谢谢伙计。 - PANKAJ KUMAR

4
你可以使用xmltodict来从请求的XML数据生成Python字典。
以下是一个基本示例:
import urllib2
import xmltodict

def foobar(request):
    file = urllib2.urlopen('https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml')
    data = file.read()
    file.close()

    data = xmltodict.parse(data)
    return {'xmldata': data}

3

我通常更喜欢使用beautifulsouplxml解析器来解析xml。

以下是示例代码:

import requests
from bs4 import BeautifulSoup

url = 'https://dailymed.nlm.nih.gov/dailymed/services/v2/spls/fe9e8b7d-61ea-409d-84aa-3ebd79a046b5.xml'

document = requests.get(url)

soup= BeautifulSoup(document.content,"lxml-xml")
print (soup.find("title"))

输出

<title>These highlights do not include all the information needed to use ZOLOFT safely and effectively. See full prescribing information for ZOLOFT. <br/>
<br/>ZOLOFT (sertraline hydrochloride) tablets, for oral use <br/>ZOLOFT (sertraline hydrochloride) oral solution <br/>Initial U.S. Approval: 1991</title>

您可以使用beautifulsoup提供的方法,如findfind_all来查找相应的节点或子节点。


网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接