How do I set the xpath parameter of pandas' read_xml function?

5
I want to parse data from an XML file. The relevant part of the file is shown below; each row is a Component element.
<Component>
  <UnderlyingSecurityID>300001</UnderlyingSecurityID>
  <UnderlyingSecurityIDSource>102</UnderlyingSecurityIDSource>
  <UnderlyingSymbol>特锐德</UnderlyingSymbol>
  <ComponentShare>300.00</ComponentShare>
  <SubstituteFlag>1</SubstituteFlag>
  <PremiumRatio>0.25000</PremiumRatio>
  <CreationCashSubstitute>0.0000</CreationCashSubstitute>
  <RedemptionCashSubstitute>0.0000</RedemptionCashSubstitute>
</Component>
<Component>
  <UnderlyingSecurityID>300003</UnderlyingSecurityID>
  <UnderlyingSecurityIDSource>102</UnderlyingSecurityIDSource>
  <UnderlyingSymbol>乐普医疗</UnderlyingSymbol>
  <ComponentShare>600.00</ComponentShare>
  <SubstituteFlag>1</SubstituteFlag>
  <PremiumRatio>0.25000</PremiumRatio>
  <CreationCashSubstitute>0.0000</CreationCashSubstitute>
  <RedemptionCashSubstitute>0.0000</RedemptionCashSubstitute>
</Component>

I have installed the latest versions of lxml and pandas and tried the following code, without success.

Python 3.9.4 (tags/v3.9.4:1f2e308, Apr  6 2021, 13:40:21) [MSC v.1928 64 bit (AMD64)]
Type 'copyright', 'credits' or 'license' for more information
IPython 7.25.0 -- An enhanced Interactive Python. Type '?' for help.

In [1]: import pandas as pd

In [2]: pd.__version__
Out[2]: '1.3.0'

In [3]: xml = pd.read_xml('https://www.huaan.com.cn/etf/159949/etffiledownload.jsp?etffilename=pcf_159949_20210707.xml', xpath='//component')
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-3-67d228028cc9> in <module>
----> 1 xml = pd.read_xml('https://www.huaan.com.cn/etf/159949/etffiledownload.jsp?etffilename=pcf_159949_20210707.xml', xpath='//component')

...
    501         if elems == []:
--> 502             raise ValueError(msg)
    503 
    504         if elems != [] and attrs == [] and children == []:

ValueError: xpath does not return any nodes. Be sure row level nodes are in xpath. If document uses namespaces denoted with xmlns, be sure to define namespaces and use them in xpath.

In [4]: xml = pd.read_xml('https://www.huaan.com.cn/etf/159949/etffiledownload.jsp?etffilename=pcf_159949_20210707.xml', xpath='//component', namespaces={'com': 'http://ts.szse.cn/Fund'})
---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
<ipython-input-4-52fbe542dadb> in <module>
----> 1 xml = pd.read_xml('https://www.huaan.com.cn/etf/159949/etffiledownload.jsp?etffilename=pcf_159949_20210707.xml', xpath='//component', namespaces={'com': 'http://ts.szse.cn/Fund'})

...
    501         if elems == []:
--> 502             raise ValueError(msg)
    503 
    504         if elems != [] and attrs == [] and children == []:

ValueError: xpath does not return any nodes. Be sure row level nodes are in xpath. If document uses namespaces denoted with xmlns, be sure to define namespaces and use them in xpath.

I also tried using lxml directly, and that seems to work:
In [5]: from lxml import etree
In [6]: import requests
In [7]: content = requests.get('https://www.huaan.com.cn/etf/159949/etffiledownload.jsp?etffilename=pcf_159949_20210707.xml').content

In [8]: html = etree.HTML(content)
In [9]: html.xpath('//component')
Out[9]: 
[<Element component at 0x1d493cb23c0>,
 <Element component at 0x1d493cb2340>,
 <Element component at 0x1d493cb2240>,
 <Element component at 0x1d493cb22c0>,
 <Element component at 0x1d493cb2140>,
 <Element component at 0x1d493cb2040>,
 <Element component at 0x1d493cb2c40>,
 <Element component at 0x1d493cb61c0>,
 <Element component at 0x1d493cb63c0>,
 <Element component at 0x1d493cb2200>,
 ...

I don't know why read_xml fails here. Any help would be appreciated!


1
I'm not sure, but I think you forgot the dot in the xpath --> './/Component' - coco18
It doesn't work. @coco18 - esse
3
Try calling it with a namespace: pd.read_xml(file_path, xpath=".//doc:Component", namespaces={"doc":"http://ts.szse.cn/Fund"}) - sammywemmy
2
@sammywemmy Thanks, this works. I found that the leading dot doesn't matter, but the xpath is case sensitive. - esse
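
The case-sensitivity remark also explains why the lxml attempt in the question appeared to work. The following is a minimal sketch on a cut-down inline sample, not the real file: the root element name PCFFile is made up for illustration, while the namespace URI and the Components/Component tags are taken from this thread. lxml's HTML parser lowercases tag names and does not treat xmlns as a namespace declaration, whereas the XML parser (which read_xml uses by default) keeps both the case and the namespace.

```python
from lxml import etree

# Cut-down sample; "PCFFile" is a hypothetical root name, not taken from the real file.
SAMPLE = """<PCFFile xmlns="http://ts.szse.cn/Fund">
  <Components>
    <Component><UnderlyingSecurityID>300001</UnderlyingSecurityID></Component>
    <Component><UnderlyingSecurityID>300003</UnderlyingSecurityID></Component>
  </Components>
</PCFFile>"""

# The prefix "doc" is arbitrary; only the URI has to match the document.
NS = {"doc": "http://ts.szse.cn/Fund"}

# HTML parser: tag names are lowercased and the default namespace is ignored.
html_root = etree.HTML(SAMPLE)
html_hits = html_root.xpath("//component")

# XML parser: case and namespace are both preserved.
xml_root = etree.fromstring(SAMPLE)
lower_hits = xml_root.xpath("//component")   # wrong case, no namespace
bare_hits = xml_root.xpath("//Component")    # right case, no namespace
ns_hits = xml_root.xpath("//doc:Component", namespaces=NS)

print(len(html_hits), len(lower_hits), len(bare_hits), len(ns_hits))
```

Under these assumptions only the HTML-parsed tree and the namespace-qualified XPath find the two elements, which matches the behaviour reported in the question and the comments.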
2 Answers

1
So, in short, the solution here is to identify the node you want, which is Component (case sensitive), and set the xpath as follows, with a leading //
pd.read_xml(your_xml_file, xpath='//Component')
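
As the comment thread suggests, the file appears to declare the default namespace http://ts.szse.cn/Fund, so an unprefixed XPath may still return no nodes even with the correct case. Below is a hedged sketch of the namespace-qualified call on an inline sample; the root name PCFFile and the trimmed set of child elements are assumptions for illustration, and the prefix doc is arbitrary.

```python
from io import StringIO

import pandas as pd

# Inline stand-in for the downloaded file; "PCFFile" is a hypothetical root name.
SAMPLE = """<PCFFile xmlns="http://ts.szse.cn/Fund">
  <Components>
    <Component>
      <UnderlyingSecurityID>300001</UnderlyingSecurityID>
      <UnderlyingSymbol>特锐德</UnderlyingSymbol>
      <ComponentShare>300.00</ComponentShare>
    </Component>
    <Component>
      <UnderlyingSecurityID>300003</UnderlyingSecurityID>
      <UnderlyingSymbol>乐普医疗</UnderlyingSymbol>
      <ComponentShare>600.00</ComponentShare>
    </Component>
  </Components>
</PCFFile>"""

# Map an arbitrary prefix to the document's default namespace URI.
NS = {"doc": "http://ts.szse.cn/Fund"}

# Correct case plus namespace prefix: one DataFrame row per <Component>.
df = pd.read_xml(StringIO(SAMPLE), xpath="//doc:Component", namespaces=NS)
print(df)

# Wrong case: the XPath matches nothing, so read_xml raises ValueError.
try:
    pd.read_xml(StringIO(SAMPLE), xpath="//doc:component", namespaces=NS)
    wrong_case_failed = False
except ValueError:
    wrong_case_failed = True
print("lowercase xpath failed:", wrong_case_failed)
```

For the real file, the same xpath and namespaces arguments can be passed together with the URL from the question, as in the comment by sammywemmy.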

1
You can use xml.etree.ElementTree instead of pd.read_xml():
import xml.etree.ElementTree as ET
import pandas as pd
import requests

url = 'https://www.huaan.com.cn/etf/159949/etffiledownload.jsp?etffilename=pcf_159949_20210707.xml'
response = requests.get(url)
# ET.fromstring already returns the root element, so no ElementTree wrapper is needed
root = ET.fromstring(response.content)

namespace = "{http://ts.szse.cn/Fund}"

columns =['UnderlyingSecurityID', 'UnderlyingSecurityIDSource', 'UnderlyingSymbol', 'ComponentShare', 'SubstituteFlag', 'PremiumRatio','CreationCashSubstitute', 'RedemptionCashSubstitute']

data = []
for elem in root:
    # Only the <Components> container holds the rows we want
    if elem.tag == f"{namespace}Components":
        for com in elem.findall(f"{namespace}Component"):
            # One row per <Component>: the text of each child, in document order
            data.append([val.text for val in com])

df = pd.DataFrame(data, columns=columns)
print(df.to_string())
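
The loop above relies on the children of each Component appearing in the same order as the hard-coded columns list. A variant sketch that keys each value by its tag name instead, using ElementTree's namespace-map argument, is shown here on an inline sample (the root name PCFFile and the reduced set of children are assumptions for illustration; the prefix doc is arbitrary).

```python
import xml.etree.ElementTree as ET

# Inline stand-in for response.content; "PCFFile" is a hypothetical root name.
SAMPLE = """<PCFFile xmlns="http://ts.szse.cn/Fund">
  <Components>
    <Component>
      <UnderlyingSecurityID>300001</UnderlyingSecurityID>
      <UnderlyingSymbol>特锐德</UnderlyingSymbol>
    </Component>
    <Component>
      <UnderlyingSecurityID>300003</UnderlyingSecurityID>
      <UnderlyingSymbol>乐普医疗</UnderlyingSymbol>
    </Component>
  </Components>
</PCFFile>"""

# Prefix-to-URI map, so paths can be written without "{uri}" strings.
NS = {"doc": "http://ts.szse.cn/Fund"}

root = ET.fromstring(SAMPLE)

rows = []
for com in root.iterfind(".//doc:Component", NS):
    # Strip the "{uri}" part of each child tag and use the local name as the key.
    rows.append({child.tag.split("}", 1)[-1]: child.text for child in com})

print(rows)
```

Passing rows to pd.DataFrame would give one column per tag name. All values remain strings in this sketch, so numeric columns would still need converting.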
