嵌套XML转换为Pandas数据框

3
我正在尝试创建一个脚本,将嵌套的XML文件转换为Pandas数据帧。我找到了这篇文章https://medium.com/@robertopreste/from-xml-to-pandas-dataframes-9292980b1c1c,它很好地完成了到第二层(父级、子级)的工作,但我不知道如何深入到更深的层次(例如孙级),也不知道如何获取子级的属性(例如“neighbor” -> “name”)。以下是我的XML结构:
<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
            <neighbor2 name="Italy" direction="S"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>

这是我的代码:

import pandas as pd
import xml.etree.ElementTree as et

def parse_XML(xml_file, df_cols): 

    xtree = et.parse(xml_file)
    xroot = xtree.getroot()
    rows = []

    for node in xroot: 
        res = []
        res.append(node.attrib.get(df_cols[0]))
        for el in df_cols[1:]: 
            if node is not None and node.find(el) is not None:
                res.append(node.find(el).text)
            else: 
                res.append(None)
        rows.append({df_cols[i]: res[i] 
                     for i, _ in enumerate(df_cols)})

    out_df = pd.DataFrame(rows, columns=df_cols)

    return out_df

xml_file= "example.xml"
df_cols = ["name","year","direction"]

out_df=parse_XML(xml_file, df_cols)
out_df

我希望得到的结构如下所示:
| name          | year | neighbor name 1 | neighbor direction 1 | neighbor2 name 1 |
|---------------|------|-----------------|----------------------|------------------|
| Liechtenstein | 2008 | Austria         | E                    | Italy            |
|               |      |                 |                      |                  |
|               |      |                 |                      |                  |

结构需要尽可能灵活,这样在与不同文件一起使用时只需要进行少量编辑。我将获得具有不同数据结构的XML文件,因此希望每次都能进行最少的编辑。谢谢!
2个回答

10
我已经为类似的用例制作了一个包。它在这里也可以使用。
pip install pandas_read_xml

你可以这样做

import pandas_read_xml as pdx

df = pdx.read_xml('filename.xml', ['data'])

为了使某个数据扁平化,可以:
df = pdx.flatten(df)

或者
df = pdx.fully_flatten(df)

看起来是一个相当有用的图书馆 - 希望能够继续发展下去! - anakaine
啊,这让XML变得简单多了。fully_flatten 应该是原生的pandas! - lys

6
你需要一个递归函数来展平行,以及处理重复数据的机制。
这很混乱,取决于数据和嵌套程度,你可能会得到相当奇怪的数据框架。
import xml.etree.ElementTree as et
from collections import defaultdict
import pandas as pd


def flatten_xml(node, key_prefix=()):
    """
    Walk an XML node, generating tuples of key parts and values.
    """

    # Copy tag content if any
    text = (node.text or '').strip()
    if text:
        yield key_prefix, text

    # Copy attributes
    for attr, value in node.items():
        yield key_prefix + (attr,), value

    # Recurse into children
    for child in node:
        yield from flatten_xml(child, key_prefix + (child.tag,))


def dictify_key_pairs(pairs, key_sep='-'):
    """
    Dictify key pairs from flatten_xml, taking care of duplicate keys.
    """
    out = {}

    # Group by candidate key.
    key_map = defaultdict(list)
    for key_parts, value in pairs:
        key_map[key_sep.join(key_parts)].append(value)

    # Figure out the final dict with suffixes if required.
    for key, values in key_map.items():
        if len(values) == 1:  # No need to suffix keys.
            out[key] = values[0]
        else:  # More than one value for this key.
            for suffix, value in enumerate(values, 1):
                out[f'{key}{key_sep}{suffix}'] = value

    return out


# Parse XML with etree
tree = et.XML("""<?xml version="1.0"?>
<data>
    <country name="Liechtenstein">
        <rank>1</rank>
        <year>2008</year>
        <gdppc>141100</gdppc>
        <neighbor name="Austria" direction="E"/>
        <neighbor name="Switzerland" direction="W"/>
        <neighbor2 name="Italy" direction="S"/>
    </country>
    <country name="Singapore">
        <rank>4</rank>
        <year>2011</year>
        <gdppc>59900</gdppc>
        <neighbor name="Malaysia" direction="N"/>
        <cities>
            <city name="Chargin" population="1234" />
            <city name="Firin" population="4567" />
        </cities>
    </country>
    <country name="Panama">
        <rank>68</rank>
        <year>2011</year>
        <gdppc>13600</gdppc>
        <neighbor name="Costa Rica" direction="W"/>
        <neighbor name="Colombia" direction="E"/>
    </country>
</data>
""")

# Generate flat rows out of the root nodes in the tree
rows = [dictify_key_pairs(flatten_xml(row)) for row in tree]
df = pd.DataFrame(rows)
print(df)

输出

            name rank  year   gdppc neighbor-name-1 neighbor-name-2 neighbor-direction-1 neighbor-direction-2 neighbor2-name neighbor2-direction neighbor-name neighbor-direction cities-city-name-1 cities-city-name-2 cities-city-population-1 cities-city-population-2
0  Liechtenstein    1  2008  141100         Austria     Switzerland                    E                    W          Italy                   S           NaN                NaN                NaN                NaN                      NaN                      NaN
1      Singapore    4  2011   59900             NaN             NaN                  NaN                  NaN            NaN                 NaN      Malaysia                  N            Chargin              Firin                     1234                     4567
2         Panama   68  2011   13600      Costa Rica        Colombia                    W                    E            NaN                 NaN           NaN                NaN                NaN                NaN                      NaN                      NaN

太棒了,谢谢!现在我只想解析XML文件,而不是将XML文本复制粘贴到代码中。我尝试用tree=et.parse('example.xml')替换tree=et.XML(),但是出现了以下错误:“ElementTree对象不可迭代”。我应该怎么做才能解决这个问题? - user3119334
tree = et.parse(...).getroot() - AKX

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接