如何将嵌套的JSON文件识别并作为数据框的列展开?

3

我再次重新表述我的问题,以便更加清晰。 我的数据长这样。

{
    "Research": {
        "@xmlns": "http://www.xml.org/2013/2/XML",
        "@language": "eng",
        "@createDateTime": "2022-03-25T10:12:39Z",
        "@researchID": "abcd",
        "Product": {
            "@productID": "abcd",
            "StatusInfo": {
                "@currentStatusIndicator": "Yes",
                "@statusDateTime": "2022-03-25T12:18:41Z",
                "@statusType": "Published"
            },
            "Source": {
                "Organization": {
                    "@primaryIndicator": "Yes",
                    "@type": "SellSideFirm",
                    "OrganizationID": [
                        {
                            "@idType": "L1",
                            "#text": "D827C98E315F"
                        },
                        {
                            "@idType": "TR",
                            "#text": "3202"
                        },
                        {
                            "@idType": "TR",
                            "#text": "SZA"
                        }
                    ],
                    "OrganizationName": {
                        "@nameType": "Legal",
                        "#text": "Citi"
                    },
                    "PersonGroup": {
                        "PersonGroupMember": {
                            "@primaryIndicator": "Yes",
                            "@sequence": "1",
                            "Person": {
                                "@personID": "tr56",
                                "FamilyName": "Wang",
                                "GivenName": "Bond",
                                "DisplayName": "Bond Wang",
                                "Biography": "Bond Wang is a",
                                "BiographyFormatted": "Bond Wang",
                                "PhotoResourceIdRef": "AS44556"
                            }
                        }
                    }
                }
            },
            "Content": {
                "Title": "Premier",
                "Abstract": "None",
                "Synopsis": "Premier’s solid 1H22 result .",
                "Resource": [
                    {
                        "@language": "eng",
                        "@primaryIndicator": "Yes",
                        "@resourceID": "9553",
                        "Length": {
                            "@lengthUnit": "Pages",
                            "#text": "17"
                        },
                        "MIMEType": "text/html",
                        "URL": "https://www.DFKJG.com/rendition/eppublic"
                    },
                    {
                        "@language": "eng",
                        "@primaryIndicator": "No",
                        "@resourceID": "4809",
                        "Length": {
                            "@lengthUnit": "Pages",
                            "#text": "17"
                        },
                        "MIMEType": "ABS/pdf",
                        "Name": "asdf.pdf",
                        "Comments": "fr5.pdf"
                    },
                    {
                        "@language": "eng",
                        "@primaryIndicator": "No",
                        "@resourceID": "6d13a965723e",
                        "Length": {
                            "@lengthUnit": "Pages",
                            "#text": "17"
                        },
                        "MIMEType": "text/html",
                        "URL": "https://www.dfgdfg.com/"
                    },
                    {
                        "@primaryIndicator": "No",
                        "@resourceID": "709c7bdb1c99",
                        "MIMEType": "tyy/image",
                        "URL": "https://ir.ght.com"
                    },
                    {
                        "@primaryIndicator": "No",
                        "@resourceID": "gfjhgj",
                        "MIMEType": "gtty/image",
                        "URL": "https://ir.gtty.com"
                    }
                ]
            },
            "Context": {
                "@external": "Yes",
                "IssuerDetails": {
                    "Issuer": {
                        "@issuerType": "Corporate",
                        "@primaryIndicator": "Yes",
                        "SecurityDetails": {
                            "Security": {
                                "@estimateAction": "Revision",
                                "@primaryIndicator": "Yes",
                                "@targetPriceAction": "Increase",
                                "SecurityID": [
                                    {
                                        "@idType": "RIC",
                                        "@idValue": "PMV.AX",
                                        "@publisherDefinedValue": "RIC"
                                    },
                                    {
                                        "@idType": "Bloomberg",
                                        "@idValue": "PMV@AU"
                                    },
                                    {
                                        "@idType": "SEDOL",
                                        "@idValue": "6699781"
                                    }
                                ],
                                "SecurityName": "Premier Investments Ltd",
                                "AssetClass": {
                                    "@assetClass": "Equity"
                                },
                                "AssetType": {
                                    "@assetType": "Stock"
                                },
                                "SecurityType": {
                                    "@securityType": "Common"
                                },
                                "Rating": {
                                    "@rating": "NeutralSentiment",
                                    "@ratingType": "Rating",
                                    "@aspect": "Investment",
                                    "@ratingDateTime": "2020-07-31T08:24:37Z",
                                    "RatingEntity": {
                                        "@ratingEntity": "PublisherDefined",
                                        "PublisherDefinedValue": "Citi"
                                    }
                                }
                            }
                        },
                        "IssuerID": {
                            "@idType": "PublisherDefined",
                            "@idValue": "PMV.AX",
                            "@publisherDefinedValue": "TICKER"
                        },
                        "IssuerName": {
                            "@nameType": "Legal",
                            "NameValue": "Premier Investments Ltd"
                        }
                    }
                },
                "ProductDetails": {
                    "@periodicalIndicator": "No",
                    "@publicationDateTime": "2022-03-25T12:18:41Z",
                    "ProductCategory": {
                        "@productCategory": "Report"
                    },
                    "ProductFocus": {
                        "@focus": "Issuer",
                        "@primaryIndicator": "Yes"
                    },
                    "EntitlementGroup": {
                        "Entitlement": [
                            {
                                "@includeExcludeIndicator": "Include",
                                "@primaryIndicator": "No",
                                "AudienceTypeEntitlement": {
                                    "@audienceType": "PublisherDefined",
                                    "@entitlementContext": "TR",
                                    "#text": "20012"
                                }
                            },
                            {
                                "@includeExcludeIndicator": "Include",
                                "@primaryIndicator": "No",
                                "AudienceTypeEntitlement": {
                                    "@audienceType": "PublisherDefined",
                                    "@entitlementContext": "TR",
                                    "#text": "2001"
                                }
                            }
                        ]
                    }
                },
                "ProductClassifications": {
                    "Discipline": {
                        "@disciplineType": "Investment",
                        "@researchApproach": "Fundamental"
                    },
                    "Subject": {
                        "@publisherDefinedValue": "TREPS",
                        "@subjectValue": "PublisherDefined"
                    },
                    "Country": {
                        "@code": "AU",
                        "@primaryIndicator": "Yes"
                    },
                    "Region": {
                        "@primaryIndicator": "Yes",
                        "@emergingIndicator": "No",
                        "@regionType": "Australasia"
                    },
                    "AssetClass": {
                        "@assetClass": "Equity"
                    },
                    "AssetType": {
                        "@assetType": "Stock"
                    },
                    "SectorIndustry": [
                        {
                            "@classificationType": "GICS",
                            "@code": "25201040",
                            "@focusLevel": "Yes",
                            "@level": "4",
                            "@primaryIndicator": "Yes",
                            "Name": "Household Appliances"
                        },
                        {
                            "@classificationType": "GICS",
                            "@code": "25504020",
                            "@focusLevel": "Yes",
                            "@level": "4",
                            "@primaryIndicator": "Yes",
                            "Name": "Computer & Electronics Retail"
                        },
                        {
                            "@classificationType": "GICS",
                            "@code": "25504040",
                            "@focusLevel": "Yes",
                            "@level": "4",
                            "@primaryIndicator": "Yes",
                            "Name": "Specialty Stores"
                        },
                        {
                            "@classificationType": "GICS",
                            "@code": "25504030",
                            "@focusLevel": "Yes",
                            "@level": "4",
                            "@primaryIndicator": "Yes",
                            "Name": "Home Improvement Retail"
                        },
                        {
                            "@classificationType": "GICS",
                            "@code": "25201050",
                            "@focusLevel": "Yes",
                            "@level": "4",
                            "@primaryIndicator": "Yes",
                            "Name": "Housewares & Specialties"
                        }
                    ]
                }
            }
        }
    }
}

我希望将其所有元素展开成数据框。 列数可能会发生变化,且具有类似列表的结构。 基本上我们不知道下一个输入是否需要展开少量或更多列。
这是我迄今为止尝试过的,但看起来它没有给我正确的答案。 此外,我已经硬编码了列值,但它应该识别并展开。
import xmltodict as xmltodict
from pprint import pprint
import pandas as pd
import json
from tabulate import tabulate

dict =(xmltodict.parse("""xml data"""))

json_str = json.dumps(dict)
resp = json.loads(json_str)
print(resp)
df = pd.json_normalize(resp)
    
cols=['Research.Product.Source.Organization.OrganizationID','Research.Product.Content.Resource','Research.Product.Context.IssuerDetails.Issuer.SecurityDetails.Security.SecurityID','Research.Product.Context.ProductDetails.EntitlementGroup.Entitlement','Research.Product.Context.ProductClassifications.SectorIndustry']
    
def expplode_columns(df, cols):
    df_e = df.copy()
    for c in cols:
        df_e = df_e.explode(c, ignore_index=True)
    return df_e


df2 = expplode_columns(df, cols)
print(tabulate(df2, headers="keys", tablefmt="psql"))
# df2.to_csv('dataframe.csv', header=True, index=False)

1
我猜你想要递归的东西。基本上,不断地扩展一列,直到值变成字符串而不是列表为止。虽然我从未用pandas做过类似的事情,但我真的不确定它会是什么样子。 - rayad
2个回答

2

有点hacky,但是你可以提取包含列表类型的列。然后使用reduce递归地展开和规范化所有列,直到没有更多的列表/对象。

我还没有好好测试过,但大概是这样的。

from functools import reduce

def full_explode_normalize(df):
    # Extract list columns 
    explode_cols = [x for x in df.columns if isinstance(df.iloc[0][x], list)]
    if len(explode_cols) < 1:
        return df
    
    # Explode and normalize the list
    df = reduce(_explode, explode_cols, df)

    return df

def _explode(df, col):
    df = df.explode(col)

    if isinstance(df.iloc[0][col], list):
        df = _explode(df, col)
    elif isinstance(df.iloc[0][col], object):
        df_child = pd.json_normalize(df[col])
        # To prevent column name collision, add the parent column name as prefix.
        df_child.columns = [f'{col}.{x}' for x in df_child.columns]
        df = pd.concat([df.loc[:, ~df.columns.isin([col])].reset_index(drop=True), df_child], axis=1)
    
    return df

这是一个很好的解决方案。 - Yaakov Bressler

2

根据评论的建议,您可以在纯Python中定义一个辅助函数来递归地展平数据的嵌套值。

因此,使用您提供的json文件,这是一种方法:

def flatten(data, new_data):
    """Recursive helper function.

    Args:
        data: nested dictionary.
        new_data: empty dictionary.

    Returns:
        Flattened dictionary.

    """
    for key, value in data.items():
        if isinstance(value, dict):
            flatten(value, new_data)
        if isinstance(value, str) or isinstance(value, int) or isinstance(value, list):
            new_data[key] = value
    return new_data

然后:

import json

import pandas as pd

with open("file.json") as f:
    content = json.load(f)

df = pd.DataFrame.from_dict(flatten(content, {}), orient="index").T

在这里,您可以通过将包含相同键但不同值的字典列表展开并重复其他值来处理列,如下所示:

cols_with_lists = [col for col in df.columns if isinstance(df.loc[0, col], list)]

for col in cols_with_lists:
    temp_df = pd.concat(
        [pd.DataFrame(item, index=[i]) for i, item in enumerate(df.loc[0, col])],
        axis=0,
    )
    df = pd.concat([df.drop(columns=[col]), temp_df], axis=1).fillna(method="ffill")

因此,最终,JSON文件完全展开:
print(df)
# Output
                          @xmlns @language  ... @primaryIndicator                           Name
0  http://www.xml.org/2013/2/XML       eng  ...               Yes           Household Appliances
1  http://www.xml.org/2013/2/XML       eng  ...               Yes  Computer & Electronics Retail
2  http://www.xml.org/2013/2/XML       eng  ...               Yes               Specialty Stores
3  http://www.xml.org/2013/2/XML       eng  ...               Yes        Home Improvement Retail
4  http://www.xml.org/2013/2/XML       eng  ...               Yes       Housewares & Specialties

[5 rows x 73 columns]

这就是我也得到的,但问题是,我想要爆炸所有列,如OrganizationID、Resource,而不需要在explode中提及名称。 - Atharv Thakur
1
这是完全可行的,但你想如何处理重复的键?覆盖相关值?还是将它们在行中分解并重复非列表列的值? - Laurent
我们现在想要重复非列表列的值。你的代码非常干净易懂。 - Atharv Thakur
在flatten中,new_data参数是否是必需的?它有什么作用? - Pritam Dodeja
new_data是一个空字典,用于接收JSON文件中嵌套的键/值对。如果这更清晰明了,它可以从函数参数中删除并在函数内部定义。请参见我的更新答案;干杯。 - Laurent

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接