Convert a JSON column into a standard pandas DataFrame

Question

I have a pandas DataFrame with a JSON-formatted column, as shown below.

id	date	gender	response
1	1/14/2021	M	"{'score':3,'reason':{'description':array(['a','b','c'])}"
2	5/16/2020	F	"{'score':4,'reason':{'description':array(['x','y','z'])}"

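I want to convert it into a DataFrame by flattening the dictionaries in the response column. The dictionaries are stored as strings in the database.

Is there a simple way in Python to convert the response column into dictionary objects and then flatten them into a DataFrame like the one below?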

id	date	gender	score	description
1	1/14/2021	M	3	a
1	1/14/2021	M	3	b
1	1/14/2021	M	3	c
2	5/16/2020	F	4	x
2	5/16/2020	F	4	y
2	5/16/2020	F	4	z

- Sudhakar Samak

Hi, how do you get the initial DataFrame? Does it come from a .json file? If so, what code did you use to import it? You can use pd.json_normalize to flatten it. - NoobVB
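
For reference, here is a minimal sketch of the pd.json_normalize route mentioned in the comment above. It assumes the response string has already been parsed into a plain dictionary; the `record` value below is a hypothetical stand-in, with a Python list in place of the NumPy array from the question.

```python
import pandas as pd

# Hypothetical, already-parsed version of one `response` value
# (a plain list stands in for the NumPy array shown in the question).
record = {"score": 3, "reason": {"description": ["a", "b", "c"]}}

# json_normalize flattens nested dicts into dotted column names,
# but leaves list values as a single cell
flat = pd.json_normalize(record)

# explode turns the list-valued cell into one row per element
flat = flat.explode("reason.description").reset_index(drop=True)
print(flat)
```

Note that json_normalize alone does not unpack the list; the explode step is what produces one row per description.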

1 Answer

- Laurent · Answer 1

Given the DataFrame you provided:

import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2],
        "date": ["1/14/2021", "5/16/2020"],
        "gender": ["M", "F"],
        "response": [
            "{'score':3,'reason':{'description':array(['a','b','c'])}",
            "{'score':4,'reason':{'description':array(['x','y','z'])}",
        ],
    }
)

you can define a function to flatten the values in the response column:

def flatten(data, new_data):
    """Recursive helper function.

    Args:
        data: nested dictionary.
        new_data: empty dictionary.

    Returns:
        Flattened dictionary.

    """
    for key, value in data.items():
        if isinstance(value, list):
            for item in value:
                flatten(item, new_data)
        if isinstance(value, dict):
            flatten(value, new_data)
        # Leaf values are stored under their own key
        # (ndarray is imported from numpy in the next snippet)
        if isinstance(value, (str, int, ndarray)):
            new_data[key] = value
    return new_data
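
As a quick check of what the helper returns, the sketch below runs it on a hand-built dictionary shaped like the question's data. The helper is repeated in condensed form only so the snippet runs on its own; the `nested` and `clash` inputs are made up for illustration.

```python
import numpy as np
from numpy import ndarray


def flatten(data, new_data):
    """Condensed copy of the helper above, so this snippet is self-contained."""
    for key, value in data.items():
        if isinstance(value, list):
            for item in value:
                flatten(item, new_data)
        if isinstance(value, dict):
            flatten(value, new_data)
        if isinstance(value, (str, int, ndarray)):
            new_data[key] = value
    return new_data


# Illustrative input with the same shape as one parsed response
nested = {"score": 3, "reason": {"description": np.array(["a", "b", "c"])}}
flat = flatten(nested, {})
print(flat)  # keys are 'score' and 'description'; the parent key 'reason' is dropped

# Caveat: only leaf keys are kept, so equal names at different levels collide
clash = flatten({"name": "outer", "inner": {"name": "inner"}}, {})
print(clash)  # {'name': 'inner'}
```

Because parent keys are discarded, this helper suits data like the example, where leaf key names are unique across nesting levels.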

Then, using NumPy ndarrays to handle the arrays and Python's built-in eval function to convert the strings in the response column into dictionaries, proceed as follows:

import numpy as np
from numpy import ndarray

# In your example, closing curly braces are missing, hence the "+ '}'"
df["response"] = df["response"].apply(
    lambda x: flatten(eval(x.replace("array", "np.array") + "}"), {})
)

# For each row, flatten nested dict, make a dataframe of it
# and concat it with non nested columns
# Then, concat all new dataframes
new_df = pd.concat(
    [
        pd.concat(
            [
                # Reset the index so this single row aligns with row 0
                # of the flattened frame, whatever the value of idx
                pd.DataFrame(df.loc[idx, :])
                .T.drop(columns="response")
                .reset_index(drop=True),
                pd.DataFrame(df.loc[idx, "response"]).reset_index(drop=True),
            ],
            axis=1,
        ).ffill()
        for idx in df.index
    ]
).reset_index(drop=True)

So:

print(new_df)
# Output
   id       date gender  score description
0   1  1/14/2021      M      3           a
1   1  1/14/2021      M      3           b
2   1  1/14/2021      M      3           c
3   2  5/16/2020      F      4           x
4   2  5/16/2020      F      4           y
5   2  5/16/2020      F      4           z
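
Since eval executes whatever the string contains, it is worth avoiding on data you do not control. The following is a sketch of an alternative, not part of the original answer: it rewrites array([...]) as a plain list with a regular expression, parses the result with ast.literal_eval, and lets DataFrame.explode produce one row per description. The helper name parse_response is hypothetical, and the sketch assumes every string has exactly the shape shown in the question, including the missing closing brace.

```python
import ast
import re

import pandas as pd

df = pd.DataFrame(
    {
        "id": [1, 2],
        "date": ["1/14/2021", "5/16/2020"],
        "gender": ["M", "F"],
        "response": [
            "{'score':3,'reason':{'description':array(['a','b','c'])}",
            "{'score':4,'reason':{'description':array(['x','y','z'])}",
        ],
    }
)


def parse_response(text):
    """Hypothetical helper: parse one response string without eval."""
    # Rewrite "array([...])" as a plain list literal "[...]"
    text = re.sub(r"array\((\[.*?\])\)", r"\1", text)
    # The sample strings lack the final closing brace, so append it
    parsed = ast.literal_eval(text + "}")
    return {
        "score": parsed["score"],
        "description": parsed["reason"]["description"],
    }


# One row per original record, with "description" still holding a list
parsed = pd.DataFrame(df["response"].map(parse_response).tolist(), index=df.index)

# Attach the parsed columns, then expand each list into separate rows
new_df = (
    df.drop(columns="response")
    .join(parsed)
    .explode("description")
    .reset_index(drop=True)
)
print(new_df)
```

This gives the same six rows as above, with the descriptions in their original order (a, b, c, then x, y, z).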