Flattening nested JSON in a Pandas DataFrame

19

I am trying to load a JSON file into a pandas DataFrame and found that some of it is nested JSON. Below is a sample of the JSON:

{'events': [{'id': 142896214,
   'playerId': 37831,
   'teamId': 3157,
   'matchId': 2214569,
   'matchPeriod': '1H',
   'eventSec': 0.8935539999999946,
   'eventId': 8,
   'eventName': 'Pass',
   'subEventId': 85,
   'subEventName': 'Simple pass',
   'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}],
   'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}]}

I load the JSON into a DataFrame with the following code:

import json
import pandas as pd

with open('EVENTS.json') as f:
    jsonstr = json.load(f)

df = pd.io.json.json_normalize(jsonstr['events'])

Here is the output of df.head():

[screenshot of the df.head() output]

However, two of the columns, positions and tags, are still nested.

I tried to flatten them with the following code:

Position_data = json_normalize(data =jsonstr['events'], record_path='positions', meta = ['x','y','x','y'] )

It raised the following error:

KeyError: "Try running with errors='ignore' as key 'x' is not always present"

Could you please guide me on how to flatten positions and tags (the columns with nested data)?

Thanks, Zep

2 Answers

38

If you are looking for a more general way to flatten multiple levels of a JSON object, you can use recursion together with a list comprehension to reshape the data. Here is one alternative:

def flatten_json(nested_json, exclude=['']):
    """Flatten json object with nested keys into a single level.
        Args:
            nested_json: A nested json object.
            exclude: Keys to exclude from output.
        Returns:
            The flattened json object if successful, None otherwise.
    """
    out = {}

    def flatten(x, name='', exclude=exclude):
        if type(x) is dict:
            for a in x:
                if a not in exclude: flatten(x[a], name + a + '_')
        elif type(x) is list:
            i = 0
            for a in x:
                flatten(a, name + str(i) + '_')
                i += 1
        else:
            out[name[:-1]] = x

    flatten(nested_json)
    return out
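
As a quick illustration of the naming scheme this produces, nested dict keys are joined with underscores and list elements get their index (a small made-up example, not drawn from the sample data):

flatten_json({'a': {'b': 1}, 'c': [{'d': 2}, {'d': 3}]})
# {'a_b': 1, 'c_0_d': 2, 'c_1_d': 3}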

You can then apply it to your data, regardless of the nesting level:

New sample data

this_dict = {'events': [
  {'id': 142896214,
   'playerId': 37831,
   'teamId': 3157,
   'matchId': 2214569,
   'matchPeriod': '1H',
   'eventSec': 0.8935539999999946,
   'eventId': 8,
   'eventName': 'Pass',
   'subEventId': 85,
   'subEventName': 'Simple pass',
   'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}],
   'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]},
 {'id': 142896214,
   'playerId': 37831,
   'teamId': 3157,
   'matchId': 2214569,
   'matchPeriod': '1H',
   'eventSec': 0.8935539999999946,
   'eventId': 8,
   'eventName': 'Pass',
   'subEventId': 85,
   'subEventName': 'Simple pass',
   'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53},{'x': 51, 'y': 49}],
   'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}
]}

Usage

pd.DataFrame([flatten_json(x) for x in this_dict['events']])

Out[1]:
          id  playerId  teamId  matchId matchPeriod  eventSec  eventId  \
0  142896214     37831    3157  2214569          1H  0.893554        8   
1  142896214     37831    3157  2214569          1H  0.893554        8   

  eventName  subEventId subEventName  positions_0_x  positions_0_y  \
0      Pass          85  Simple pass             51             49   
1      Pass          85  Simple pass             51             49   

   positions_1_x  positions_1_y  tags_0_id tags_0_tag_label  positions_2_x  \
0             40             53       1801         accurate            NaN   
1             40             53       1801         accurate           51.0   

   positions_2_y  
0            NaN  
1           49.0  

Note that I did not write this flatten_json code; I came across it here and here, but I am not sure of the original source.


1
This is brilliant. - George Hayward

16
  • As pointed out in the accepted answer, flatten_json can be a great option, depending on the structure of the JSON and on how the structure should be flattened.
    • In this case, the OP wants all of the values for one event on a single row, so flatten_json works.
    • If the expected result is a separate row for each position in positions, then pandas.json_normalize is the better option (see the record_path sketch after Option 2 below).
  • The issue with flatten_json is that the number of columns per event in events can become very large if there are many positions.
  • For a more detailed explanation of using flatten_json, see How to flatten a nested JSON recursively, with flatten_json?
import pandas as pd

data = {'events': [{'id': 142896214,
                    'playerId': 37831,
                    'teamId': 3157,
                    'matchId': 2214569,
                    'matchPeriod': '1H',
                    'eventSec': 0.8935539999999946,
                    'eventId': 8,
                    'eventName': 'Pass',
                    'subEventId': 85,
                    'subEventName': 'Simple pass',
                    'positions': [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}],
                    'tags': [{'id': 1801, 'tag': {'label': 'accurate'}}]}]}

Option 1: Create one row for each dict in events

# Create the DataFrame
df = pd.DataFrame.from_dict(data)
df = df['events'].apply(pd.Series)

# display(df)
          id  playerId  teamId  matchId matchPeriod  eventSec  eventId eventName  subEventId subEventName                                 positions                                          tags
0  142896214     37831    3157  2214569          1H  0.893554        8      Pass          85  Simple pass  [{'x': 51, 'y': 49}, {'x': 40, 'y': 53}]  [{'id': 1801, 'tag': {'label': 'accurate'}}]

# Flatten positions with pd.Series
df_p = df['positions'].apply(pd.Series)
df_p_0 = df_p[0].apply(pd.Series)
df_p_1 = df_p[1].apply(pd.Series)

# Rename positions[0] & positions[1]
df_p_0.columns = ['pos_0_x', 'pos_0_y']
df_p_1.columns = ['pos_1_x', 'pos_1_y']

# Flatten tags with pd.Series
df_t = df.tags.apply(pd.Series)
df_t = df_t[0].apply(pd.Series)
df_t_t = df_t.tag.apply(pd.Series)

# Rename id & label
df_t =  df_t.rename(columns={'id': 'tags_id'})
df_t_t.columns = ['tags_tag_label']

# Combine them all with `pd.concat`
df_new = pd.concat([df, df_p_0, df_p_1, df_t.tags_id, df_t_t], axis=1)

# Drop the old columns
df_new = df_new.drop(['positions', 'tags'], axis=1)

# display(df_new)
          id  playerId  teamId  matchId matchPeriod  eventSec  eventId eventName  subEventId subEventName  pos_0_x  pos_0_y  pos_1_x  pos_1_y  tags_id tags_tag_label
0  142896214     37831    3157  2214569          1H  0.893554        8      Pass          85  Simple pass       51       49       40       53     1801       accurate
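
Note that Option 1 hard-codes positions[0] and positions[1]. If events can carry a varying number of positions, a loop over the columns of df_p avoids the hard-coding (a minimal sketch, assuming every event has a dict at every position index):

# flatten however many position columns exist, instead of hard-coding 0 and 1
pos_frames = []
for i in df_p.columns:
    d = df_p[i].apply(pd.Series)                      # expand {'x': .., 'y': ..} into columns
    d.columns = [f'pos_{i}_{c}' for c in d.columns]   # e.g. pos_0_x, pos_0_y
    pos_frames.append(d)

df_new = pd.concat([df] + pos_frames + [df_t.tags_id, df_t_t], axis=1).drop(['positions', 'tags'], axis=1)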

Option 2: Create a separate row for each position in positions

# normalize events
df = pd.json_normalize(data, 'events')

# explode all columns with lists of dicts
df = df.apply(lambda x: x.explode()).reset_index(drop=True)  # df.apply(pd.Series.explode).reset_index(drop=True) also works

# list of columns with dicts
cols_to_normalize = ['positions', 'tags']

# if any keys that will become column names overlap with existing column names,
# add the current column name as a prefix
normalized = list()
for col in cols_to_normalize:
    
    d = pd.json_normalize(df[col], sep='_')
    d.columns = [f'{col}_{v}' for v in d.columns]
    normalized.append(d.copy())

# combine df with the normalized columns
df = pd.concat([df] + normalized, axis=1).drop(columns=cols_to_normalize)

# display(df)
          id  playerId  teamId  matchId matchPeriod  eventSec  eventId eventName  subEventId subEventName  positions_x  positions_y  tags_id tags_tag_label
0  142896214     37831    3157  2214569          1H  0.893554        8      Pass          85  Simple pass           51           49     1801       accurate
1  142896214     37831    3157  2214569          1H  0.893554        8      Pass          85  Simple pass           40           53     1801       accurate
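
For completeness, the record_path/meta route attempted in the question also produces one row per position directly; the KeyError there came from listing the record-level keys 'x' and 'y' in meta, which should instead name parent-level fields. A minimal sketch using the data dict above (tags is omitted because it is itself a list of dicts and would stay nested):

# one row per position; meta copies the listed parent-level fields onto each row
pos_df = pd.json_normalize(
    data['events'],
    record_path='positions',
    meta=['id', 'playerId', 'teamId', 'matchId', 'eventName'],
    record_prefix='positions_',
)
# columns: positions_x, positions_y, id, playerId, teamId, matchId, eventName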

Dear Trenton, thank you very much for your suggestion. It worked. - Zephyr
1
Great answer. I ran the code and it still works five years later. Thank you. - undefined
