Convert a JSON file to a Pandas DataFrame

Question

5

I want to convert a JSON document to a Pandas DataFrame.

My JSON looks like this:

{ 
   "country1":{ 
      "AdUnit1":{ 
         "floor_price1":{ 
            "feature1":1111,
            "feature2":1112
         },
         "floor_price2":{ 
            "feature1":1121
         }
      },
      "AdUnit2":{ 
         "floor_price1":{ 
            "feature1":1211
         },
         "floor_price2":{ 
            "feature1":1221
         }
      }
   },
   "country2":{ 
      "AdUnit1":{ 
         "floor_price1":{ 
            "feature1":2111,
            "feature2":2112
         }
      }
   }
}

I read the file from GCP with the following code:

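# Assumed imports for Google Cloud Datalab; the original snippet omitted them
from google.datalab import Context
import google.datalab.storage as storage

project = Context.default().project_id
sample_bucket_name = 'my_bucket'
sample_bucket_path = 'gs://' + sample_bucket_name
print('Object: ' + sample_bucket_path + '/json_output.json')

sample_bucket = storage.Bucket(sample_bucket_name)
sample_bucket.create()
sample_bucket.exists()

sample_object = sample_bucket.object('json_output.json')
list(sample_bucket.objects())
# Named json_ rather than json so the standard json module is not shadowed
json_ = sample_object.read_stream()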

My goal is to get a Pandas DataFrame that looks like this:

I tried using json_normalize, but without success.

- Alexandr Fruman

What about pd.read_json? - Nicolas Gervais

I tried it, but the result was not good: https://c2n.me/44pYvfb - Alexandr Fruman

2

Take a look at this answer. I think you need to "flatten" the JSON first, before you can use pd.read_json(json.dumps(json_dictionary)). - Zionsof

4 Answers

1

You can use this:

def flatten_dict(d):
    """ Returns list of lists from given dictionary """
    l = []
    for k, v in sorted(d.items()):
        if isinstance(v, dict):
            flatten_v = flatten_dict(v)
            for my_l in reversed(flatten_v):
                my_l.insert(0, k)

            l.extend(flatten_v)

        elif isinstance(v, list):
            for l_val in v:
                l.append([k, l_val])

        else:
            l.append([k, v])

    return l

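This function takes a dictionary (nested dictionaries included, and values may also be lists) and flattens it into a list of lists.

Then you simply do: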

df = pd.DataFrame(flatten_dict(my_dict))

where my_dict is your JSON object. With your example, print(df) gives:

          0        1             2         3     4
0  country1  AdUnit1  floor_price1  feature1  1111
1  country1  AdUnit1  floor_price1  feature2  1112
2  country1  AdUnit1  floor_price2  feature1  1121
3  country1  AdUnit2  floor_price1  feature1  1211
4  country1  AdUnit2  floor_price2  feature1  1221
5  country2  AdUnit1  floor_price1  feature1  2111
6  country2  AdUnit1  floor_price1  feature2  2112

When you create the DataFrame, you can name the columns and the index.
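
As a minimal sketch of that last point, the column labels can be passed through the `columns` argument of `pd.DataFrame`. The names `country`, `ad_unit`, `floor_price`, `feature`, and `value` are assumptions chosen for illustration, not names from the question, and only the first two flattened rows are reproduced.

```python
import pandas as pd

# Two rows in the shape produced by flatten_dict for the example JSON
rows = [
    ["country1", "AdUnit1", "floor_price1", "feature1", 1111],
    ["country1", "AdUnit1", "floor_price1", "feature2", 1112],
]

# Hypothetical column names; pick whatever fits your data
column_names = ["country", "ad_unit", "floor_price", "feature", "value"]

df = pd.DataFrame(rows, columns=column_names)
print(df)
```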

- Zionsof

1

You can try this approach:

from google.cloud import storage
import json
import pandas as pd

storage_client = storage.Client()
bucket = storage_client.get_bucket('test-mvladoi')
blob = bucket.blob('file')
read_output = blob.download_as_string()
data = json.loads(read_output)

# Flatten the nested JSON into a single row whose column names are dotted key paths
data_norm = pd.json_normalize(data, max_level=5)
df = pd.DataFrame(columns=['col1', 'col2', 'col3', 'col4', 'col5'])
i = 0

for col in data_norm.columns:
    # e.g. 'country1.AdUnit1.floor_price1.feature1' splits into four path components
    a, c, d, e = col.split('.')
    df.loc[i] = [a, c, d, e, data_norm[col][0]]
    i = i + 1

print(df)

- marian.vladoi

I am working in Google Data Lab, and from google.cloud import storage causes some problems there. - Alexandr Fruman

Tried it. AttributeError: module 'google.datalab.storage' has no attribute 'Client'. - Alexandr Fruman

1

They handle storage differently: mybucket = storage.Bucket('BUCKET_NAME'), blob = mybucket.object('file'). Look it up on the internet. - marian.vladoi

0

Not the best way, but it works. You should also modify the flatten function, which was simply taken from this answer.

test = { 
   "country1":{ 
      "AdUnit1":{ 
         "floor_price1":{ 
            "feature1":1111,
            "feature2":1112
         },
         "floor_price2":{ 
            "feature1":1121
         }
      },
      "AdUnit2":{ 
         "floor_price1":{ 
            "feature1":1211
         },
         "floor_price2":{ 
            "feature1":1221
         }
      }
   },
   "country2":{ 
      "AdUnit1":{ 
         "floor_price1":{ 
            "feature1":2111,
            "feature2":2112
         }
      }
   }
}

from collections import defaultdict
from collections.abc import MutableMapping
import pandas as pd

def flatten(d, parent_key='', sep='_'):
    items = []
    for k, v in d.items():
        new_key = parent_key + sep + k if parent_key else k
        # MutableMapping lives in collections.abc (removed from collections in Python 3.10)
        if isinstance(v, MutableMapping):
            items.extend(flatten(v, new_key, sep=sep).items())
        else:
            items.append((new_key, v))
    return dict(items)

results = defaultdict(list)   
colnames = ["col1", "col2", "col3", "col4", "col5", "col6"]
for key, value in flatten(test).items():
    elements = key.split("_")
    elements.append(value)
    for colname, element in zip(colnames, elements):
        results[colname].append(element)

df = pd.DataFrame(results)
print(df)
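
The split on `_` above also cuts keys such as `floor_price1` in two, which is why six column names are needed. One possible modification, sketched here as an assumption rather than the answerer's code, keeps each key path as a tuple so that no separator is involved. The name `flatten_to_rows` and the column labels are hypothetical.

```python
import pandas as pd

def flatten_to_rows(d, path=()):
    """Yield one tuple per leaf: the keys leading to it, then the leaf value."""
    for key, value in d.items():
        if isinstance(value, dict):
            # Descend, extending the key path instead of joining strings
            yield from flatten_to_rows(value, path + (key,))
        else:
            yield path + (key, value)

# Shortened version of the example JSON
sample = {
    "country1": {
        "AdUnit1": {
            "floor_price1": {"feature1": 1111, "feature2": 1112},
            "floor_price2": {"feature1": 1121},
        }
    }
}

rows = list(flatten_to_rows(sample))
df = pd.DataFrame(rows, columns=["col1", "col2", "col3", "col4", "col5"])
print(df)
```

Because the keys are never joined into one string, underscores inside a key are preserved and every row has exactly five fields for this input.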

- Florian Bernard

- Luc Bertin · Accepted Answer

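Working with nested JSON is always rather tricky.

A few months ago, I found a way to provide a "general answer" using the elegantly written flatten_json_iterative_solution from here: it iteratively unpacks each level of the given JSON.

The result can then simply be converted to a Pandas.Series and then to a Pandas.DataFrame, like this: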
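
The linked function itself is not reproduced in this answer. As a rough sketch of the idea only, and not the original `flatten_json_iterative_solution`, an iterative flattener can keep a stack of (key prefix, value) pairs and unpack one nesting level at a time. The name `flatten_iteratively` and the `_` separator are assumptions, and lists are left untouched for brevity.

```python
def flatten_iteratively(nested, sep="_"):
    """Flatten nested dicts into a single dict keyed by joined key paths."""
    flat = {}
    # Each stack entry is (key prefix so far, value still to be unpacked)
    stack = [("", nested)]
    while stack:
        prefix, value = stack.pop()
        if isinstance(value, dict):
            # Push children in reverse so they are popped in their original order
            for key, child in reversed(list(value.items())):
                new_prefix = prefix + sep + key if prefix else key
                stack.append((new_prefix, child))
        else:
            flat[prefix] = value
    return flat

sample = {"country2": {"AdUnit1": {"floor_price1": {"feature1": 2111, "feature2": 2112}}}}
print(flatten_iteratively(sample))
```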

df = pd.Series(flatten_json_iterative_solution(dict(json_))).to_frame().reset_index()

Intermediate DataFrame result

Some data transformations can then easily be performed to split the index into the columns you asked for:

df[["index", "col1", "col2", "col3", "col4"]] = df['index'].apply(lambda x: pd.Series(x.split('_')))

Final result
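
Note that splitting on `_` also splits `floor_price1` into `floor` and `price1`, which is why five target columns appear above. The following sketch shows the same Series-to-DataFrame steps on a small hand-written flat dict. It assumes a `|` separator that never occurs inside a key, and the column names and sample values are illustrative only.

```python
import pandas as pd

# Hand-written stand-in for the flattened JSON, using '|' between key levels
flat = {
    "country1|AdUnit1|floor_price1|feature1": 1111,
    "country1|AdUnit1|floor_price2|feature1": 1121,
}

# The unnamed index becomes a column called "index" after reset_index()
df = pd.Series(flat, name="value").to_frame().reset_index()

# A one-character separator is treated as a literal string by str.split
parts = df["index"].str.split("|", expand=True)
parts.columns = ["col1", "col2", "col3", "col4"]

result = pd.concat([parts, df["value"]], axis=1)
print(result)
```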