Split a JSON column in a Pandas DataFrame into multiple columns

Question


10


I have a DataFrame with two columns that contain JSON-formatted data, like this:

   author          biblio                                 series                             
Mehrdad Vahabi    {'volume': 68, 'month': 'January',      {'handle':'RePEc:aka:aoecon', 'name': 'Oeconomica'}
                   'name': 'János Kornai', 
                   'issue': 's', 'handle': 
                   'n:v:68:y:2018:i', 
                   'year': '2018', 
                   'pages': '27-52', 'doi': ''}
Michael Bailey     {'c_date': '2017', 'number':           {'handle': '', 'name': ''}
                    '23608', 'handle': 'RePEc:nbr:
                    nberwo:23608', 'name': 'Measuring'}

I want my DataFrame to look like this:

author          biblio.volume  biblio.month  biblio.name   biblio.issue  biblio.handle    biblio.year  biblio.pages  biblio.doi  biblio.c_date  biblio.number  series.handle     series.name
Mehrdad Vahabi  68             January       János Kornai  s             n:v:68:y:2018:i  2018         27-52                     NA             NA             RePEc:aka:aoecon  Oeconomica
Michael Bailey  NA             NA            Measuring     NA            nberwo:23608     NA           NA            NA          2017           23608

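I tried the answers from this question, but none of them worked for me.

How can I do this?

[Edit] Here is a sample of the data

[Edit]

Following @jezrael's solution, I get this: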

df1 = pd.DataFrame(df['biblio'].values.tolist())
df1.columns = 'biblio.'+ df1.columns

df2 = pd.DataFrame(df['series'].values.tolist())
df2.columns = 'series.'+ df2.columns

col = df.columns.difference(['biblio','series'])
df = pd.concat([df[col], df1, df2],axis=1)
print (df)


Traceback (most recent call last):
  File "dfs.py", line 8, in <module>
    df1.columns = 'bibliographic.'+ df1.columns
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/range.py", line 583, in _evaluate_numeric_binop
    other = self._validate_for_numeric_binop(other, op, opstr)
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 3961, in _validate_for_numeric_binop
    raise TypeError("can only perform ops with scalar values")
TypeError: can only perform ops with scalar values

Using the json_normalize approach:

Traceback (most recent call last):
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2525, in get_loc
    return self._engine.get_loc(key)
  File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "dfs.py", line 7, in <module>
    df = json_normalize(d)
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/io/json/normalize.py", line 192, in json_normalize
    if any([isinstance(x, dict) for x in compat.itervalues(data[0])]):
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2139, in __getitem__
    return self._getitem_column(key)
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/frame.py", line 2146, in _getitem_column
    return self._get_item_cache(key)
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 1842, in _get_item_cache
    values = self._data.get(item)
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/internals.py", line 3843, in get
    loc = self.items.get_loc(item)
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/indexes/base.py", line 2527, in get_loc
    return self._engine.get_loc(self._maybe_cast_indexer(key))
  File "pandas/_libs/index.pyx", line 117, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/index.pyx", line 139, in pandas._libs.index.IndexEngine.get_loc
  File "pandas/_libs/hashtable_class_helper.pxi", line 1265, in pandas._libs.hashtable.PyObjectHashTable.get_item
  File "pandas/_libs/hashtable_class_helper.pxi", line 1273, in pandas._libs.hashtable.PyObjectHashTable.get_item
KeyError: 0

Following @Jhon H's solution, I get this:

Traceback (most recent call last):
  File "dfs.py", line 7, in <module>
    jsonSeries = df[['bibliographic']].tolist()
  File "/Users/danielotero/anaconda3/lib/python3.6/site-packages/pandas/core/generic.py", line 3614, in __getattr__
    return object.__getattribute__(self, name)
AttributeError: 'DataFrame' object has no attribute 'tolist'

- Moses

Is JSON the input for the DataFrame? Maybe you can create the df directly from the JSON, without columns of dicts. - jezrael

2 Answers

3

You need to process each column separately and then concatenate the results to get the format you want. Here is a simple example you can follow.

import pandas as pd
records = [{'col1':'v1','col2':{'a1':1,'b1':1},'col3':{'c1':1,'d1':1}},
           {'col1':'v2','col2':{'a1':2,'b1':2},'col3':{'c1':2,'d1':2}}]
sample_df = pd.DataFrame(records)
sample_df


    col1      col2                col3
0   v1  {'a1': 1, 'b1': 1}  {'c1': 1, 'd1': 1}
1   v2  {'a1': 2, 'b1': 2}  {'c1': 2, 'd1': 2}


col2_expanded = sample_df.col2.apply(lambda x:pd.Series(x))
col2_expanded.columns = ['{}.{}'.format('col2',i) for i in col2_expanded]
col2_expanded


    col2.a1     col2.b1
0   1             1
1   2             2

col3_expanded = sample_df.col3.apply(lambda x:pd.Series(x))
col3_expanded.columns = ['{}.{}'.format('col3',i) for i in col3_expanded]
col3_expanded


    col3.c1     col3.d1
0   1            1
1   2            2

final = pd.concat([sample_df[['col1']],col2_expanded,col3_expanded],axis=1)
final

    col1    col2.a1     col2.b1     col3.c1     col3.d1
0   v1       1           1            1         1
1   v2       2           2            2         2
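
The per-column steps above repeat the same pattern, so they could be folded into a small helper. This is only a sketch: `expand_dict_columns` is a hypothetical name, and it assumes every cell in the listed columns is a real dict (not a string). It builds each expanded frame with the constructor rather than `apply(pd.Series)`, which tends to be slow on larger frames.

```python
import pandas as pd

def expand_dict_columns(df, columns):
    """Expand each dict-valued column in `columns` into prefixed columns."""
    # Start with the columns that are left untouched.
    parts = [df.drop(columns=columns)]
    for col in columns:
        # One row per dict; reuse the original index so concat aligns rows.
        expanded = pd.DataFrame(df[col].tolist(), index=df.index)
        expanded.columns = [f"{col}.{key}" for key in expanded.columns]
        parts.append(expanded)
    return pd.concat(parts, axis=1)

records = [{'col1': 'v1', 'col2': {'a1': 1, 'b1': 1}, 'col3': {'c1': 1, 'd1': 1}},
           {'col1': 'v2', 'col2': {'a1': 2, 'b1': 2}, 'col3': {'c1': 2, 'd1': 2}}]
sample_df = pd.DataFrame(records)

final = expand_dict_columns(sample_df, ['col2', 'col3'])
print(final.columns.tolist())
# ['col1', 'col2.a1', 'col2.b1', 'col3.c1', 'col3.d1']
```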

- vumaasha


- jezrael · Accepted Answer

Create a new DataFrame for each dict column with the constructor, then join everything together with a final concat:

df1 = pd.DataFrame(df['biblio'].values.tolist())
df1.columns = 'biblio.'+ df1.columns

df2 = pd.DataFrame(df['series'].values.tolist())
df2.columns = 'series.'+ df2.columns

col = df.columns.difference(['biblio','series'])
df = pd.concat([df[col], df1, df2],axis=1)
print (df)
           author biblio.c_date biblio.doi           biblio.handle  \
0  Mehrdad Vahabi           NaN                    n:v:68:y:2018:i   
1  Michael Bailey          2017        NaN  RePEc:nbr:nberwo:23608   

  biblio.issue biblio.month   biblio.name biblio.number biblio.pages  \
0            s      January  Janos Kornai           NaN        27-52   
1          NaN          NaN     Measuring         23608          NaN   

   biblio.volume biblio.year     series.handle series.name  
0           68.0        2018  RePEc:aka:aoecon  Oeconomica  
1            NaN         NaN
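
One thing to watch with the constructor approach: a frame built from a plain list gets a fresh 0..n-1 index, while `concat(axis=1)` aligns on index labels. If `df` has a non-default index (for example after filtering), the rows may not line up. A minimal sketch with made-up data, passing `index=df.index` to keep them aligned:

```python
import pandas as pd

# Made-up data with a non-default index, e.g. left over from filtering.
df = pd.DataFrame(
    {'author': ['A', 'B'],
     'series': [{'handle': 'h1', 'name': 'n1'},
                {'handle': 'h2', 'name': 'n2'}]},
    index=[10, 20],
)

# index=df.index keeps the expanded rows aligned with the original rows.
df2 = pd.DataFrame(df['series'].tolist(), index=df.index).add_prefix('series.')

out = pd.concat([df.drop(columns='series'), df2], axis=1)
print(out)
```

Without `index=df.index`, the concat here would produce four rows (labels 0, 1, 10, 20) padded with NaN instead of two.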

EDIT:

If the input is in JSON format, you can use json_normalize:

from pandas.io.json import json_normalize

d = [{"author":"Mehrdad Vahabi","biblio":{"volume":68,"month":"January","name":"Janos Kornai","issue":"s","handle":"n:v:68:y:2018:i","year":"2018","pages":"27-52","doi":""},"series":{"handle":"RePEc:aka:aoecon","name":"Oeconomica"}},{"author":"Michael Bailey","biblio":{"c_date":"2017","number":"23608","handle":"RePEc:nbr:nberwo:23608","name":"Measuring"},"series":{"handle":"","name":""}}]

df = json_normalize(d)
print (df)
           author biblio.c_date biblio.doi           biblio.handle  \
0  Mehrdad Vahabi           NaN                    n:v:68:y:2018:i   
1  Michael Bailey          2017        NaN  RePEc:nbr:nberwo:23608   

  biblio.issue biblio.month   biblio.name biblio.number biblio.pages  \
0            s      January  Janos Kornai           NaN        27-52   
1          NaN          NaN     Measuring         23608          NaN   

   biblio.volume biblio.year     series.handle series.name  
0           68.0        2018  RePEc:aka:aoecon  Oeconomica  
1            NaN         NaN
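
On pandas 1.0 and later, `json_normalize` is also exposed at the top level as `pd.json_normalize`, and the `pandas.io.json` import path was deprecated. The sketch below assumes such a version and uses a shortened copy of the data. It also shows one way to flatten an existing DataFrame whose cells are real dicts, by converting it to records first.

```python
import pandas as pd

d = [
    {"author": "Mehrdad Vahabi",
     "biblio": {"volume": 68, "month": "January"},
     "series": {"handle": "RePEc:aka:aoecon", "name": "Oeconomica"}},
    {"author": "Michael Bailey",
     "biblio": {"c_date": "2017", "number": "23608"},
     "series": {"handle": "", "name": ""}},
]

# Nested keys are joined with "." by default (the `sep` parameter).
flat = pd.json_normalize(d)
print(sorted(flat.columns))

# Assumption: df already exists and its cells are dicts, not strings.
df = pd.DataFrame(d)
flat_from_df = pd.json_normalize(df.to_dict('records'))
```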

EDIT: Your dictionaries are strings, so you first need to convert them with ast.literal_eval:

import ast
import pandas as pd

df = pd.read_csv('probe.csv')
#print (df)

df1 = pd.DataFrame(df['bibliographic'].apply(ast.literal_eval).values.tolist())
df1.columns = 'bibliographic.'+ df1.columns

df2 = pd.DataFrame(df['series'].apply(ast.literal_eval).values.tolist())
df2.columns = 'series.'+ df2.columns

col = df.columns.difference(['bibliographic','series'])
df = pd.concat([df[col], df1, df2],axis=1)
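
To check whether this is what is happening in your data, look at the type of the cells. The sketch below reproduces the situation by round-tripping a small made-up frame through CSV in memory (the real `probe.csv` is not used here), then parses the strings back into dicts. `parse_cell` is a hypothetical helper that leaves non-string cells such as NaN untouched.

```python
import ast
import io
import pandas as pd

# Round-trip through CSV to reproduce the problem: dicts come back as strings.
original = pd.DataFrame({
    'author': ['Mehrdad Vahabi', 'Michael Bailey'],
    'series': [{'handle': 'RePEc:aka:aoecon', 'name': 'Oeconomica'},
               {'handle': '', 'name': ''}],
})
buffer = io.StringIO()
original.to_csv(buffer, index=False)
buffer.seek(0)
df = pd.read_csv(buffer)

# Each cell is now a str, not a dict.
print(df['series'].map(type).unique())

def parse_cell(value):
    # Hypothetical helper: parse string cells, pass everything else through.
    return ast.literal_eval(value) if isinstance(value, str) else value

df['series'] = df['series'].map(parse_cell)
```

After this step the column holds real dicts again, so the constructor and concat steps shown above apply. Rows with missing values would still need separate handling before expansion.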