将json转换为pandas DataFrame

5

我有一个JSON文件,其中包含多个对象,例如:

 {"reviewerID": "bc19970fff3383b2fe947cf9a3a5d7b13b6e57ef2cd53abc52bb2dfedf5fb1cd", "asin": "a6ed402934e3c1138111dce09256538afb04c566edf37c16b9ba099d23afb764", "overall": 2.0, "helpful": {"nHelpful": 1, "outOf": 1}, "reviewText": "This remote, for whatever reason, was chosen by Time Warner to replace their previous silver remote, the Time Warner Synergy V RC-U62CP-1.12S.  The actual function of this CLIKR-5 is OK, but the ergonomic design sets back remotes by 20 years.  The buttons are all the same, there's no separation of the number buttons, the volume and channel buttons are the same shape as the other buttons on the remote, and it all adds up to a crappy user experience.  Why would TWC accept this as a replacement?    I'm skipping this and paying double for a refurbished Synergy V.", "summary": "Ergonomic nightmare", "unixReviewTime": 1397433600}

{"reviewerID": "3689286c8658f54a2ff7aa68ce589c81f6cae4c4d9de76fa0f66d5c114f79837", "asin": "8939d791e9dd035aa58da024ace69b20d651cea4adf6159d984872b44f663301", "overall": 4.0, "helpful": {"nHelpful": 21, "outOf": 22}, "reviewText": "This is a great truck GPS. I've tried others and nothing seems to come close to the Rand McNally TND-700.Excellent screen size and resolution. The audio is loud enough to be heard over road noise and the purr of my Kenworth/Cat engine. I've used it for the last 8,000 miles or so and it has only glitched once. Just restarted it and it picked up on my route right where it should have.Clean up the minor issues and this unit rates a solid 5.Rand McNally 528881469 7-inch Intelliroute TND 700 Truck GPS", "summary": "Great Unit!", "unixReviewTime": 1280016000}

我正在尝试使用以下代码将其转换为Pandas DataFrame:
train_df = pd.DataFrame()
count = 0;
for l in open('train.json'):
    try:
        count +=1
        if(count==20001):
            break
        obj1 = json.loads(l)
        df1=pd.DataFrame(obj1, index=[0])
        train_df = train_df.append(df1, ignore_index=True)
    except ValueError:
        line = line.replace('\\','')
        obj = json.loads(line)
        df1=pd.DataFrame(obj, index=[0])
        train_df = train_df.append(df1, ignore_index=True)

然而,对于嵌套值,如 'helpful' 属性,它会给我返回 'NaN'。我想输出的结果是,嵌套属性的两个键都是单独的列。
编辑:
附言:我正在使用 try/except,因为某些对象中有 '\' 字符,这会导致 JSON 解码错误。
有人能帮忙吗?有其他方法可以使用吗?
谢谢。

你尝试过使用 pandas.read_json 吗?http://pandas.pydata.org/pandas-docs/stable/generated/pandas.read_json.html - DeepSpace
@DeepSpace 是的,我有。它给了我一个错误,说“ValueError:'trailing data'”。 - Yashvardhan Nanavati
尾随数据意味着您的文件中有额外的数据,这些数据不是JSON对象的一部分。请查看您的文件,并确保它是有效的JSON。 - RichSmith
@RichSmith 我尝试查看文件,但是文件太大了,无法在编辑器中打开。此外,当我尝试使用上面的代码时,它给了我一个数据框,但对于嵌套属性“helpful”,它只给出了“NaN”。 - Yashvardhan Nanavati
2个回答

4

使用json_normalize处理字典列表,该方法在大量JSON对象上的处理速度相对较快。

from pandas.io.json import json_normalize

my_list = []
with open('train.json') as f:
    for line in f:
        line = line.replace('\\','')
        my_list.append(json.loads(line))

# avoid transposing if you want to keep keys as columns of the dataframe
result_df = json_normalize(my_list).T

enter image description here


0

尝试:

pd.concat([pd.Series(json.loads(line)) for line in open('train.json')], axis=1)

enter image description here


1
这似乎是有效的。是否有一种方法可以将上述解决方案仅应用于前100个对象并将它们存储在单独的数据框中?该文件非常大,我无法对整个文件运行上述解决方案。还有一种方法可以在其中使用try / except吗?因为我在某些对象中有一个“\”,这会导致JsonDecodeError错误。 - Yashvardhan Nanavati

网页内容由stack overflow 提供, 点击上面的
可以查看英文原文,
原文链接