Loading and parsing a JSON file with multiple JSON objects

Question

149

I am trying to load and parse a JSON file in Python, but I'm stuck at loading the file:

import json
json_data = open('file')
data = json.load(json_data)

Output:

ValueError: Extra data: line 2 column 1 - line 225116 column 1 (char 232 - 160128774)

I looked at 18.2. json — JSON encoder and decoder in the Python documentation, but it is hard to read and rather discouraging.

The first few lines (anonymized with randomized entries):

{"votes": {"funny": 2, "useful": 5, "cool": 1}, "user_id": "harveydennis", "name": "Jasmine Graham", "url": "http://example.org/user_details?userid=harveydennis", "average_stars": 3.5, "review_count": 12, "type": "user"}
{"votes": {"funny": 1, "useful": 2, "cool": 4}, "user_id": "njohnson", "name": "Zachary Ballard", "url": "https://www.example.com/user_details?userid=njohnson", "average_stars": 3.5, "review_count": 12, "type": "user"}
{"votes": {"funny": 1, "useful": 0, "cool": 4}, "user_id": "david06", "name": "Jonathan George", "url": "https://example.com/user_details?userid=david06", "average_stars": 3.5, "review_count": 12, "type": "user"}
{"votes": {"funny": 6, "useful": 5, "cool": 0}, "user_id": "santiagoerika", "name": "Amanda Taylor", "url": "https://www.example.com/user_details?userid=santiagoerika", "average_stars": 3.5, "review_count": 12, "type": "user"}
{"votes": {"funny": 1, "useful": 8, "cool": 2}, "user_id": "rodriguezdennis", "name": "Jennifer Roach", "url": "http://www.example.com/user_details?userid=rodriguezdennis", "average_stars": 3.5, "review_count": 12, "type": "user"}

- Pi_

6 Answers

27

If you are using pandas and want to load the JSON file as a DataFrame, you can use:

import pandas as pd
df = pd.read_json('file.json', lines=True)

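To convert it back into a JSON array, you can use:

# orient='records' writes a JSON array of row objects;
# the default orient for a DataFrame writes a column-keyed object instead
df.to_json('new_file.json', orient='records')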

- mcgusty

2

In my opinion, this is the most Pythonic answer. - Green

20

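For anyone who stumbles upon this question: the Python jsonlines library (much younger than this question) elegantly handles files with one JSON document per line. See https://jsonlines.readthedocs.io/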

- wouter bolsterlee

0

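That is not well-formed JSON. You have one JSON object per line, but they are not contained in a larger data structure (i.e. an array). You either need to reformat the file so that it begins with [ and ends with ], with a comma at the end of every line except the last, or parse it line by line into separate dictionaries.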
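
The first option (wrapping the lines in an array) can be sketched with the standard library alone. This is a minimal illustration, not part of the original answer; the function name jsonl_to_array_text and the sample data are made up for the example, and it assumes the whole input fits in memory.

```python
import json

def jsonl_to_array_text(jsonl_text):
    """Turn JSON Lines text into the text of a single JSON array."""
    # Parse every non-blank line on its own, then serialize the list once.
    records = [json.loads(line) for line in jsonl_text.splitlines() if line.strip()]
    return json.dumps(records)

# Hypothetical two-line sample in the same shape as the question's file.
sample = '{"user_id": "a"}\n{"user_id": "b"}\n'
array_text = jsonl_to_array_text(sample)

# The result is one valid JSON value, so a single json.loads call now works.
parsed = json.loads(array_text)
```

Parsing and re-serializing is slower than editing the text by hand, but it avoids mistakes such as a trailing comma after the last object.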

- Daniel Roseman

24

With a 50MB file, it is probably better to process the data line by line anyway. :-) - Martijn Pieters

20

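Whether the file is malformed is a matter of opinion. If it was intended to follow the "JSON lines" format, it is valid. See: http://jsonlines.org/ - Mr. Lance E Sloan

I love how browsers casually throw away 2500MB, yet people don't want to spend 50MB on processing actual stuff. - doug65536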

0

Building on @arunppsg's answer, this adds multiprocessing to handle a large number of files in a directory.

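import os
import time
import multiprocessing as mp

import numpy as np
import pandas as pd

directory = 'your_directory'

def read_json(json_files):
    # Read every file in this chunk and concatenate once at the end
    # (DataFrame.append was removed in pandas 2.0).
    frames = []
    for j in json_files:
        with open(os.path.join(directory, j)) as f:
            # lines=True if the file holds one JSON object per line, False otherwise
            frames.append(pd.read_json(f, lines=True))
    return pd.concat(frames) if frames else pd.DataFrame()

def parallelize_json(json_files, func):
    # split the file list into 10 chunks and read the chunks in parallel
    json_files_split = np.array_split(json_files, 10)
    pool = mp.Pool(mp.cpu_count())
    df = pd.concat(pool.map(func, json_files_split))
    pool.close()
    pool.join()
    return df

if __name__ == '__main__':  # guard required when worker processes are spawned
    # Assumption: every *.json file in the directory should be read
    # (the original snippet did not define json_files).
    json_files = [name for name in os.listdir(directory) if name.endswith('.json')]

    # start the timer
    start = time.time()

    # read all json files in parallel
    df = parallelize_json(json_files, read_json)

    # end the timer
    end = time.time()

    # print the time taken to read all json files
    print(end - start)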

- Angus

0

Much like Martijn Pieters's answer, but perhaps a bit more Pythonic, and above all it allows the data to be streamed (see the second part of this answer).

import json

with open(filepath, "r") as f:
    data = list(map(json.loads, f))

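map(function, iterable) returns an iterator that applies function to every item of iterable, yielding the results (see the map() Python documentation).
list then turns this iterator into a list :)
But you can also use the iterator returned by map directly: it iterates over each JSON line. Note that in this case you must do the work inside the with open(filepath, "r") as f context. That is the advantage of this approach: the JSON lines are not all loaded into a list but are streamed, because map only reads the next line of the file when the for loop calls next(iterator).
This gives: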

import json

with open(filepath, "r") as f:
    iterator_over_lines = map(json.loads, f)

    # You can call next() yourself, just as the for loop does internally
    # (it raises StopIteration once the file has no more lines):
    next_jsonline = next(iterator_over_lines)
    nextnext_jsonline = next(iterator_over_lines)

    # just as you would do with a list, but here the file is streamed
    for jsonline in iterator_over_lines:
        pass  # do something for each remaining line
    # the mapped function, json.loads, is only called on each iteration,
    # which is why the file must stay open

I have nothing to add to Martijn's explanation of what jsonl (JSON Lines files) is and why you would use it!

- Ken

- Martijn Pieters · Accepted Answer

You have a JSON Lines format text file. You need to parse the file line by line:

import json

data = []
with open('file') as f:
    for line in f:
        data.append(json.loads(line))

Each line contains valid JSON, but as a whole the file is not a valid JSON value, because there is no top-level list or object definition.
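
The difference can be checked on a small in-memory sample. The sketch below is illustrative only; the two-line sample string and the variable names are made up, and it assumes Python 3.5 or later, where the error is a json.JSONDecodeError (a subclass of ValueError).

```python
import json

# Hypothetical two-line sample in JSON Lines form.
text = '{"user_id": "a", "review_count": 12}\n{"user_id": "b", "review_count": 7}\n'

# Parsing the whole text as one JSON value fails after the first object.
try:
    json.loads(text)
    whole_file_error = None
except json.JSONDecodeError as exc:
    whole_file_error = exc.msg  # the decoder's short message, without position info

# Parsing each line separately succeeds.
records = [json.loads(line) for line in text.splitlines()]
```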

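Note that because the file contains one JSON object per line, you are spared the headache of parsing everything in one go or of finding a streaming JSON parser. You can process each line separately before moving on to the next, which saves memory along the way. If your file is really big, you probably don't want to append every result to one list and only then process everything.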
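
As a rough sketch of that idea, a generator can hand out one record at a time so that only the current line is held in memory. The name iter_jsonl and the sample records are assumptions made for this illustration, and io.StringIO stands in for an open file so the snippet runs without any file on disk.

```python
import io
import json

def iter_jsonl(lines):
    """Yield one parsed object per non-blank line of a JSON Lines source."""
    for line in lines:
        if line.strip():  # skip blank lines, e.g. a trailing newline
            yield json.loads(line)

# Stand-in for open('file'); any iterable of text lines works the same way.
sample = io.StringIO(
    '{"user_id": "a", "votes": {"useful": 5}}\n'
    '{"user_id": "b", "votes": {"useful": 2}}\n'
)

# Aggregate while streaming: no list of all records is ever built.
total_useful = sum(record["votes"]["useful"] for record in iter_jsonl(sample))
```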

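If you have a file containing individual JSON objects with delimiters in between, see "How do I use the 'json' module to read in one JSON object at a time?" to parse out individual objects using a buffered method.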
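
A buffered approach along those lines might look like the sketch below. It is not the code from the linked answer: the function name iter_json_objects and the chunk_size parameter are hypothetical, and it assumes the objects are separated only by whitespace and that every top-level value is an object or an array (a bare number split across two chunks could be misread). It relies on json.JSONDecoder.raw_decode, which parses one value from the start of a string and reports where that value ended.

```python
import io
import json

def iter_json_objects(fileobj, chunk_size=4096):
    """Yield JSON values from a text stream of whitespace-separated values."""
    decoder = json.JSONDecoder()
    buffer = ""
    # Read fixed-size chunks until read() returns the empty string (EOF).
    for chunk in iter(lambda: fileobj.read(chunk_size), ""):
        buffer += chunk
        while True:
            buffer = buffer.lstrip()  # raw_decode does not skip leading whitespace
            if not buffer:
                break
            try:
                obj, end = decoder.raw_decode(buffer)
            except json.JSONDecodeError:
                break  # value is incomplete so far; read another chunk
            yield obj
            buffer = buffer[end:]
    if buffer.strip():
        raise ValueError("stream ended in the middle of a JSON value")

# Objects may span several lines or share a line; a tiny chunk size
# forces the buffering path to be exercised.
source = io.StringIO('{"a": 1}\n{"b": [1, 2,\n 3]}  {"c": {"d": null}}')
objects = list(iter_json_objects(source, chunk_size=5))
```

Re-trying the decode after every chunk keeps the code short; for very large individual objects a bigger chunk_size reduces the number of failed attempts.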