Split a nested JSON file into two or more files using Python

Question

3

I have a nested JSON file of about 180 MB that contains more than 280,000 entries. The data in the file looks like this:

{ 
"images": [
     {"id": 0, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae", "width": 640, "height": 480, "priority": "high"}, 
     {"id": 1, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"}, 
     {"id": 2, "img_name": "animal.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae_a", "width": 640, "height": 480, "priority": "high"}, 
     {"id": 3, "img_name": "plant.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"}
  ],
"annotations": [
    {"id": 0, "image_id": 0, "bbox": [42.56565, 213.75443, 242.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0}, 
    {"id": 1, "image_id": 1, "bbox": [52.56565, 313.75443, 342.73315, 206.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0},
    {"id": 2, "image_id": 2, "bbox": [72.56565, 713.75443, 742.73315, 706.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}, 
    {"id": 3, "image_id": 3, "bbox": [12.56565, 113.75443, 142.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}
  ]
}

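Note that in the real file all of the JSON data is on a single line; I have split it across lines here only to make it easier to read.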

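My question is: how can I split this JSON file into smaller files, or even just two files? The file is nested and has two main categories, images and annotations. Each split file should keep the same hierarchy as above (that is, images and annotations with the same ID must be stored in the same file).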

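For example: in the JSON data above, images has 4 entries and annotations has 4 entries. After splitting into two files, the newly generated files should look like this (each new file has 2 entries in images and 2 in annotations):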

Data of JSON file_1:

{ 
"images": [
     {"id": 0, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae", "width": 640, "height": 480, "priority": "high"}, 
     {"id": 1, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"} 
  ],
"annotations": [
     {"id": 0, "image_id": 0, "bbox": [42.56565, 213.75443, 242.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0}, 
     {"id": 1, "image_id": 1, "bbox": [52.56565, 313.75443, 342.73315, 206.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0}
  ]
}

Data of JSON file_2:

{ 
"images": [
     {"id": 2, "img_name": "animal.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae_a", "width": 640, "height": 480, "priority": "high"}, 
     {"id": 3, "img_name": "plant.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"} 
  ],
"annotations": [
     {"id": 2, "image_id": 2, "bbox": [72.56565, 713.75443, 742.73315, 706.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}, 
     {"id": 3, "image_id": 3, "bbox": [12.56565, 113.75443, 142.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}
  ]
}
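
The requirement above (every annotation must end up in the same file as the image it belongs to) can be checked mechanically on any split file. A minimal sketch, assuming annotations refer to images through `image_id` as in the sample data; `is_self_consistent` is a hypothetical helper name, not part of any library:

```python
def is_self_consistent(part):
    """Return True if every annotation in `part` refers to an image in `part`."""
    image_ids = {img["id"] for img in part["images"]}
    return all(ann["image_id"] in image_ids for ann in part["annotations"])


# Tiny example shaped like JSON file_1 above (other fields omitted for brevity).
file_1 = {
    "images": [{"id": 0}, {"id": 1}],
    "annotations": [{"id": 0, "image_id": 0}, {"id": 1, "image_id": 1}],
}
print(is_self_consistent(file_1))
```

Running such a check on each generated file is a cheap way to confirm that a split did not separate an annotation from its image.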

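I have looked through many questions on Stack Overflow and GitHub, but none of them solved my problem. Some of the solutions work, but not for nested JSON data.

There is json-splitter on GitHub, but it cannot handle nested JSON.

Another Stack Overflow question has a solution, but it is only practical for small files, because it is hard to supply specific IDs or data to remove entries one by one.

I tried the code below, taken from this GitHub post.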

with open(sys.argv[1],'r') as infile:
    o = json.load(infile)
    chunkSize = 4550
    for i in xrange(0, len(o), chunkSize):
        with open(sys.argv[1] + '_' + str(i//chunkSize) + '.json', 'w') as outfile:
            json.dump(o[i:i+chunkSize], outfile)

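However, this still does not solve my problem. What am I missing? I know there are many questions and answers on this topic, but because of the nested data none of the solutions work in my case. I am new to Python, and after a lot of effort I still cannot solve this. I would appreciate any suggestions and solutions. Thanks.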
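
A likely reason the snippet above fails here: it was written for a JSON file whose top level is a list, whereas this file's top level is an object with two keys. After `json.load`, `o` is a dict, so `len(o)` counts the keys and `o[i:i+chunkSize]` is not a valid operation on it (in addition, `xrange` only exists in Python 2). A minimal sketch of the difference, using a tiny stand-in for the real data:

```python
# Stand-in for the parsed file: a dict whose two values are lists.
o = {"images": [{"id": 0}, {"id": 1}], "annotations": [{"id": 0, "image_id": 0}]}

print(len(o))            # 2: the number of top-level keys, not of entries
print(len(o["images"]))  # 2: the number of image entries

try:
    o[0:2]  # a dict cannot be sliced
except (TypeError, KeyError) as exc:
    # Which of the two exceptions is raised depends on the Python version.
    print(type(exc).__name__)

# Slicing has to be applied to the list stored under each key instead.
first_chunk = {key: values[0:2] for key, values in o.items()}
```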

- Erric

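"My question is how to split this JSON file into small files or even two files?" In what way do you want to split it? You can iterate over the keys with a "key" iterator. - Swedgin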

1

Some reading material: https://realpython.com/iterate-through-dictionary-python/ - Swedgin

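Sorry, Erric, I don't understand what you mean. Based on your comment, you want to go from 1 file to 4 files? 50% images, 50% images, 50% annotations, 50% annotations? This "he split the nested data, but I want to split my JSON file with nested data" is confusing, and I can't tell what you are trying to do. I suggest you make a small JSON object (say, with only 2-4 items each in images and annotations) and show how you want it split. (And use proper indentation for the JSON object.) - Swedgin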

1

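@Swedgin, thanks for the suggestion. I have updated my question with details and an example. I hope this makes my point clear. - Erric

@Swedgin The data is only an example; my main question is how to do the split. You can treat the data as correct, since I copied it from the file. - Erric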

2 Answers

1

I added print statements so you can see which step the code has reached, since it may take a while to run.

import json

print("start")

with open("YOURFILE.json", "r") as f:
    data = json.load(f)

print("loaded")

with open("images.json", "w") as f:
    json.dump(data["images"], f)

print("copied images")

with open("annotations.json", "w") as f:
    json.dump(data["annotations"], f)

print("finished")

- Contrean

1

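Splitting the nested data into one file per key is a nice idea, but as I mentioned above, I want each new file to keep the nested structure. In other words, each new file should hold 50% of the "images" data and 50% of the "annotations" data. - Erric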

- balderman · Accepted Answer

The code below will do the split for you.

import json

d = {
    "images": [
        {"id": 0, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae",
         "width": 640, "height": 480, "priority": "high"},
        {"id": 5, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish",
         "width": 640, "height": 480, "priority": "low"},
        {"id": 7, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae",
         "width": 640, "height": 480, "priority": "high"},
        {"id": 9, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish",
         "width": 640, "height": 480, "priority": "low"},
        {"id": 99, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish",
         "width": 640, "height": 480, "priority": "low"}
    ],
    "annotations": [{"id": 0, "image_id": 0, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 1},
                    {"id": 5, "image_id": 5, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1},
                    {"id": 7, "image_id": 0, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 1},
                    {"id": 9, "image_id": 5, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1},
                    {"id": 99, "image_id": 5, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}
                    ]
}

NUM_OF_ENTRIES_IN_FILE = 2
counter = 0
# Assumes the images and annotations lists are in the same order, so that
# entries at the same position belong together.
while (counter + 1) * NUM_OF_ENTRIES_IN_FILE <= len(d['images']):
    temp = {'images': d['images'][counter * NUM_OF_ENTRIES_IN_FILE: (counter + 1) * NUM_OF_ENTRIES_IN_FILE],
            'annotations': d['annotations'][counter * NUM_OF_ENTRIES_IN_FILE: (counter + 1) * NUM_OF_ENTRIES_IN_FILE]}
    with open(f'out_{counter}.json', 'w') as f:
        json.dump(temp, f)
    counter += 1
# Write any leftover entries that did not fill a complete file.
# counter already points at the next unused file index here.
remainder = len(d['images']) % NUM_OF_ENTRIES_IN_FILE
if remainder > 0:
    temp = {'images': d['images'][-remainder:],
            'annotations': d['annotations'][-remainder:]}
    with open(f'out_{counter}.json', 'w') as f:
        json.dump(temp, f)
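
The loop above pairs images and annotations purely by position, so it relies on both lists being in the same order with exactly one annotation per image. If that does not hold (in the sample `d` above, several annotations share `image_id` 0 or 5), one alternative is to chunk only the images and then collect whichever annotations point at them. A minimal sketch, assuming annotations link to images through `image_id`; `split_by_images` is an illustrative name, not part of the answer's code:

```python
from collections import defaultdict


def split_by_images(data, images_per_file):
    """Yield dicts holding a chunk of images plus the annotations that reference them."""
    # Index the annotations by the image they belong to (one pass over the list).
    by_image = defaultdict(list)
    for ann in data["annotations"]:
        by_image[ann["image_id"]].append(ann)

    images = data["images"]
    for start in range(0, len(images), images_per_file):
        chunk = images[start:start + images_per_file]  # last chunk may be shorter
        yield {
            "images": chunk,
            "annotations": [ann for img in chunk for ann in by_image.get(img["id"], [])],
        }


# Small demonstration with trimmed-down entries.
data = {
    "images": [{"id": 0}, {"id": 5}, {"id": 7}],
    "annotations": [
        {"id": 0, "image_id": 0},
        {"id": 5, "image_id": 5},
        {"id": 7, "image_id": 0},
    ],
}
parts = list(split_by_images(data, 2))
```

Each element of `parts` can then be written out with `json.dump`, for example to `out_{n}.json` as in the code above. Note that in this sketch an annotation whose `image_id` matches no image is silently left out of every file, and the whole input still has to fit in memory once.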