Split a nested JSON file into two or more files using Python

3
I have a nested JSON file of 180 MB that contains more than 280,000 entries. The data in my JSON file looks like this:
{ 
"images": [
     {"id": 0, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae", "width": 640, "height": 480, "priority": "high"}, 
     {"id": 1, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"}, 
     {"id": 2, "img_name": "animal.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae_a", "width": 640, "height": 480, "priority": "high"}, 
     {"id": 3, "img_name": "plant.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"}
  ],
"annotations": [
    {"id": 0, "image_id": 0, "bbox": [42.56565, 213.75443, 242.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0}, 
    {"id": 1, "image_id": 1, "bbox": [52.56565, 313.75443, 342.73315, 206.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0},
    {"id": 2, "image_id": 2, "bbox": [72.56565, 713.75443, 742.73315, 706.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}, 
    {"id": 3, "image_id": 3, "bbox": [12.56565, 113.75443, 142.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}
  ]
}
Note that all of the JSON data is on a single line in the file; I have broken it into separate lines here for readability.

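My question is: how can I split the data in this JSON file into smaller files, or even just two files? My JSON file is nested, with two main categories, images and annotations. The split files must keep the same hierarchy as above (that is, images and annotations with the same ID must be stored in the same file).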

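For example, in the JSON data above, images has 4 entries and annotations has 4 entries. After splitting it into two files, the data in the new files should look like this (each new file has 2 entries in images and 2 in annotations):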

Data in JSON file_1:

{ 
"images": [
     {"id": 0, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae", "width": 640, "height": 480, "priority": "high"}, 
     {"id": 1, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"} 
  ],
"annotations": [
     {"id": 0, "image_id": 0, "bbox": [42.56565, 213.75443, 242.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0}, 
     {"id": 1, "image_id": 1, "bbox": [52.56565, 313.75443, 342.73315, 206.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0}
  ]
}

Data in JSON file_2:

{ 
"images": [
     {"id": 2, "img_name": "animal.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae", "width": 640, "height": 480, "priority": "high"}, 
     {"id": 3, "img_name": "plant.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"} 
  ],
"annotations": [
     {"id": 2, "image_id": 2, "bbox": [72.56565, 713.75443, 742.73315, 706.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}, 
     {"id": 3, "image_id": 3, "bbox": [12.56565, 113.75443, 142.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}
  ]
}

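I have looked at many questions on Stack Overflow and GitHub, but none of them solved my problem. Some of the solutions work, but not for nested JSON data.

This is the json-splitter on GitHub; it cannot handle nested JSON.

Another Stack Overflow question offers a solution, but only for small files, because it is hard to specify particular IDs or data in order to delete entries one by one.

I tried the code below, taken from this GitHub post.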

import json
import sys

with open(sys.argv[1], 'r') as infile:
    o = json.load(infile)
    chunkSize = 4550
    # range replaces Python 2's xrange; the loop still expects o to be a list
    for i in range(0, len(o), chunkSize):
        with open(sys.argv[1] + '_' + str(i//chunkSize) + '.json', 'w') as outfile:
            json.dump(o[i:i+chunkSize], outfile)

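However, this still does not solve my problem. What am I missing? I know there are many questions and answers on this topic, but because my data is nested, none of the solutions work in my case. I am new to Python, and after a lot of effort I still cannot solve this. I am looking for useful suggestions and solutions. Thanks.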
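On the "what am I missing" part: the snippet above was written for a JSON file whose top level is a list. Here the top level is an object, so json.load returns a dict with two keys. The following is a minimal sketch of what that means, using a tiny stand-in for the real data (the values are illustrative and not taken from the file):

```python
# Tiny stand-in for the loaded file; the real data has the same top-level shape.
o = {
    "images": [{"id": 0}, {"id": 1}, {"id": 2}, {"id": 3}],
    "annotations": [{"id": 0, "image_id": 0}, {"id": 1, "image_id": 1},
                    {"id": 2, "image_id": 2}, {"id": 3, "image_id": 3}],
}

# len() of a dict counts its keys, not the entries inside the lists.
top_level_len = len(o)
image_count = len(o["images"])

# A dict cannot be sliced like a list. Depending on the Python version this
# raises TypeError (unhashable slice) or KeyError (no such key).
try:
    o[0:2]
    slicing_failed = False
except (TypeError, KeyError):
    slicing_failed = True

# The lists stored under each key are what can be sliced.
first_two_images = o["images"][0:2]

print(top_level_len, image_count, slicing_failed, first_two_images)
```

So the chunking has to be applied to o["images"] and o["annotations"] separately, and the pieces put back under the same two keys.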


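"My question is how can I split this JSON file data into small files or even two files?" In what way do you want to split it? You can use the "key" iterator to loop over the keys. - Swedgin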
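Some reading material: https://realpython.com/iterate-through-dictionary-python/ - Swedgin
Sorry, Erric, I don't understand what you mean. Going by your comment, you want to go from 1 file to 4 files? 50% images, 50% images, 50% annotations, 50% annotations? This "he splits the nested data, but I want to split my JSON file with the nested data" is confusing; I can't tell what you want to do. I suggest you make a limited JSON object (say, with only 2-4 items in images and annotations) and show how you want it split. (And use proper indentation for the JSON object.) - Swedgin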
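@Swedgin, thank you for the suggestion. I have updated my question with details and an example. I hope this makes my point clear. - Erric
@Swedgin This data is just an example; my main question is how to do the split. You can take the data as correct, since I copied it from the file. - Erric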
2 Answers

2
The code below will do the split for you.
import json

d = {
    "images": [
        {"id": 0, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae",
         "width": 640, "height": 480, "priority": "high"},
        {"id": 5, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish",
         "width": 640, "height": 480, "priority": "low"},
        {"id": 7, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae",
         "width": 640, "height": 480, "priority": "high"},
        {"id": 9, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish",
         "width": 640, "height": 480, "priority": "low"},
        {"id": 99, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish",
         "width": 640, "height": 480, "priority": "low"}
    ],
    "annotations": [{"id": 0, "image_id": 0, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 1},
                    {"id": 5, "image_id": 5, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1},
                    {"id": 7, "image_id": 0, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 1},
                    {"id": 9, "image_id": 5, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1},
                    {"id": 99, "image_id": 5, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}
                    ]
}

NUM_OF_ENTRIES_IN_FILE = 2

# Assumes the images and annotations lists are sorted the same way, so that
# entries at the same position belong together.
for counter, start in enumerate(range(0, len(d['images']), NUM_OF_ENTRIES_IN_FILE)):
    end = start + NUM_OF_ENTRIES_IN_FILE
    # The last chunk is simply shorter when the total is not a multiple
    # of NUM_OF_ENTRIES_IN_FILE, so no separate remainder step is needed.
    temp = {'images': d['images'][start:end],
            'annotations': d['annotations'][start:end]}
    with open(f'out_{counter}.json', 'w') as f:
        json.dump(temp, f)
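The code above relies on images and annotations lining up by position. If that is not guaranteed (for instance, several annotations per image, or a different order), one option is to chunk the images and look up each image's annotations by image_id. This is only a sketch under that assumption; split_by_images, write_parts, and the part_ file prefix are names made up for illustration, and the sample data is not from the real file:

```python
import json
from collections import defaultdict


def split_by_images(data, images_per_file):
    """Yield dicts shaped like `data`, each holding up to images_per_file images."""
    # Index annotations by the image they refer to, so list order does not matter.
    by_image = defaultdict(list)
    for ann in data["annotations"]:
        by_image[ann["image_id"]].append(ann)

    images = data["images"]
    for start in range(0, len(images), images_per_file):
        chunk = images[start:start + images_per_file]
        yield {
            "images": chunk,
            "annotations": [a for img in chunk for a in by_image.get(img["id"], [])],
        }


def write_parts(data, images_per_file, prefix="part"):
    """Write each chunk to <prefix>_<n>.json and return the number of files."""
    count = 0
    for count, part in enumerate(split_by_images(data, images_per_file), start=1):
        with open(f"{prefix}_{count - 1}.json", "w") as f:
            json.dump(part, f)
    return count


# Small illustrative input: image 0 has two annotations, and the
# annotations are not in the same order as the images.
sample = {
    "images": [{"id": 0}, {"id": 5}, {"id": 7}],
    "annotations": [
        {"id": 0, "image_id": 0},
        {"id": 1, "image_id": 5},
        {"id": 2, "image_id": 0},
        {"id": 3, "image_id": 7},
    ],
}
parts = list(split_by_images(sample, 2))
print(parts)
```

For the real file, the dict returned by json.load would be passed to write_parts together with the desired number of images per file. Annotations whose image_id matches no image are left out of every part in this sketch.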

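While code-only answers may answer the question, you can significantly improve the quality of your answer by providing context for your code, an explanation of why it works, and some references. From [answer]: "Brevity is acceptable, but fuller explanations are better." - Pranav Hosangadi
@balderman, I don't understand why you added NUM_OF_ENTRIES_IN_FILE and counter. If I run this code, I get 3 JSON files with the same data. Also, your idea amounts to creating the new JSON files by hand, as if I had to copy some data out of the original file and then use it in a new file... Thanks for the idea, though; I just used it to create a new test file, because I only want some of the data from the original file to save processing time, etc. This much code is enough: temp = {'images': d['images'], 'annotations': d['annotations']}, and then just add with open(f"file.json", "w")...... - Erric
The data in the 3 files is different; look at the ids. - balderman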
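For the smaller test file mentioned in the comment above, one way to take only part of the original is to keep the first few images and then filter the annotations down to those images. A rough sketch, where take_sample and the stand-in data are hypothetical and only for illustration:

```python
def take_sample(data, n_images):
    """Return a new dict with the first n_images images and their annotations."""
    images = data["images"][:n_images]
    # Set of ids for fast membership tests on a large annotations list.
    wanted = {img["id"] for img in images}
    annotations = [a for a in data["annotations"] if a["image_id"] in wanted]
    return {"images": images, "annotations": annotations}


# Stand-in for the dict that json.load would return for the real file.
d = {
    "images": [{"id": 0}, {"id": 1}, {"id": 2}, {"id": 3}],
    "annotations": [
        {"id": 0, "image_id": 0},
        {"id": 1, "image_id": 1},
        {"id": 2, "image_id": 2},
        {"id": 3, "image_id": 3},
    ],
}
small = take_sample(d, 2)
print(small)
```

The result has the same two-key layout as the original, so it can be written out with json.dump and used wherever the full file would be.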

1

I added print statements so that you can see which step the code is on, since it may take a while to run.

import json

print("start")

with open("YOURFILE.json", "r") as f:
    data = json.load(f)

print("loaded")

with open("images.json", "w") as f:
    json.dump(data["images"], f)

print("copied images")

with open("annotations.json", "w") as f:
    json.dump(data["annotations"], f)

print("finished")

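Splitting the nested data into separate single-category files is a nice idea, but as I mentioned above, I want each new file to contain the nested data. You could say 50% of the "images" data and 50% of the "annotations" data in each new file. - Erric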
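Whichever way the split is done, the requirement described here (each file holds images together with their own annotations) can be checked mechanically. A small sketch of such a check, where orphan_annotations is a made-up helper name and the two example parts are illustrative:

```python
def orphan_annotations(part):
    """Return ids of annotations whose image_id is not among part['images']."""
    image_ids = {img["id"] for img in part["images"]}
    return [a["id"] for a in part["annotations"] if a["image_id"] not in image_ids]


# A part where every annotation points at an image in the same part.
good_part = {
    "images": [{"id": 0}, {"id": 1}],
    "annotations": [{"id": 0, "image_id": 0}, {"id": 1, "image_id": 1}],
}

# A part where the annotations point at images stored elsewhere.
bad_part = {
    "images": [{"id": 7}, {"id": 9}],
    "annotations": [{"id": 7, "image_id": 0}, {"id": 9, "image_id": 5}],
}

good_orphans = orphan_annotations(good_part)
bad_orphans = orphan_annotations(bad_part)
print(good_orphans, bad_orphans)
```

An empty list means the part is self-contained; any ids returned are annotations that ended up in a different file from their image.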

Content provided by Stack Overflow.