I have a nested JSON file that is 180 MB in size and contains more than 280,000 entries.
The data in my JSON file looks like this:
{
"images": [
{"id": 0, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae", "width": 640, "height": 480, "priority": "high"},
{"id": 1, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"},
{"id": 2, "img_name": "animal.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae_a", "width": 640, "height": 480, "priority": "high"},
{"id": 3, "img_name": "plant.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"}
],
"annotations": [
{"id": 0, "image_id": 0, "bbox": [42.56565, 213.75443, 242.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0},
{"id": 1, "image_id": 1, "bbox": [52.56565, 313.75443, 342.73315, 206.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0},
{"id": 2, "image_id": 2, "bbox": [72.56565, 713.75443, 742.73315, 706.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1},
{"id": 3, "image_id": 3, "bbox": [12.56565, 113.75443, 142.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}
]
}
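Please note that all of the JSON data is actually on a single line; I have broken it onto separate lines (4 entries per array) here for readability.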
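My question is: how can I split the data in this JSON file into smaller files, or even just two files? My JSON file is nested and has two top-level keys, images and annotations. Each split file must keep the same hierarchy as above, and each image must be stored in the same file as the annotations that refer to its ID.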
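For example, in the JSON data above, images has 4 entries and annotations also has 4 entries. After splitting it into two files, the data in the newly generated files should look like this (each new file has 2 entries in images and 2 entries in annotations).
Data in JSON File_1: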
{
"images": [
{"id": 0, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae", "width": 640, "height": 480, "priority": "high"},
{"id": 1, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"}
],
"annotations": [
{"id": 0, "image_id": 0, "bbox": [42.56565, 213.75443, 242.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0},
{"id": 1, "image_id": 1, "bbox": [52.56565, 313.75443, 342.73315, 206.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0}
]
}
Data in JSON File_2:
{
"images": [
{"id": 2, "img_name": "animal.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae_a", "width": 640, "height": 480, "priority": "high"},
{"id": 3, "img_name": "plant.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"}
],
"annotations": [
{"id": 2, "image_id": 2, "bbox": [72.56565, 713.75443, 742.73315, 706.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1},
{"id": 3, "image_id": 3, "bbox": [12.56565, 113.75443, 142.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}
]
}
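To make the requirement concrete, here is a minimal sketch of the check I would expect every split file to pass: each annotation's image_id must match the id of an image in the same file. The function name is_self_contained is only illustrative, and it assumes that image_id in annotations refers to id in images, as in the sample data.

```python
def is_self_contained(chunk):
    """Return True if every annotation refers to an image in the same chunk."""
    # Collect the ids of all images stored in this chunk.
    image_ids = {image["id"] for image in chunk["images"]}
    # Every annotation must point at one of those ids.
    return all(
        annotation["image_id"] in image_ids
        for annotation in chunk["annotations"]
    )
```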
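I have looked at many questions on Stack Overflow and GitHub, but none of them solved my problem. Some of the solutions work, but not for nested JSON data.
There is json-splitter on GitHub, but it cannot handle nested JSON.
Another Stack Overflow question has a solution, but it is only practical for small files, because it is hard to supply specific IDs or data to remove entries one by one.
I tried the following code from this GitHub post.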
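import json
import sys

with open(sys.argv[1], 'r') as infile:
    o = json.load(infile)
    chunkSize = 4550
    # Slicing o assumes the top-level JSON value is a list.
    for i in range(0, len(o), chunkSize):  # range replaces Python 2's xrange
        with open(sys.argv[1] + '_' + str(i//chunkSize) + '.json', 'w') as outfile:
            json.dump(o[i:i+chunkSize], outfile)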
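However, this still does not solve my problem. What am I missing? I know there are many questions and answers on this topic, but because of the nested data, none of the solutions work in my case. I am new to Python, and after a lot of effort I still cannot solve this. I would appreciate any suggestions and solutions. Thanks.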
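For reference, this is a rough sketch of the kind of split I am after, not a tested solution for the full file. It assumes the whole file fits in memory with json.load, that the top-level value is a dict with the keys images and annotations, and that each annotation's image_id matches an image's id. The names split_dataset, split_file and images_per_file are made up for illustration. Instead of slicing the top-level object, it slices the images list and then collects the annotations that belong to those images.

```python
import json
from collections import defaultdict


def split_dataset(data, images_per_file):
    """Split {"images": [...], "annotations": [...]} into smaller dicts.

    Each annotation goes into the chunk that holds the image whose "id"
    equals the annotation's "image_id" (an assumption about the data).
    """
    # Group annotations by the image they refer to.
    annotations_by_image = defaultdict(list)
    for annotation in data["annotations"]:
        annotations_by_image[annotation["image_id"]].append(annotation)

    chunks = []
    images = data["images"]
    for start in range(0, len(images), images_per_file):
        image_chunk = images[start:start + images_per_file]
        # Keep only the annotations that point at images in this chunk.
        annotation_chunk = [
            annotation
            for image in image_chunk
            for annotation in annotations_by_image.get(image["id"], [])
        ]
        chunks.append({"images": image_chunk, "annotations": annotation_chunk})
    return chunks


def split_file(path, images_per_file):
    """Write path_0.json, path_1.json, ... and return the file names."""
    with open(path, "r") as infile:
        data = json.load(infile)
    written = []
    for index, chunk in enumerate(split_dataset(data, images_per_file)):
        out_path = "{}_{}.json".format(path, index)
        with open(out_path, "w") as outfile:
            json.dump(chunk, outfile)
        written.append(out_path)
    return written
```

With the 4-entry sample above, split_dataset(data, 2) would give two dicts with 2 images and 2 annotations each. Annotations whose image_id matches no image are dropped by this sketch, which may or may not be what is wanted.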