
Split nested JSON into two or more files using Python

Tags: python, json

I have a nested JSON file of about 180 MB with up to 280,000 entries. The data looks like this:

{ 
"images": [
     {"id": 0, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae", "width": 640, "height": 480, "priority": "high"}, 
     {"id": 1, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"}, 
     {"id": 2, "img_name": "animal.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae_a", "width": 640, "height": 480, "priority": "high"}, 
     {"id": 3, "img_name": "plant.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"}
  ],
"annotations": [
    {"id": 0, "image_id": 0, "bbox": [42.56565, 213.75443, 242.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0}, 
    {"id": 1, "image_id": 1, "bbox": [52.56565, 313.75443, 342.73315, 206.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0},
    {"id": 2, "image_id": 2, "bbox": [72.56565, 713.75443, 742.73315, 706.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}, 
    {"id": 3, "image_id": 3, "bbox": [12.56565, 113.75443, 142.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}
  ]
}

Note that in the actual file all the JSON data is on one line; I have split it across several lines here for readability.

My question is: how can I split this JSON file into smaller files, or even just two files? The file is nested, with two top-level keys, images and annotations. Each output file should keep the same hierarchy as above, meaning that an image and the annotation with the same ID must be stored together in the same file.

For example: the JSON data above has 4 entries for images and 4 entries for annotations. After splitting it into two files, each new file should contain 2 entries for images and 2 entries for annotations, as shown below.

JSON file_1 data:

{ 
"images": [
     {"id": 0, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae", "width": 640, "height": 480, "priority": "high"}, 
     {"id": 1, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"} 
  ],
"annotations": [
     {"id": 0, "image_id": 0, "bbox": [42.56565, 213.75443, 242.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0}, 
     {"id": 1, "image_id": 1, "bbox": [52.56565, 313.75443, 342.73315, 206.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 0}
  ]
}

JSON file_2 data:

{ 
"images": [
     {"id": 2, "img_name": "animal.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae_a", "width": 640, "height": 480, "priority": "high"}, 
     {"id": 3, "img_name": "plant.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish", "width": 640, "height": 480, "priority": "low"} 
  ],
"annotations": [
     {"id": 2, "image_id": 2, "bbox": [72.56565, 713.75443, 742.73315, 706.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}, 
     {"id": 3, "image_id": 3, "bbox": [12.56565, 113.75443, 142.73315, 106.09524], "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}
  ]
}

I checked many questions on Stack Overflow and GitHub but was unable to solve my problem. Some solutions exist, but not for nested JSON data.

Here is json-splitter on GitHub, but it does not work for nested JSON.

There is another question on Stack Overflow whose approach can work, but only for small files, because it is very difficult to specify IDs or data in order to delete entries one by one.

I tried the code below, from this GitHub post:

with open(sys.argv[1],'r') as infile:
    o = json.load(infile)
    chunkSize = 4550
    for i in xrange(0, len(o), chunkSize):
        with open(sys.argv[1] + '_' + str(i//chunkSize) + '.json', 'w') as outfile:
            json.dump(o[i:i+chunkSize], outfile)

but again it does not solve my problem. What am I missing? I know there are many questions and answers about this problem, but none of the solutions works in my case because of the nested data. I'm new to Python, and after a lot of work I'm still unable to solve this. I'm looking for suggestions and solutions. Thanks.

Erric asked Mar 30 '26 15:03


1 Answer

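The code below splits the data into files of NUM_OF_ENTRIES_IN_FILE entries each. It assumes the images and annotations lists are in the same order, so that entries at the same position belong together.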

import json

d = {
    "images": [
        {"id": 0, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae",
         "width": 640, "height": 480, "priority": "high"},
        {"id": 5, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish",
         "width": 640, "height": 480, "priority": "low"},
        {"id": 7, "img_name": "abc.jpg", "category": "plants", "sub-catgory": "sea-plants", "object_name": "algae",
         "width": 640, "height": 480, "priority": "high"},
        {"id": 9, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish",
         "width": 640, "height": 480, "priority": "low"},
        {"id": 99, "img_name": "xyz.jpg", "category": "animals", "sub-catgory": "sea-animals", "object_name": "fish",
         "width": 640, "height": 480, "priority": "low"}
    ],
    "annotations": [{"id": 0, "image_id": 0, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 1},
                    {"id": 5, "image_id": 5, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1},
                    {"id": 7, "image_id": 7, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "right", "camera_valid": 1},
                    {"id": 9, "image_id": 9, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1},
                    {"id": 99, "image_id": 99, "bbox": [42.56565, 213.75443, 242.73315, 106.09524],
                     "joints_valid": [[1], [1], [1], [1], [0], [0]], "camera": "left", "camera_valid": 1}
                    ]
}

NUM_OF_ENTRIES_IN_FILE = 2
# Assumes the images and annotations lists are in the same order,
# so that annotations[i] belongs to images[i].
num_images = len(d['images'])
for counter, start in enumerate(range(0, num_images, NUM_OF_ENTRIES_IN_FILE)):
    end = start + NUM_OF_ENTRIES_IN_FILE
    # Slicing past the end of a list is safe, so the last file simply
    # receives the leftover entries when the total is not a multiple
    # of NUM_OF_ENTRIES_IN_FILE.
    temp = {'images': d['images'][start:end],
            'annotations': d['annotations'][start:end]}
    # Files are numbered consecutively: out_0.json, out_1.json, ...
    with open(f'out_{counter}.json', 'w') as f:
        json.dump(temp, f)
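
The positional slicing above only keeps images and annotations together when both lists are in the same order with exactly one annotation per image. If that cannot be guaranteed, one variation is to select annotations by their image_id instead. The sketch below assumes the whole file fits in memory after json.load; split_by_image, write_parts, and the "out" file prefix are hypothetical names chosen for illustration and are not part of the original answer.

```python
import json
from collections import defaultdict


def split_by_image(data, images_per_file):
    """Split {'images': [...], 'annotations': [...]} into smaller dicts.

    Each part holds up to `images_per_file` images plus every annotation
    whose `image_id` refers to one of those images.
    """
    # Group annotations by the image they belong to, in a single pass.
    by_image = defaultdict(list)
    for ann in data["annotations"]:
        by_image[ann["image_id"]].append(ann)

    parts = []
    images = data["images"]
    for start in range(0, len(images), images_per_file):
        chunk = images[start:start + images_per_file]
        # Collect the annotations of each image in this chunk, in image order.
        annotations = [ann for img in chunk for ann in by_image.get(img["id"], [])]
        parts.append({"images": chunk, "annotations": annotations})
    return parts


def write_parts(parts, prefix="out"):
    """Write each part to <prefix>_<index>.json (illustrative naming)."""
    for index, part in enumerate(parts):
        with open(f"{prefix}_{index}.json", "w") as f:
            json.dump(part, f)
```

For the real file, one would load it once with json.load, pass the resulting dict to split_by_image with a suitable images_per_file, and hand the returned list to write_parts. This also suggests why the snippet in the question fails on this data: json.load returns a dict with two keys there, not a list, so it cannot be chunked by slicing directly.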
balderman answered Apr 02 '26 05:04


