 

How to parse or control the control flow of a JSON stream in Go?

Background

I have big (2 GiB < myfile < 10 GiB) JSON files that I need to parse. Due to the size of the files, I cannot read one into a variable and unmarshal it in one go.

This is why I am trying to use json.NewDecoder as shown in the example here.

A bit about data

The data I have looks roughly like the following:

{
   "key1" : [ "hundreds_of_nested_objects" ],
   "key2" : [ "hundreds_of_nested_objects" ],
   "unknown_unexpected_key" : [ "many_nested_objects" ],
   ........
   "keyN" : [ n_objects ]
}

The code I am trying to use

    file, err := os.Open("myfile.json")
    if err != nil {
        log.Fatal(err)
    }
    defer file.Close()

    dec := json.NewDecoder(file)
    // Read the stream token by token.
    for {
        t, err := dec.Token()
        if err == io.EOF {
            break
        }
        if err != nil {
            log.Fatal(err)
        }
        fmt.Printf("%T: %v\n", t, t)
    }

Problem Statement

  1. How should I approach this kind of data structure with the json.NewDecoder?
  2. What are the best practices to deal with this kind of problem?
  3. Pointing out any existing similar code would be helpful.

To clarify, a few use cases could be the following:

  • Parse only key1 or keyN instead of the whole file.
  • Grab only a specific key and find some nested object of that key.
  • Dump the contents of keys or some objects inside them to another file.

[N.B.] I am new to development, so my question might be too broad. Any guidance on improving it would be helpful too.

asked Oct 28 '25 by arif


1 Answer

Use a streaming decoder

For starters, when you use json.Unmarshal you provide it all the bytes of the JSON input, so you have to read the entire source file before you can even start to allocate memory for your Go representation of the data.

I hardly ever use json.Unmarshal. Use json.NewDecoder instead, which streams the data into the unmarshaler bit by bit (a sketch follows at the end of this section).

You'd still have to fit the whole decoded representation in memory (at least the parts you modeled), but depending on the data, that may be quite a bit less memory than the JSON text itself.

For example, in JSON, numbers are represented as strings of digit characters, but they often fit into much smaller int or float64 values. Booleans and nulls are also much bigger in JSON than in their Go representation. The structural characters []{}: won't take as much space in Go's in-memory types either. And of course any whitespace in the JSON does nothing but make the file larger. (I'd recommend minifying the JSON to remove unnecessary whitespace, but that only matters for storage and won't have much effect once you're streaming the data.)
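
Given the shape of the data in the question (a single top-level object whose values are arrays), a minimal sketch of the streaming approach might look like the following. The Item struct, its field, and the "key1" filter are illustrative assumptions; the point is that only one array element is decoded into memory at a time.

    package main

    import (
        "encoding/json"
        "fmt"
        "log"
        "os"
    )

    // Item is a hypothetical model of one nested object; replace the
    // fields with whatever the real objects contain.
    type Item struct {
        Name string `json:"name"`
    }

    func main() {
        file, err := os.Open("myfile.json")
        if err != nil {
            log.Fatal(err)
        }
        defer file.Close()

        dec := json.NewDecoder(file)

        // Consume the opening '{' of the top-level object.
        if _, err := dec.Token(); err != nil {
            log.Fatal(err)
        }
        for dec.More() {
            keyTok, err := dec.Token() // the next top-level key, e.g. "key1"
            if err != nil {
                log.Fatal(err)
            }
            key := keyTok.(string)

            // Consume the opening '[' of this key's array value.
            if _, err := dec.Token(); err != nil {
                log.Fatal(err)
            }
            // Stream the array one element at a time.
            for dec.More() {
                var item Item
                if err := dec.Decode(&item); err != nil {
                    log.Fatal(err)
                }
                if key == "key1" { // only act on the key we care about
                    fmt.Printf("%s: %+v\n", key, item)
                }
            }
            // Consume the closing ']' of the array.
            if _, err := dec.Token(); err != nil {
                log.Fatal(err)
            }
        }
    }

The same loop structure covers the question's other use cases: change the filter to a different key, or write the decoded elements to another file instead of printing them.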

Model your data with structs and omit as much data from your model as possible

If there's a lot of JSON data that isn't relevant to your operation, omit it from your models.

You can't do this if you're letting the decoder decode into generic types like map[string]interface{}. But if you use the struct-based decoding mechanisms, you can specify which fields you want to store, which might significantly decrease the size of your representation in memory. Combined with the streaming decoder, that might solve your memory constraint.

Clearly some of your data has unknown keys, so you can't store all of it in a structure. If you can't sufficiently define the structures that match your data, this option's off the table.
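
When the struct approach does fit, here's a small illustration of the difference (the field names and input are hypothetical): only the fields declared on the struct are populated, and everything else in the JSON is skipped during decoding, whereas map[string]interface{} would keep it all.

    package main

    import (
        "encoding/json"
        "fmt"
        "log"
    )

    // Record is a hypothetical model: only the fields declared here are
    // populated; any other field in the JSON input is silently ignored.
    type Record struct {
        ID   int    `json:"id"`
        Name string `json:"name"`
    }

    func main() {
        data := []byte(`{"id": 1, "name": "a", "huge_blob": ["lots", "of", "stuff"]}`)

        var r Record
        if err := json.Unmarshal(data, &r); err != nil {
            log.Fatal(err)
        }
        // huge_blob was never stored, so it costs nothing in memory.
        fmt.Printf("%+v\n", r)
    }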

Add memory or swap to your system

Memory is cheap these days, and disk space for swap is even cheaper. If you can get away with adding either to alleviate your memory constraint, so you can fit all the data's representation in memory, that's by far the simplest solution to your problem.

Convert data to a format that can be more efficiently accessed in storage

JSON is a great format and one of my favorites, but it's not very good for storing large volumes of data, because it's cumbersome to access subsets of the data at a time.

Formats like Parquet store data in a way that makes the underlying storage much more efficient to query and navigate. This means that instead of reading all the data and then representing it all in memory, you can read just the parts you want directly from on-disk storage.

The same would be true of a well-indexed SQL or NoSQL database.

You could even break your data down into multiple JSON files that you can read and process sequentially (see the sketch below).
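
For that last option, the streaming decoder can do the splitting itself. A rough sketch, assuming the same one-object-of-arrays shape and that the keys are usable as file names: it writes each top-level key's elements into a separate <key>.json file as newline-delimited JSON, holding only one element in memory at a time.

    package main

    import (
        "encoding/json"
        "fmt"
        "log"
        "os"
    )

    // A rough splitter: stream the big file and write each top-level key's
    // array elements into its own <key>.json file, one element at a time.
    func main() {
        in, err := os.Open("myfile.json")
        if err != nil {
            log.Fatal(err)
        }
        defer in.Close()

        dec := json.NewDecoder(in)
        if _, err := dec.Token(); err != nil { // opening '{'
            log.Fatal(err)
        }
        for dec.More() {
            keyTok, err := dec.Token()
            if err != nil {
                log.Fatal(err)
            }
            key := keyTok.(string) // assumes keys are valid file names

            out, err := os.Create(key + ".json")
            if err != nil {
                log.Fatal(err)
            }
            enc := json.NewEncoder(out)

            if _, err := dec.Token(); err != nil { // opening '['
                log.Fatal(err)
            }
            for dec.More() {
                // json.RawMessage keeps one element verbatim, without
                // modeling its structure.
                var elem json.RawMessage
                if err := dec.Decode(&elem); err != nil {
                    log.Fatal(err)
                }
                if err := enc.Encode(elem); err != nil {
                    log.Fatal(err)
                }
            }
            if _, err := dec.Token(); err != nil { // closing ']'
                log.Fatal(err)
            }
            out.Close()
            fmt.Println("wrote", key+".json")
        }
    }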

Last resort: implement your own scanning functionality

If you really can't (or won't) add memory or swap space, change the format of the data, or break it into smaller parts, then you have to write a JSON scanner that keeps track of your location in the JSON file so that you know how to process the data. You won't have a representation of all the data at once, but you may be able to pick out just the pieces you need as they stream past.

It's complicated, though, and specific to the task at hand; there's no generic answer for how to do it.

answered Oct 30 '25 by Daniel Farrell


