 

Performance-optimal way to serialise Python objects containing large Pandas DataFrames

I am dealing with Python objects containing Pandas DataFrame and Series objects and NumPy arrays. These can be large, with several million rows.

E.g.


from dataclasses import dataclass
import pandas as pd

@dataclass
class MyWorld:
    # A lot of DataFrames with millions of rows
    samples: pd.DataFrame
    addresses: pd.DataFrame
    # etc.

I need to cache these objects, and I am hoping to find an efficient and painless way to serialise them, instead of the standard pickle.dump(). Are there any specialised Python serialisers for such objects that would pickle the Series data with an efficient codec and compression automatically? Alternatively, I could hand-construct several Parquet files, but that requires a lot more manual code, and I'd rather avoid it if possible.

Performance here may mean

  • Speed
  • File size (can be related, as you need to read less from the disk/network)

I am aware of joblib.dump(), which does some magic for these kinds of objects, but based on the documentation I am not sure if it is still relevant.
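
For reference, the kind of usage I mean is roughly the following (the file name and compression level are just placeholders):

import joblib
import pandas as pd

world = MyWorld(
    samples=pd.DataFrame({"x": range(1_000_000)}),
    addresses=pd.DataFrame({"street": ["a", "b"]}),
)

# compress=3 applies zlib compression at level 3 to the stored buffers
joblib.dump(world, "myworld.joblib", compress=3)
world = joblib.load("myworld.joblib")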

Asked by Mikko Ohtamaa

2 Answers

What about storing the huge structures in Parquet format while pickling them? This can be automated fairly easily:

import io
from dataclasses import dataclass
import pickle
import numpy as np
import pandas as pd

@dataclass
class MyWorld:
    
    array: np.ndarray
    series: pd.Series
    frame: pd.DataFrame

@dataclass
class MyWorldParquet:
    
    array: np.ndarray
    series: pd.Series
    frame: pd.DataFrame
        
    def __getstate__(self):
        # Serialise each field to an in-memory Parquet buffer instead of
        # letting pickle handle the raw data. Build a separate state dict
        # so the live object is left untouched by pickling.
        state = {}

        for key, value in self.__annotations__.items():

            field = self.__dict__[key]

            # Parquet stores tables, so wrap arrays and Series in a DataFrame
            if value is np.ndarray:
                field = pd.DataFrame({"_": field})

            if value is pd.Series:
                field = field.to_frame()

            stream = io.BytesIO()
            field.to_parquet(stream)

            state[key] = stream

        return state

    def __setstate__(self, data):
        # Read each Parquet buffer back and undo the DataFrame wrapping
        self.__dict__.update(data)

        for key, value in self.__annotations__.items():

            self.__dict__[key] = pd.read_parquet(self.__dict__[key])

            if value is np.ndarray:
                self.__dict__[key] = self.__dict__[key]["_"].values

            if value is pd.Series:
                self.__dict__[key] = self.__dict__[key][self.__dict__[key].columns[0]]

Of course there is a trade-off between speed and size, since reducing the latter requires format conversion and compression.

Let's create a toy dataset:

N = 5_000_000
data = {
    "array": np.random.normal(size=N),
    "series": pd.Series(np.random.uniform(size=N), name="w"),
    "frame": pd.DataFrame({
        "c": np.random.choice(["label-1", "label-2", "label-3"], size=N),
        "x": np.random.uniform(size=N),
        "y": np.random.normal(size=N)
    })
}

We can compare the cost of the Parquet conversion (about 300 ms extra):

%timeit -r 10 -n 1 pickle.dumps(MyWorld(**data))
# 1.57 s ± 162 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

%timeit -r 10 -n 1 pickle.dumps(MyWorldParquet(**data))
# 1.9 s ± 71.3 ms per loop (mean ± std. dev. of 10 runs, 1 loop each)

And the size gain (about 40 MiB saved):

len(pickle.dumps(MyWorld(**data))) / 2 ** 20
# 200.28876972198486

len(pickle.dumps(MyWorldParquet(**data))) / 2 ** 20
# 159.13739013671875

Naturally, these metrics will depend strongly on the actual dataset being serialized.
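
As a quick sanity check, a round trip with the toy dataset above restores the original types:

restored = pickle.loads(pickle.dumps(MyWorldParquet(**data)))

print(type(restored.array))   # <class 'numpy.ndarray'>
print(type(restored.series))  # <class 'pandas.core.series.Series'>
print(type(restored.frame))   # <class 'pandas.core.frame.DataFrame'>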

Answered by jlandercy


This could help you. The article below describes various approaches, one of which is serializing the DataFrame to Parquet:

  • https://realpython.com/python-serialize-data/#binary-dataframes-parquet

A Medium article with serialization benchmarks:

  • https://towardsdatascience.com/faster-dataframe-serialization-75205b6b7c69

The above articles should give you an idea of how to approach the problem.
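
For example, a single DataFrame can be written to and read back from Parquet in a couple of lines (a minimal sketch; the file name and compression codec are just examples):

import pandas as pd

df = pd.DataFrame({"x": range(1_000_000), "label": "a"})

# writes a compressed, columnar file via pyarrow (or fastparquet)
df.to_parquet("samples.parquet", compression="gzip")
df2 = pd.read_parquet("samples.parquet")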

Also, you could try converting the DataFrame to JSON using df.to_json() and then loading it back as a DataFrame:

https://pandas.pydata.org/docs/reference/api/pandas.DataFrame.to_json.html
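
A minimal round trip would look like this (the file name and orient are just examples):

import pandas as pd

df = pd.DataFrame({"x": range(1_000_000)})

# orient="split" stores index, columns and data separately and round-trips cleanly
df.to_json("samples.json", orient="split")
df2 = pd.read_json("samples.json", orient="split")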

Answered by mohammed_ayaz