I have a pandas DataFrame that I want to query often (in Ray, via an API). I'm trying to speed up loading it, but it takes significant time (3+ seconds) to convert it to pandas. For most of my datasets this is fast, but not for this one. My guess is that it's because 90% of the columns are strings.
[742461 rows x 248 columns]
That's about 137 MB on disk. To eliminate disk speed as a factor, I've placed the .parq file on a tmpfs mount.
Now I've tried a few variations on the basic load, roughly like the sketch below.
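A minimal sketch of the kind of load being timed; the file path is a placeholder:

```python
import time
import pyarrow.parquet as pq

start = time.perf_counter()
table = pq.read_table("/mnt/tmpfs/data.parq")  # hypothetical tmpfs path
df = table.to_pandas()                         # this step dominates the time
print(f"to_pandas: {time.perf_counter() - start:.2f}s")
```

Note that `Table.to_pandas` already multithreads by default (`use_threads=True`), and it also has a `strings_to_categorical` flag that dictionary-encodes string columns on the fly, which seemed relevant here.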
Does anyone have any good tips on how to speed up the pandas conversion when dealing with strings? I have plenty of cores and RAM.
My end result needs to be a pandas DataFrame, so I'm not bound to the parquet file format, although it's generally my favourite.
Regards, Niklas
In the end I reduced the time by handling the data more carefully: mainly by removing blank values, making sure we had as many NA values as possible (instead of blank strings etc.), and converting every text column with less than 50% unique values to a categorical. A sketch of that preprocessing is below.
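Roughly like this, assuming a hypothetical df (the column names and the threshold check are illustrative):

```python
import pandas as pd

df = pd.DataFrame({"colour": ["red", "", "blue", "red", ""],
                   "note": ["a", "b", "c", "d", "e"]})

# Turn blank strings into real missing values
df = df.replace("", pd.NA)

# Make a categorical out of any text column with < 50% unique values
for col in df.select_dtypes(include="object"):
    if df[col].nunique() / len(df) < 0.5:
        df[col] = df[col].astype("category")
```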
I ended up generating the schemas via PyArrow so I could create categorical (dictionary) columns with a custom index size (int64 instead of int16), letting my categories hold more values. The data size was reduced by 50% in the end.
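A minimal sketch of what that looks like; the column name and values are placeholders:

```python
import pyarrow as pa
import pyarrow.parquet as pq

# Build a dictionary-encoded (categorical) string column with explicit
# int64 indices, so the category set can grow far beyond int16's range
indices = pa.array([0, 1, 0, 1, 0], type=pa.int64())
dictionary = pa.array(["red", "blue"], type=pa.string())
colour = pa.DictionaryArray.from_arrays(indices, dictionary)

table = pa.table({"colour": colour})
# table.schema: colour: dictionary<values=string, indices=int64, ordered=0>
pq.write_table(table, "data.parq")
```

On the reading side, `to_pandas` maps a dictionary column straight back to a pandas Categorical, which is both smaller and faster to materialise than a plain object column of strings.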