I am trying to create a machine learning library for sparse embedding training, so it needs fast reads/writes of millions of embeddings with dimensions ranging from 128 to 768, each dimension holding a float32 value.
There will only be a single column, with one row per embedding. I am not doing any embedding similarity search or anything like that; only the index # is needed to look up an embedding.
Each update step in training looks up and writes values to the data store, so I am looking for the fastest database for my situation. Having the parameters saved to disk would already significantly reduce RAM usage, so RAM is not a concern for me.
From my limited research, it looks like the top candidates are Parquet, HDF5, or some sort of SQL database.
If there are any other requirements needed to recommend the best datastore, let me know.
It is a broad topic, but the first thing to note is that if speed matters, the answer does not necessarily lie in the choice of database but in how often you write: if you keep the embeddings in RAM most of the time and flush batch updates to the store every so often, it will be much faster than writing on every step. Essentially a mix of cache and database, as sketched below.
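Here is a minimal sketch of that write-behind pattern. The `store` object, its `read`/`write_batch` methods, and the flush threshold are all hypothetical placeholders for whatever backing datastore you end up choosing:

```python
# Minimal write-behind cache sketch: updated embeddings stay in RAM
# and are flushed to the backing store in batches.
class WriteBehindCache:
    def __init__(self, store, flush_every=10_000):
        self.store = store          # assumed to expose read(idx) and write_batch(dict)
        self.dirty = {}             # index -> latest embedding, kept in RAM
        self.flush_every = flush_every

    def write(self, idx, embedding):
        self.dirty[idx] = embedding
        if len(self.dirty) >= self.flush_every:
            self.flush()

    def read(self, idx):
        # serve hot rows from RAM, fall back to the store
        if idx in self.dirty:
            return self.dirty[idx]
        return self.store.read(idx)

    def flush(self):
        if self.dirty:
            # one batched round trip instead of thousands of single writes
            self.store.write_batch(self.dirty)
            self.dirty.clear()
```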
Personally I would pick Redis, because it can act as both a cache and a persistent store on disk. Its configuration lets you choose between different persistence modes (periodic RDB snapshots are cheaper at runtime than an append-only file, which is more durable but heavier), so you can trade durability for speed. Redis is already faster than most options by design, and it has client libraries for pretty much every language out there.
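As an illustration, a sketch of how you might store float32 embeddings in Redis by index using the `redis-py` client. The key pattern `emb:{idx}` and the dimension are arbitrary choices, and this assumes a Redis server running locally:

```python
import numpy as np
import redis

r = redis.Redis(host="localhost", port=6379)  # assumes a local Redis server

DIM = 256  # embedding dimension; 128-768 in your case

def save_embedding(idx: int, vec: np.ndarray) -> None:
    # store the raw float32 bytes under a per-row key
    r.set(f"emb:{idx}", vec.astype(np.float32).tobytes())

def load_embedding(idx: int) -> np.ndarray:
    raw = r.get(f"emb:{idx}")
    return np.frombuffer(raw, dtype=np.float32)

def save_batch(updates: dict) -> None:
    # pipeline the writes so the whole batch is a single round trip
    with r.pipeline(transaction=False) as pipe:
        for idx, vec in updates.items():
            pipe.set(f"emb:{idx}", vec.astype(np.float32).tobytes())
        pipe.execute()
```

Pipelining the batched flush is what makes the cache-plus-Redis combination pay off, since per-command network latency dominates for small values like these.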
Also, it may seem obvious, but if you can, keep the cache/database on the same machine as your application to avoid network round trips and gain even more speed.