I am using the hnswlib library (https://github.com/nmslib/hnswlib) in Python to implement a fast KNN search, and I am wondering about parameter choices for large datasets.
I am using this benchmark from the official repository to test the library's behavior on large datasets (vector dimension of 256+ with 1 million+ vectors): https://github.com/nmslib/hnswlib/blob/master/examples/example.py
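For reference, this is roughly what I am running, simplified from that example script; the random data, the fixed ef=50, and the self-query recall measurement are just stand-ins for my actual setup:

```python
import numpy as np
import hnswlib

dim = 256
num_elements = 1_000_000
data = np.float32(np.random.random((num_elements, dim)))

# Build the index with the default-ish construction parameters.
index = hnswlib.Index(space='l2', dim=dim)
index.init_index(max_elements=num_elements, ef_construction=200, M=16)
index.add_items(data)

# Query the index with the data itself (k=1) and measure how often
# each point is returned as its own nearest neighbor, as in the example.
index.set_ef(50)
labels, distances = index.knn_query(data, k=1)
recall = np.mean(labels.reshape(-1) == np.arange(num_elements))
print(f"recall: {recall:.3f}")
```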
Testing with small datasets of a few hundred thousand vectors, the recall results of this benchmark are quite good, usually around 0.9. Increasing to a million vectors, recall drops to about 0.7.
The authors of the library outline some general properties of its parameters (https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md), but finding a setup that leads to high recall on large data is time consuming: index build times already take a few hours and increase further with larger parameter values.
Are there any best-practice values for a given data dimensionality or number of data points? I understand this library is quite popular, but I couldn't find any value recommendations.
I believe this GitHub issue answers your question. The steps outlined there for discovering the best parameters for your use case are:
1. Start with M=16 and ef_construction=200.
2. Increase ef until you get a recall >= 0.95.
3. Set ef_construction to the value discovered in step 2.
4. If ef_construction > 1,000, increase M.
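To make that concrete, here is a rough sketch of the procedure using the hnswlib Python bindings. The random data, the query sample, the brute-force ground truth, the ef grid, and the recall helper are all illustrative stand-ins, not part of the library or the linked issue; plug in your own vectors and query set.

```python
import numpy as np
import hnswlib

# Illustrative data; substitute your real vectors (dim 256+, 1M+ points).
dim, num_elements, k = 256, 100_000, 10
data = np.float32(np.random.random((num_elements, dim)))
queries = data[:1000]  # sample of queries to evaluate recall on

# Brute-force ground truth for the query sample (exact k-NN under L2).
d2 = ((queries ** 2).sum(1)[:, None]
      + (data ** 2).sum(1)[None, :]
      - 2.0 * queries @ data.T)
true_labels = np.argpartition(d2, k, axis=1)[:, :k]

def build_index(M, ef_construction):
    """Build an hnswlib index with the given construction parameters."""
    index = hnswlib.Index(space='l2', dim=dim)
    index.init_index(max_elements=num_elements,
                     ef_construction=ef_construction, M=M)
    index.add_items(data)
    return index

def recall(index, ef):
    """Fraction of true k-NN labels recovered at the given query-time ef."""
    index.set_ef(ef)
    labels, _ = index.knn_query(queries, k=k)
    hits = sum(len(set(row) & set(truth))
               for row, truth in zip(labels, true_labels))
    return hits / true_labels.size

# Step 1: start with M=16 and ef_construction=200.
index = build_index(M=16, ef_construction=200)

# Step 2: increase ef until recall on the query sample reaches 0.95.
target_ef = None
for ef in (50, 100, 200, 400, 800, 1600):
    r = recall(index, ef)
    print(f"ef={ef}: recall={r:.3f}")
    if r >= 0.95:
        target_ef = ef
        break

# Steps 3-4: rebuild with ef_construction set to that ef; if it had to
# exceed ~1,000, increase M (e.g. 32 or 48) and repeat the ef sweep.
if target_ef is not None and target_ef <= 1000:
    index = build_index(M=16, ef_construction=target_ef)
else:
    index = build_index(M=32, ef_construction=200)
```

Since build times are the bottleneck for you, measuring recall on a fixed query sample like this lets you run the ef sweep (step 2) against a single index before committing to the longer rebuilds in steps 3 and 4.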