 

hnswlib parameters for large datasets?

Tags:

python

knn

I am using the hnswlib library (https://github.com/nmslib/hnswlib) in Python to implement a fast KNN search, and I am wondering about parameter choices for large datasets.

I am using this benchmark from the official repository to test the library's behavior on large datasets (vector dimension of 256+ with 1+ million vectors): https://github.com/nmslib/hnswlib/blob/master/examples/example.py

Testing with small datasets of a few hundred thousand vectors, the recall results of this benchmark are quite good, usually around 0.9. Increasing to a million vectors, recall drops to about 0.7.

The authors of the library outline some general properties of its parameters (https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md), but finding a setup that leads to high recall on large data is time consuming, as index builds take a few hours and get even slower with larger parameter values.
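For reference, here is a minimal sketch of where those parameters plug into the Python API, following the pattern of example.py. The sizes, parameter values, and random data below are placeholders, not recommendations:

    import hnswlib
    import numpy as np

    dim, num_elements = 256, 100_000  # placeholder sizes; scale up to reproduce
    data = np.random.random((num_elements, dim)).astype(np.float32)

    index = hnswlib.Index(space='l2', dim=dim)
    # M and ef_construction are fixed at build time; raising either improves
    # recall but also increases the (already long) build time at this scale.
    index.init_index(max_elements=num_elements, ef_construction=200, M=16)
    index.add_items(data, np.arange(num_elements))

    # ef is a query-time knob and can be changed without rebuilding the index.
    index.set_ef(100)
    labels, distances = index.knn_query(data[:10], k=10)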

Are there some best-practice values for certain data dimensionalities or numbers of data points? I understand this library is quite popular, but I couldn't find any value recommendations.

asked Oct 23 '25 by user1091534
1 Answer

I believe this GitHub issue answers your question. The steps outlined there for discovering the best parameters for your use case are (sketched in code after the list):

  1. Start out with M=16 and ef_construction=200.
  2. Run benchmarks, iterating over ef until you get a recall >= 0.95.
  3. Re-index by setting ef_construction to the value discovered in step 2.
  4. If ef_construction > 1,000, increase M.
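Here is a hedged sketch of that loop, assuming L2 space and small placeholder sizes so the brute-force ground truth stays cheap; the 0.95 target and the tuning order come from the steps above, while the ef schedule and all dataset sizes are illustrative:

    import hnswlib
    import numpy as np

    dim, num_elements, k = 256, 100_000, 10  # placeholder sizes
    data = np.random.random((num_elements, dim)).astype(np.float32)
    queries = np.random.random((1_000, dim)).astype(np.float32)

    # Brute-force ground truth via ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2
    # (feasible here; on real million-scale data, precompute or subsample).
    d2 = (queries ** 2).sum(1)[:, None] - 2 * queries @ data.T + (data ** 2).sum(1)[None, :]
    truth = np.argpartition(d2, k, axis=1)[:, :k]

    # Step 1: build with the suggested starting parameters.
    index = hnswlib.Index(space='l2', dim=dim)
    index.init_index(max_elements=num_elements, ef_construction=200, M=16)
    index.add_items(data, np.arange(num_elements))

    # Step 2: sweep ef upward until recall reaches the target.
    for ef in (50, 100, 200, 400, 800, 1600):
        index.set_ef(ef)
        labels, _ = index.knn_query(queries, k=k)
        recall = np.mean([len(set(l) & set(t)) / k for l, t in zip(labels, truth)])
        print(f"ef={ef}: recall={recall:.3f}")
        if recall >= 0.95:
            break

    # Steps 3-4: rebuild with ef_construction set to the ef found above;
    # if that value exceeds ~1,000, increase M and repeat the sweep.

Note that ef is a query-time parameter and costs nothing to sweep, while changing M or ef_construction requires rebuilding the index, which is why the procedure tunes ef first.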
answered Oct 26 '25 by Leo B