 

hnswlib parameters for large datasets?

Tags:

python

knn

I am using the hnswlib library (https://github.com/nmslib/hnswlib) in Python to implement a fast KNN search, and I am wondering about parameter choices for large datasets.

I am using this benchmark from the official repository to test the library's behavior on large datasets (vector dimension of 256+ with 1+ million vectors): https://github.com/nmslib/hnswlib/blob/master/examples/example.py

Testing with small datasets of a few hundred thousand vectors, the recall results of this benchmark are quite good, usually around 0.9. Increasing to a million vectors, recall drops to about 0.7.

The authors of the library outline some general properties of its parameters (https://github.com/nmslib/hnswlib/blob/master/ALGO_PARAMS.md), but finding a setup that leads to high recall on large data is time consuming, as index builds take a few hours and get even slower with larger parameter values.
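For reference, here is a minimal sketch of where those parameters plug into the Python API, following the pattern of example.py. The sizes, parameter values, and random data below are placeholders, not recommendations:

    import hnswlib
    import numpy as np

    dim, num_elements = 256, 100_000  # placeholder sizes; scale up to reproduce
    data = np.random.random((num_elements, dim)).astype(np.float32)

    index = hnswlib.Index(space='l2', dim=dim)
    # M and ef_construction are fixed at build time; raising either improves
    # recall but also increases the (already long) build time at this scale.
    index.init_index(max_elements=num_elements, ef_construction=200, M=16)
    index.add_items(data, np.arange(num_elements))

    # ef is a query-time knob and can be changed without rebuilding the index.
    index.set_ef(100)
    labels, distances = index.knn_query(data[:10], k=10)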

Are there some best-practice values for certain data dimensionalities or numbers of data points? I understand this library is quite popular, but I couldn't find any value recommendations.

asked Oct 23 '25 by user1091534
1 Answer

I believe this GitHub issue answers your question. The steps outlined there for discovering the best parameters for your use case are (sketched in code after the list):

  1. Start out with M=16 and ef_construction=200.
  2. Run benchmarks, iterating over ef until you get a recall >= 0.95.
  3. Re-index by setting ef_construction to the value discovered in step 2.
  4. If ef_construction > 1,000, increase M.
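Here is a hedged sketch of that loop, assuming L2 space and small placeholder sizes so the brute-force ground truth stays cheap; the 0.95 target and the tuning order come from the steps above, while the ef schedule and all dataset sizes are illustrative:

    import hnswlib
    import numpy as np

    dim, num_elements, k = 256, 100_000, 10  # placeholder sizes
    data = np.random.random((num_elements, dim)).astype(np.float32)
    queries = np.random.random((1_000, dim)).astype(np.float32)

    # Brute-force ground truth via ||q - x||^2 = ||q||^2 - 2 q.x + ||x||^2
    # (feasible here; on real million-scale data, precompute or subsample).
    d2 = (queries ** 2).sum(1)[:, None] - 2 * queries @ data.T + (data ** 2).sum(1)[None, :]
    truth = np.argpartition(d2, k, axis=1)[:, :k]

    # Step 1: build with the suggested starting parameters.
    index = hnswlib.Index(space='l2', dim=dim)
    index.init_index(max_elements=num_elements, ef_construction=200, M=16)
    index.add_items(data, np.arange(num_elements))

    # Step 2: sweep ef upward until recall reaches the target.
    for ef in (50, 100, 200, 400, 800, 1600):
        index.set_ef(ef)
        labels, _ = index.knn_query(queries, k=k)
        recall = np.mean([len(set(l) & set(t)) / k for l, t in zip(labels, truth)])
        print(f"ef={ef}: recall={recall:.3f}")
        if recall >= 0.95:
            break

    # Steps 3-4: rebuild with ef_construction set to the ef found above;
    # if that value exceeds ~1,000, increase M and repeat the sweep.

Note that ef is a query-time parameter and costs nothing to sweep, while changing M or ef_construction requires rebuilding the index, which is why the procedure tunes ef first.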
answered Oct 26 '25 by Leo B