Efficient algorithm for embedding vector search

Background: I have a machine learning model that, given an object, returns an embedding vector of dimension d. The model is trained so that semantically similar objects produce embedding vectors that are close together. Verification is relatively simple: I can compute something like the cosine similarity of the two vectors. Recognition is a bit more complicated: I can either loop through all the anchor documents and compare cosine similarities, or use something like (online) kNN.

Problem: I have a list of N embedding vectors, each of dimension d, containing floating-point data.

What would be an efficient data structure and algorithm that can do the following:

  1. Add a new vector with a unique ID to the list efficiently (at most logarithmic complexity).
  2. Given a query vector, retrieve the top k vectors from the list with the smallest Manhattan distance (L1 norm) to the query, efficiently (ideally at most logarithmic complexity).

Example:

[
 [1., 2., 3.],
 [5., 6., 8.],
 [-11., 2., 31.]
]

k = 2, query = [1.5, 2.5, 3.2]

results:

[
 [1., 2., 3.],
 [5., 6., 8.],
]
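For reference, the expected results above can be reproduced with a brute-force top-k L1 search in NumPy. This is the O(N·d) linear scan the question is trying to beat, not an efficient solution; `vectors`, `query`, and `k` are taken from the example:

```python
import numpy as np

# Anchor vectors from the example (N = 3, d = 3)
vectors = np.array([
    [1., 2., 3.],
    [5., 6., 8.],
    [-11., 2., 31.],
])

query = np.array([1.5, 2.5, 3.2])
k = 2

# Manhattan (L1) distance from the query to every vector: O(N * d)
dists = np.abs(vectors - query).sum(axis=1)

# Indices of the k smallest distances, in ascending order of distance
top_k = np.argsort(dists)[:k]

print(vectors[top_k])  # rows [1., 2., 3.] and [5., 6., 8.]
```

Here the distances work out to 1.2, 11.8, and 40.8, so the first two rows are returned.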
Zabir Al Nazi asked Sep 05 '25 03:09

1 Answer

I think Faiss is exactly what you are looking for. The GitHub page is here; if you are interested in the implementation details (these are pretty technical), see here, and the tutorial is here.

Phoenix answered Sep 07 '25 17:09