background: I have a machine learning model that, given an object, returns an embedding vector of dimension d. The model is trained so that semantically similar objects produce embedding vectors that are close to each other. Verification is relatively simple: I can compute something like the cosine similarity of the two vectors. Recognition is a bit more complicated: I can either loop through all the anchor documents and compare cosine similarities, or use something like (online) kNN.
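For the verification step, a minimal sketch of the cosine-similarity check with NumPy (the function name `cosine_similarity` is just illustrative):

```python
import numpy as np

def cosine_similarity(u, v):
    # cos(theta) = (u . v) / (|u| * |v|); ranges over [-1, 1],
    # where 1 means the vectors point in the same direction
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))

u = np.array([1.0, 2.0, 3.0])
v = np.array([1.5, 2.5, 3.2])
print(cosine_similarity(u, v))  # close to 1.0 for near-parallel vectors
```

Two embeddings would then be accepted as a match when this score exceeds a threshold tuned on validation data.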
problem: I have a list of N embedding vectors, each of dimension d and containing floating-point data.
What would be an efficient data structure + algorithm for retrieving the k vectors nearest to a given query vector?
example:
[
[1., 2., 3.],
[5., 6., 8.],
[-11., 2., 31.]
]
k = 2
query = [1.5, 2.5, 3.2]
results:
[
[1., 2., 3.],
[5., 6., 8.],
]
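The linear-scan approach mentioned in the background (looping over all anchors and comparing cosine similarities) can be sketched like this; `top_k_cosine` is an illustrative name, and the scan is O(N*d) per query, which is the cost an index structure is meant to beat:

```python
import numpy as np

def top_k_cosine(anchors, query, k):
    # normalize rows so a plain dot product equals cosine similarity
    a = anchors / np.linalg.norm(anchors, axis=1, keepdims=True)
    q = query / np.linalg.norm(query)
    sims = a @ q                  # one pass over all N anchors
    idx = np.argsort(-sims)[:k]   # indices of the k most similar anchors
    return anchors[idx]

anchors = np.array([[1.0, 2.0, 3.0],
                    [5.0, 6.0, 8.0],
                    [-11.0, 2.0, 31.0]])
query = np.array([1.5, 2.5, 3.2])
print(top_k_cosine(anchors, query, k=2))
```

On the example data this returns the same two vectors as in the expected results; for large N, the argsort can be replaced with `np.argpartition` to avoid a full sort.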
I think Faiss is exactly what you are looking for. The GitHub page is here; if you are interested in the implementation details (these are pretty technical), see here; and the tutorial is here.