I'm currently working on an information retrieval task, using SBERT to perform semantic search. I have already followed the documentation here.
The model I use:
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')
The outline is:
data = ['A man is eating food.',
        'A man is eating a piece of bread.',
        'The girl is carrying a baby.',
        'A man is riding a horse.',
        'A woman is playing violin.',
        'Two men pushed carts through the woods.',
        'A man is riding a white horse on an enclosed ground.',
        'A monkey is playing drums.',
        'A cheetah is running behind its prey.']
query = 'A man is eating pasta.'
query_embedding = model.encode(query)
doc_embedding = model.encode(data)
The encode function outputs a numpy.ndarray. Computing the similarity:
similarity = util.cos_sim(query_embedding, doc_embedding)
tensor([[0.4389, 0.4288, 0.6079, 0.5571, 0.4063, 0.4432, 0.5467, 0.3392, 0.4293]])
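For reference, one way to turn these scores into ranked results is a top-k over the similarity tensor (a minimal sketch; k=3 is an arbitrary choice here):

import torch

# Rank corpus sentences by cosine similarity to the query and print the best matches
top_k = torch.topk(similarity[0], k=3)
for score, idx in zip(top_k.values, top_k.indices):
    print(f"{score.item():.4f}  {data[int(idx)]}")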
And it works fine and fast. But of course it is only using a small corpus; with a large corpus, the encoding will take time.
Note: encoding the query takes almost no time because it is only one sentence, but encoding the corpus will take a while.
So, the question is: can we save doc_embedding locally and reuse it, especially when using a large corpus?
Is there any built-in class/function in the library to do this?
Save them as pickle files and load them later =]
import pickle

# Save the corpus embeddings to disk once
with open('doc_embedding.pickle', 'wb') as pkl:
    pickle.dump(doc_embedding, pkl)

# Load them back on later runs
with open('doc_embedding.pickle', 'rb') as pkl:
    doc_embedding = pickle.load(pkl)
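On a later run you can then skip re-encoding the corpus entirely; a sketch assuming the same model and pickle file as above:

import pickle
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')

# Reuse the cached corpus embeddings instead of calling model.encode(data) again
with open('doc_embedding.pickle', 'rb') as pkl:
    doc_embedding = pickle.load(pkl)

# Only the single-sentence query needs to be encoded at query time
query_embedding = model.encode('A man is eating pasta.')
similarity = util.cos_sim(query_embedding, doc_embedding)

Since encode returns a numpy.ndarray by default, numpy.save / numpy.load would be an equally valid alternative to pickle.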