
How to save sentence-Bert output vectors to a file?

I am using BERT to get the similarity between multi-word terms. Here is the code I use for embedding:

from sentence_transformers import SentenceTransformer
model = SentenceTransformer('bert-large-uncased-whole-word-masking')
words = [
"Artificial intelligence",
"Data mining",
"Political history",
"Literature book"]

I also have a dataset that contains 540,000 other terms:

Vocabs = [
"Winter flooding",
"Cholesterol diet", ....]

The problem is that embedding Vocabs into vectors takes forever:

words_embeddings = model.encode(words)
Vocabs_embeddings = model.encode(Vocabs)

Is there any way to make it faster? Alternatively, I would like to embed Vocabs once and save the output vectors to a file, so I don't have to re-embed 540,000 entries every time I need them. Is there a way to save embeddings to a file and use them again? I really appreciate your time in helping me.

asked Nov 16 '25 by Sahar Rezazadeh

1 Answer

You can pickle your corpus and embeddings like this (you could also pickle a dictionary instead, or write them to a file in any other format you prefer):

import pickle

with open("my-embeddings.pkl", "wb") as fOut:
    pickle.dump({'sentences': words, 'embeddings': words_embeddings}, fOut)
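To load them back in a later session, open the file in binary read mode and unpickle it. Here is a round-trip sketch using small stand-in lists in place of the real `model.encode` output, just to show the save/load cycle:

```python
import pickle

# Stand-in data; in practice these come from model.encode(words)
words = ["Artificial intelligence", "Data mining"]
words_embeddings = [[0.1, 0.2], [0.3, 0.4]]

with open("my-embeddings.pkl", "wb") as fOut:
    pickle.dump({'sentences': words, 'embeddings': words_embeddings}, fOut)

# Later, in another session: load instead of re-encoding
with open("my-embeddings.pkl", "rb") as fIn:
    stored = pickle.load(fIn)

words = stored['sentences']
words_embeddings = stored['embeddings']
```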

Or, more generally, as below: you encode only when the embeddings file doesn't exist yet; after that, any time you need the embeddings you load them from the file instead of re-encoding your corpus:

import os
import pickle

import numpy as np

if not os.path.exists(embedding_cache_path):
    # Read your corpus etc.
    corpus_sentences = ...
    print("Encoding the corpus. This might take a while")
    corpus_embeddings = model.encode(corpus_sentences, show_progress_bar=True, convert_to_numpy=True)
    # L2-normalize so that dot products equal cosine similarities
    corpus_embeddings = corpus_embeddings / np.linalg.norm(corpus_embeddings, axis=1, keepdims=True)

    print("Storing file on disc")
    with open(embedding_cache_path, "wb") as fOut:
        pickle.dump({'sentences': corpus_sentences, 'embeddings': corpus_embeddings}, fOut)

else:
    print("Loading pre-computed embeddings from disc")
    with open(embedding_cache_path, "rb") as fIn:
        cache_data = pickle.load(fIn)
        corpus_sentences = cache_data['sentences']
        corpus_embeddings = cache_data['embeddings']
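Once loaded, the normalized embeddings can be searched with a plain matrix product, since the rows were L2-normalized above and the dot product of unit vectors is their cosine similarity. A minimal sketch (the helper name `top_k_similar` is my own; the query embedding would come from the same `model.encode` call, normalized the same way):

```python
import numpy as np

def top_k_similar(query_embedding, corpus_embeddings, corpus_sentences, k=3):
    # Normalize the query; corpus rows are assumed already normalized,
    # so the matrix-vector product below yields cosine similarities.
    query = query_embedding / np.linalg.norm(query_embedding)
    scores = corpus_embeddings @ query
    best = np.argsort(-scores)[:k]
    return [(corpus_sentences[i], float(scores[i])) for i in best]
```

This avoids re-encoding the 540,000-entry corpus entirely: only the query string needs to go through the model at lookup time.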
answered Nov 18 '25 by Sara Moradlou