
Different embedding checksums after encoding with SentenceTransformers?

I am calculating some embeddings with the SentenceTransformers library. However, when I encode the sentences and check the sum of the resulting embedding values, I get different results across runs. For instance:

In:


import random
import numpy as np
import tensorflow as tf
import torch
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm

RANDOM_SEED = 42
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)


transformer_models = [
    'M-CLIP/M-BERT-Distil-40',
]

sentences = df['content'].tolist()


for transformer_model in tqdm(transformer_models, desc="Transformer Models"):
    tqdm.write(f"Processing with Transformer Model: {transformer_model}")
    model = SentenceTransformer(transformer_model)
    embeddings = model.encode(sentences)
    print(f"Embeddings Checksum for {transformer_model}:", np.sum(embeddings))

Out:

Embeddings Checksum for M-CLIP/M-BERT-Distil-40: 1105.9185

Or

Embeddings Checksum for M-CLIP/M-BERT-Distil-40: 1113.5422

I noticed this happens when I restart the Jupyter notebook kernel, clear the output, and then re-run the full notebook. Any idea how to fix this issue?

As an alternative, I tried setting the random seeds both before and after the embeddings calculation:

import torch
import numpy as np
import random
import tensorflow as tf
from sentence_transformers import SentenceTransformer
from tqdm.auto import tqdm

RANDOM_SEED = 42

# Setting seeds
np.random.seed(RANDOM_SEED)
random.seed(RANDOM_SEED)
tf.random.set_seed(RANDOM_SEED)
torch.manual_seed(RANDOM_SEED)

# Ensuring PyTorch determinism
torch.backends.cudnn.deterministic = True
torch.backends.cudnn.benchmark = False

transformer_models = ['M-CLIP/M-BERT-Distil-40']

sentences = df['content'].tolist()

for transformer_model in tqdm(transformer_models, desc="Transformer Models"):
    # Set the seed again right before loading the model
    np.random.seed(RANDOM_SEED)
    random.seed(RANDOM_SEED)
    tf.random.set_seed(RANDOM_SEED)
    torch.manual_seed(RANDOM_SEED)

    tqdm.write(f"Processing with Transformer Model: {transformer_model}")
    model = SentenceTransformer(transformer_model, device='cpu')  # Force CPU to avoid GPU non-determinism

    embeddings = model.encode(sentences, show_progress_bar=False)  # Disable the progress bar
    print(f"Embeddings Checksum for {transformer_model}:", np.sum(embeddings))

However, I am still getting the same inconsistent behavior.

UPDATE

What I tried now, and it seems to work, is storing all the calculated embeddings in files and reloading them instead of recomputing. However, I still find it strange that recomputing gives different results each time. Has anyone experienced this before?
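For what it's worth, here is a minimal sketch of the caching approach I mean (the file name is just illustrative, and `model` / `sentences` are as defined above):

import os
import numpy as np

CACHE_PATH = 'embeddings_M-BERT-Distil-40.npy'  # illustrative file name

if os.path.exists(CACHE_PATH):
    # Reuse the stored embeddings so every later run sees exactly the same values
    embeddings = np.load(CACHE_PATH)
else:
    embeddings = model.encode(sentences)
    np.save(CACHE_PATH, embeddings)

print("Embeddings Checksum:", np.sum(embeddings))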

asked Jan 20 '26 by tumbleweed


1 Answer

Please try using the .apply method instead of .encode; for me, that worked in a similar application and helped resolve the reproducibility issue.
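The answer above does not show code, so here is one possible reading of the ".apply" suggestion (an assumption on my part, not code from the original answer): encode each row of the DataFrame individually via pandas' .apply, so batching and padding cannot influence the result. It assumes `df`, `model`, and numpy from the question are in scope.

import numpy as np

# Encode one sentence at a time; each call returns a 1-D numpy vector
per_row = df['content'].apply(lambda text: model.encode(text))

# Stack the per-row vectors into a single 2-D array of embeddings
embeddings = np.vstack(per_row.to_numpy())
print("Embeddings Checksum:", np.sum(embeddings))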

This seems to be an ongoing issue that many people run into. You can follow this issue for more information on embeddings changing with different batch sizes/other settings, and for possible solutions (different precision settings, specifying/ensuring consistent tokenization and padding of the input, etc.).
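As a rough illustration (not from that issue thread), here is a minimal sketch of pinning the settings that commonly cause run-to-run drift: a fixed seed, a fixed batch size, CPU execution, and PyTorch's deterministic mode. The model name is taken from the question; the placeholder sentences and torch.use_deterministic_algorithms are my additions.

import numpy as np
import torch
from sentence_transformers import SentenceTransformer

# Placeholder sentences; in the question these come from df['content'].tolist()
sentences = ["first example sentence", "second example sentence"]

torch.manual_seed(42)
torch.use_deterministic_algorithms(True)  # raise an error if a non-deterministic op is used

# CPU avoids GPU-specific non-determinism
model = SentenceTransformer('M-CLIP/M-BERT-Distil-40', device='cpu')

# A fixed batch_size keeps the batching and padding identical across runs
embeddings = model.encode(
    sentences,
    batch_size=32,
    convert_to_numpy=True,
    show_progress_bar=False,
)
print("Embeddings Checksum:", np.sum(embeddings.astype(np.float32)))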

answered Jan 22 '26 by Soham Kanti Bera