 

Semantic searching using Google flan-t5

I'm trying to use google/flan-t5-large to create embeddings for a simple semantic search engine. However, the similarity scores between the generated corpus embeddings and my query embedding are way off. Is there something I'm doing wrong?

import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import euclidean

tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-large')
model = AutoModel.from_pretrained('google/flan-t5-large')

# Encode each text into a single vector by mean-pooling the decoder's
# last hidden state (T5 needs decoder_input_ids, so the input is reused).
def generate_embeddings(texts):
  all_embeddings = []
  for text in texts:
    input_ids = tokenizer.encode(text, return_tensors='pt')
    with torch.no_grad():
      embedding = model(input_ids, decoder_input_ids=input_ids).last_hidden_state.mean(dim=1)
    all_embeddings.append((embedding, text))
  return all_embeddings

# Embed the query the same way and score it against every corpus embedding.
def run_query(query, corpus):
  input_ids = tokenizer.encode(query, return_tensors='pt')
  with torch.no_grad():
    query_embedding = model(input_ids, decoder_input_ids=input_ids).last_hidden_state.mean(dim=1)

  similarity = []
  for embedding, text in corpus:
    sim = euclidean(embedding.flatten(), query_embedding.flatten())
    similarity.append((text, float(sim)))
  return similarity


text = ['some sad song', 'a very happy song']
corpus = generate_embeddings(text)

query = "I'm feeling so sad rn"
similarity = run_query(query, corpus)
for sentence, score in similarity:
  print(score, sentence)

I've tried different pooling techniques as well as other distance metrics.

Affan Mir asked Oct 28 '25 19:10

1 Answer

The problem here is the assumption that FLAN's sentence embeddings are suited for similarity metrics, but that isn't the case. Jacob Devlin once wrote regarding BERT:

I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors.

But that isn't an issue, because FLAN is intended for other use cases. It was trained on a variety of datasets, each paired with a suitable instruction prompt, to enable zero-shot prompting (i.e. performing tasks the model wasn't explicitly trained on). That means you can perform your similarity task by formulating a proper prompt, without any training. For example:

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

model_id = "google/flan-t5-large"

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)

prompt = """Which song fits the query.
QUERY: I'm feeling so sad rn 
OPTIONS 
-some sad song 
-a very happy song"""

input_ids = tokenizer(prompt, return_tensors="pt").input_ids  
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Output:

some sad song

Depending on your use case, you might run into issues when the number of options grows, or you may actually need sentence embeddings to work with. In that case, have a look at sentence-transformers. These are transformers that were trained to produce meaningful sentence embeddings and can therefore be used to calculate the cosine similarity of two sentences.
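
For reference, a minimal sketch of that route (assuming the sentence-transformers package is installed; the model name "all-MiniLM-L6-v2" is just one common example checkpoint, not something the answer prescribes):

from sentence_transformers import SentenceTransformer, util

# Example sentence-embedding model; any sentence-transformers checkpoint works here.
model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = ["some sad song", "a very happy song"]
query = "I'm feeling so sad rn"

# Encode corpus and query into embeddings trained for semantic similarity.
corpus_embeddings = model.encode(corpus, convert_to_tensor=True)
query_embedding = model.encode(query, convert_to_tensor=True)

# Cosine similarity between the query and every corpus sentence.
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
for sentence, score in zip(corpus, scores):
    print(f"{float(score):.3f}  {sentence}")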

cronoik answered Oct 31 '25 12:10


