I'm trying to use Google's flan-t5-large to create embeddings for a simple semantic search engine. However, the cosine similarity between the generated embeddings and my query is way off. Is there something I'm doing wrong?
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import euclidean

tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-large')
model = AutoModel.from_pretrained('google/flan-t5-large')

def generate_embeddings(texts):
    all_embeddings = []
    for text in texts:
        input_ids = tokenizer.encode(text, return_tensors='pt')
        with torch.no_grad():
            # mean-pool the decoder's last hidden state as the sentence embedding
            embedding = model(input_ids, decoder_input_ids=input_ids).last_hidden_state.mean(dim=1)
        all_embeddings.append((embedding, text))
    return all_embeddings

def run_query(query, corpus):
    input_ids = tokenizer.encode(query, return_tensors='pt')
    with torch.no_grad():
        query_embedding = model(input_ids, decoder_input_ids=input_ids).last_hidden_state.mean(dim=1)
    similarity = []
    for embedding, text in corpus:
        sim = euclidean(embedding.flatten(), query_embedding.flatten())
        similarity.append((text, float(sim)))
    return similarity

text = ['some sad song', 'a very happy song']
corpus = generate_embeddings(text)
query = "I'm feeling so sad rn"
similarity = run_query(query, corpus)
for result in similarity:
    print(result)
I've tried different pooling techniques as well as using other distance metrics.
The problem here is that you're assuming FLAN's sentence embeddings are suited for similarity metrics, but that isn't the case. Jacob Devlin once wrote regarding BERT:
I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors.
But that isn't an issue, because FLAN is intended for other use cases. It was fine-tuned on a mixture of datasets, each with a suitable instruction prompt for its task, to enable zero-shot prompting (i.e. performing tasks the model wasn't explicitly trained on). That means you can perform your similarity task by formulating a proper prompt, without any training. For example:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_id = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
prompt = """Which song fits the query.
QUERY: I'm feeling so sad rn
OPTIONS
-some sad song
-a very happy song"""
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Output:
some sad song
Depending on your use case, you might run into issues when the number of options increases, or you might actually need the sentence embeddings themselves. If that's the case, you should have a look at sentence-transformers. These are transformer models that were trained to produce meaningful sentence embeddings and can therefore be used to calculate the cosine similarity of two sentences.
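If you go that route, a minimal sketch with the sentence-transformers library could look like the following (the all-MiniLM-L6-v2 checkpoint is just an assumed example; any sentence-transformers model works):

from sentence_transformers import SentenceTransformer, util

# all-MiniLM-L6-v2 is only an example checkpoint; pick any sentence-transformers model
st_model = SentenceTransformer('all-MiniLM-L6-v2')

corpus = ['some sad song', 'a very happy song']
query = "I'm feeling so sad rn"

# encode() returns one embedding per input sentence
corpus_embeddings = st_model.encode(corpus, convert_to_tensor=True)
query_embedding = st_model.encode(query, convert_to_tensor=True)

# cosine similarity between the query and every corpus entry
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
for sentence, score in zip(corpus, scores):
    print(sentence, float(score))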