I'm trying to use Google's flan-t5-large to create embeddings for a simple semantic search engine. However, the cosine similarity between the generated embeddings and my query is way off. Is there something I'm doing wrong?
import torch
from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
from scipy.spatial.distance import euclidean

tokenizer = AutoTokenizer.from_pretrained('google/flan-t5-large')
model = AutoModel.from_pretrained('google/flan-t5-large')

def generate_embeddings(texts):
    all_embeddings = []
    for text in texts:
        input_ids = tokenizer.encode(text, return_tensors='pt')
        with torch.no_grad():
            # mean-pool the decoder's last hidden state as the sentence embedding
            embedding = model(input_ids, decoder_input_ids=input_ids).last_hidden_state.mean(dim=1)
        all_embeddings.append((embedding, text))
    return all_embeddings

def run_query(query, corpus):
    input_ids = tokenizer.encode(query, return_tensors='pt')
    with torch.no_grad():
        query_embedding = model(input_ids, decoder_input_ids=input_ids).last_hidden_state.mean(dim=1)
    similarity = []
    for embedding, text in corpus:
        sim = euclidean(embedding.flatten(), query_embedding.flatten())
        similarity.append((text, float(sim)))
    return similarity

text = ['some sad song', 'a very happy song']
corpus = generate_embeddings(text)
query = "I'm feeling so sad rn"
similarity = run_query(query, corpus)
for result in similarity:
    print(result)
I've tried different pooling techniques as well as using other distance metrics.
The problem here is that you're assuming FLAN's sentence embeddings are suited for similarity metrics, but that isn't the case. Jacob Devlin once wrote regarding BERT:
I'm not sure what these vectors are, since BERT does not generate meaningful sentence vectors.
But that isn't an issue, because FLAN is intended for other use cases. It was fine-tuned on a mixture of datasets, each with a suitable instruction prompt for its task, to enable zero-shot prompting (i.e. performing tasks the model wasn't explicitly trained on). That means you can perform your similarity task by formulating a proper prompt, without any training. For example:
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
model_id = "google/flan-t5-large"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSeq2SeqLM.from_pretrained(model_id)
prompt = """Which song fits the query.
QUERY: I'm feeling so sad rn
OPTIONS
-some sad song
-a very happy song"""
input_ids = tokenizer(prompt, return_tensors="pt").input_ids
outputs = model.generate(input_ids)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
Output:
some sad song
Depending on your use case, you might run into issues when the number of options increases, or you might actually need the sentence embeddings themselves. If that's the case, you should have a look at sentence-transformers. These are transformer models that were trained to produce meaningful sentence embeddings and can therefore be used to calculate the cosine similarity of two sentences.
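If you go that route, a minimal sketch with the sentence-transformers library could look like the following (the all-MiniLM-L6-v2 checkpoint is just an assumed example; any sentence-transformers model works):

from sentence_transformers import SentenceTransformer, util

# all-MiniLM-L6-v2 is only an example checkpoint; pick any sentence-transformers model
st_model = SentenceTransformer('all-MiniLM-L6-v2')

corpus = ['some sad song', 'a very happy song']
query = "I'm feeling so sad rn"

# encode() returns one embedding per input sentence
corpus_embeddings = st_model.encode(corpus, convert_to_tensor=True)
query_embedding = st_model.encode(query, convert_to_tensor=True)

# cosine similarity between the query and every corpus entry
scores = util.cos_sim(query_embedding, corpus_embeddings)[0]
for sentence, score in zip(corpus, scores):
    print(sentence, float(score))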