 

How to compare two text documents with a TF-IDF vectorizer?

I have two different texts that I want to compare using TF-IDF vectorization. What I am doing is:

  1. tokenizing each document
  2. vectorizing using TfidfVectorizer.fit_transform(tokens_list)

The vectors I get after step 2 have different shapes. Conceptually, both vectors should have the same shape, since only then can they be compared.
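One plausible way to reproduce the mismatch is sketched below. It assumes that `tokens_list` is a flat list of token strings and that a fresh vectorizer is fitted per document; the example texts and variable names are made up for illustration and are not from the original question.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical example texts, tokenized by whitespace.
tokens_d1 = "the cat sat on the mat".split()
tokens_d2 = "the dog chased the cat across the yard".split()

# fit_transform expects an iterable of documents, so every token string
# is treated as its own one-word document. Each call also learns its own
# vocabulary, so neither the row count nor the column count has to match.
vec_d1 = TfidfVectorizer().fit_transform(tokens_d1)
vec_d2 = TfidfVectorizer().fit_transform(tokens_d2)

print(vec_d1.shape)  # (number of tokens in D1, distinct words in D1)
print(vec_d2.shape)  # (number of tokens in D2, distinct words in D2)
```

If this matches the setup, the differing shapes come from fitting two separate vocabularies, not from a bug in the vectorizer.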

What am I doing wrong? Please help.

Thanks in advance.

akshit bhatia asked Sep 14 '25 03:09



1 Answer

As G. Anderson already pointed out, and for future readers: calling fit() on TfidfVectorizer with document D1 builds the bag of words (the vocabulary) for D1.

The transform() function then computes the TF-IDF weight of each word in that bag of words.

Now our aim is to compare document D2 with D1, that is, to see how many words of D1 also occur in D2. That's why we call fit_transform() on D1 and then only transform() on D2: transform() applies D1's bag of words to D2 and computes the TF-IDF weights of D2's tokens over that same vocabulary, so both vectors end up with the same number of columns. This gives a relative comparison of D2 against D1.
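The approach described above can be sketched roughly as follows. The example texts, the variable names, and the use of cosine similarity as the comparison measure are assumptions added for illustration; the original answer does not specify how the two vectors are finally compared.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical example documents, passed as whole strings (not token lists).
d1 = "the cat sat on the mat"
d2 = "the dog chased the cat across the yard"

vectorizer = TfidfVectorizer()

# fit_transform learns the vocabulary (and IDF weights) from D1 only.
v1 = vectorizer.fit_transform([d1])

# transform reuses D1's vocabulary, so D2 is projected onto the same columns.
# Words that occur only in D2 are ignored.
v2 = vectorizer.transform([d2])

print(v1.shape, v2.shape)  # same number of columns for both

# Assumed comparison measure: cosine similarity between the two row vectors.
similarity = cosine_similarity(v1, v2)[0, 0]
print(similarity)
```

Because the vocabulary comes from D1 alone, the result is asymmetric. If both documents should contribute words, fitting once on both, as in `fit_transform([d1, d2])`, also yields rows with matching columns.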

akshit bhatia answered Sep 15 '25 18:09
