
How does tfidf transform test data after being fitted to train data?

I am using the following code:

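from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([('vect',
                      TfidfVectorizer(ngram_range=(1, 2),
                                      stop_words="english",
                                      sublinear_tf=True,
                                      use_idf=True,
                                      norm='l2')),
                     ('reduce_dim',
                      SelectPercentile(f_classif, percentile=90)),
                     ('clf',
                      SVC(kernel='linear', C=1.0,
                          probability=True, max_iter=70000,
                          class_weight='balanced'))])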

model = pipeline.fit(X_train,y_train)
model.predict(X_test)

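# The 'vect' step on its own (vectorizer being a TfidfVectorizer like the one in the pipeline):
x = vectorizer.fit_transform(X_train_text)
y = vectorizer.transform(X_test_text)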

As I understand it, pipeline.fit() fits the TF-IDF vectorizer on the training data, and when model.predict() is called on X_test, the vectorizer only transforms X_test using what it learned from the training data.

Since TF-IDF is computed from word frequencies within each document and across the corpus, I am wondering what happens underneath in .fit_transform() and .transform().

Sakshi Jajodia asked Sep 19 '25 08:09



2 Answers

1) A closely related question has already been answered here: What is the difference between TfidfVectorizer.fit_transfrom and tfidf.transform?

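2) The TF-IDF weighting is learned and applied inside fit_transform(). predict() has nothing to do with the TF-IDF vectorizer, which has no such method; it is a method of SVC. When you call predict() on the pipeline, each earlier step's transform() is applied to the input first, and the result is passed to SVC's predict().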
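
To make point 2 concrete, here is a minimal sketch, assuming scikit-learn is installed. The toy texts and labels are invented for illustration, and the 'reduce_dim' step from the question is left out to keep it short. It checks that predict() on the pipeline gives the same result as calling transform() on the vectorizer and then predict() on the classifier by hand:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data, invented for illustration only.
X_train = ["good movie", "great film", "bad movie", "awful film"]
y_train = [1, 1, 0, 0]
X_test = ["good film", "awful movie"]

pipeline = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", SVC(kernel="linear")),
])
pipeline.fit(X_train, y_train)  # fit_transform on 'vect', then fit on 'clf'

# The same steps that pipeline.predict(X_test) runs, done by hand:
features = pipeline.named_steps["vect"].transform(X_test)  # transform only, no refit
manual_pred = pipeline.named_steps["clf"].predict(features)

pipeline_pred = pipeline.predict(X_test)
same = np.array_equal(manual_pred, pipeline_pred)
print(same)
```

The vectorizer is never refitted here; the test texts are only mapped onto the columns learned from X_train.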

Igor sharm answered Sep 20 '25 21:09



Here is the basic documentation of fit() and fit_transform().

Your understanding is correct. During training, the parameters of the TF-IDF vectorizer (its vocabulary and IDF weights) are learned from the training data. These parameters are stored and later used only to transform the test data.

  • Training data - fit_transform()
  • Testing data - transform()
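
The stored parameters are, concretely, the vocabulary and the per-term IDF weights. A small sketch, assuming scikit-learn is installed (the two tiny corpora are invented for illustration), showing that transform() leaves them untouched and ignores words it never saw during fitting:

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, invented for illustration only.
train_docs = ["the cat sat", "the dog sat", "the cat ran"]
test_docs = ["the zebra sat", "zebra zebra zebra"]

vectorizer = TfidfVectorizer()
X_train_tfidf = vectorizer.fit_transform(train_docs)  # learns vocabulary_ and idf_

vocab_before = dict(vectorizer.vocabulary_)
idf_before = vectorizer.idf_.copy()

X_test_tfidf = vectorizer.transform(test_docs)  # reuses them, learns nothing new

# The fitted state is unchanged by transform().
state_unchanged = (vocab_before == dict(vectorizer.vocabulary_)
                   and np.array_equal(idf_before, vectorizer.idf_))

# 'zebra' never appeared during fit, so it has no column and is ignored.
zebra_known = "zebra" in vectorizer.vocabulary_

# Train and test matrices share the same columns.
print(X_train_tfidf.shape, X_test_tfidf.shape)
print(state_unchanged, zebra_known)
```

A test document made up entirely of unseen words, like the second one here, ends up as an all-zero row.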

If you want to see the inner workings, have a look at the source code.
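
As a rough picture of what that source code does, here is a dependency-free sketch of the fit/transform split. It assumes scikit-learn's default smoothed IDF, idf(t) = ln((1 + n) / (1 + df(t))) + 1, with raw term counts and L2 normalization, and it leaves out the ngram_range, stop_words and sublinear_tf options from the question. The function names fit_idf and transform are hypothetical, not part of any library:

```python
import math
from collections import Counter

def fit_idf(docs):
    """Learn IDF weights from the training documents only (hypothetical helper)."""
    n = len(docs)
    # Document frequency: in how many documents each term occurs.
    df = Counter(term for doc in docs for term in set(doc.split()))
    return {t: math.log((1 + n) / (1 + df_t)) + 1 for t, df_t in df.items()}

def transform(docs, idf):
    """Weight documents with already-learned IDF values; unseen terms are dropped."""
    rows = []
    for doc in docs:
        counts = Counter(t for t in doc.split() if t in idf)
        row = {t: c * idf[t] for t, c in counts.items()}
        norm = math.sqrt(sum(v * v for v in row.values()))
        # L2-normalize each row; an empty row stays empty.
        rows.append({t: v / norm for t, v in row.items()} if norm else row)
    return rows

train_docs = ["the cat sat", "the dog sat", "the cat ran"]
test_docs = ["the zebra sat"]

idf = fit_idf(train_docs)                # the "fit" half of fit_transform
train_rows = transform(train_docs, idf)  # the "transform" half
test_rows = transform(test_docs, idf)    # test data: transform only

print(idf)
print(test_rows[0])
```

The document frequencies come only from train_docs, so the test document is weighted with training-corpus IDF values and 'zebra' contributes nothing.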

skillsmuggler answered Sep 20 '25 23:09
