I am using the following code:
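from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.feature_selection import SelectPercentile, f_classif
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

pipeline = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 2),
                             stop_words="english",
                             sublinear_tf=True,
                             use_idf=True,
                             norm='l2')),
    # keep the top 90% of features ranked by ANOVA F-score
    ('reduce_dim', SelectPercentile(f_classif, percentile=90)),
    ('clf', SVC(kernel='linear', C=1.0,
                probability=True, max_iter=70000,
                class_weight='balanced')),
])
model = pipeline.fit(X_train, y_train)
model.predict(X_test)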
x = vectorizer.fit_transform(X_train_text)
y = vectorizer.transform(X_test_text)
As per my understanding, pipeline.fit() fits tf-idf to the train data, and when model.predict() is called on X_test, it only does a tf-idf transformation based on the fitted train data.
Since tf-idf works by counting the frequency of words in the document and in the corpus, I am wondering what happens underneath in the .fit_transform and .transform functions.
1) A question very close to yours is answered here: What is the difference between TfidfVectorizer.fit_transfrom and tfidf.transform?
2) The tf-idf transformation is done inside fit_transform. predict here does not belong to the tf-idf vectorizer, which has no such function; it is a method of SVC. When predict is called on the pipeline, each earlier step applies transform and only the final SVC step predicts.
Here is the basic documentation of fit() and fit_transform().
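To make point 2 concrete, here is a minimal sketch showing that calling predict on a pipeline gives the same result as transforming the test text with the already-fitted vectorizer and then calling predict on the classifier. The toy documents, labels, and variable names are made up for illustration, and the pipeline is reduced to two steps to keep it short.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy data, invented for this example only.
train_docs = ["good movie", "great film", "bad movie", "terrible film"]
train_labels = [1, 1, 0, 0]
test_docs = ["good film", "bad unseen words"]

pipe = Pipeline([
    ("vect", TfidfVectorizer()),
    ("clf", SVC(kernel="linear")),
])
# fit() calls fit_transform on the vectorizer, then fit on the SVC.
pipe.fit(train_docs, train_labels)

# predict() on the pipeline: transform with each earlier step, then SVC.predict.
pipeline_pred = pipe.predict(test_docs)

# The same thing done by hand with the fitted steps.
vect = pipe.named_steps["vect"]
clf = pipe.named_steps["clf"]
manual_pred = clf.predict(vect.transform(test_docs))

print(pipeline_pred, manual_pred)
```

Note that the vectorizer is never refitted on the test text; only its transform method is used at prediction time.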
Your understanding of the working is correct. During fitting on the training data, the parameters of the tf-idf Vectorizer (the vocabulary and the idf weights) are learned. These parameters are stored and used later to just transform the testing data.
fit_transform()
transform()
If you want to look at the inside workings, you should have a look at the source code for the same.
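Short of reading the source, the stored parameters can be inspected directly. The sketch below uses made-up toy sentences and a default TfidfVectorizer (so smooth_idf=True, which is an assumption that differs from nothing in the question's settings but is worth stating). It checks that transform leaves the learned vocabulary and idf weights untouched, and that the idf weights match the smoothed formula ln((1 + n) / (1 + df)) + 1 computed from the training documents alone.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpora, invented for this example only.
train_docs = ["the cat sat", "the dog sat", "the cat ran"]
test_docs = ["the bird sat", "cat cat dog"]

vectorizer = TfidfVectorizer()  # defaults: use_idf=True, smooth_idf=True, norm='l2'

# fit_transform: learn vocabulary and idf from the training text, then transform it.
X_train = vectorizer.fit_transform(train_docs)
vocab_after_fit = dict(vectorizer.vocabulary_)
idf_after_fit = vectorizer.idf_.copy()

# transform: reuse the stored vocabulary and idf; nothing is learned from test text.
X_test = vectorizer.transform(test_docs)

# Recompute the idf weights by hand from training document frequencies.
n_docs = len(train_docs)
doc_freq = np.asarray((X_train > 0).sum(axis=0)).ravel()
expected_idf = np.log((1 + n_docs) / (1 + doc_freq)) + 1

print(sorted(vectorizer.vocabulary_))
print(np.allclose(expected_idf, vectorizer.idf_))
```

Words that appear only in the test text, such as "bird" here, are simply ignored by transform because they are not in the fitted vocabulary.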