
NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted python

Goal: Predict labels on my original data

Background: I constructed an SVM classifier

I am using the following code:

0) Import modules

    import numpy as np
    from sklearn import datasets
    from sklearn import svm
    from sklearn.model_selection import StratifiedKFold
    from sklearn.metrics import precision_score, recall_score, accuracy_score
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.metrics import precision_recall_fscore_support

1) X_list and y

    type(X_list)  # list of strings
    len(X_list)   # 2163
    type(y)       # numpy.ndarray
    len(y)        # 2163

2) convert X_list from string to float, use tfidf

    tfidf = TfidfVectorizer()
    X_vec = tfidf.fit_transform(X_list)
    X = X_vec.toarray()

3) X shape

    X.shape  # (2163, 8753)

4) 10 fold validation and SVM

    skf = StratifiedKFold(n_splits=10)
    clf = svm.SVC(kernel='linear', C=1)

5) loop through 10 folds

    precision_scores = []
    recall_scores = []
    f_scores = []

    for train_index, test_index in skf.split(X, y):
        X_train = X[train_index]
        X_test = X[test_index]
        y_train = y[train_index]
        y_test = y[test_index]

        clf.fit(X_train, y_train)
        y_pred = clf.predict(X_test)

        # Returns (precision, recall, fscore, support); per-class arrays by default
        scores = precision_recall_fscore_support(y_test, y_pred)
        precision_scores.append(scores[0])
        recall_scores.append(scores[1])
        f_scores.append(scores[2])

6) Predict on original dataset X_original

    type(X_original)  # list of strings
    len(X_original)   # 2163

7) Convert X_original from string to float

    tfidf = TfidfVectorizer()
    X_original_transform = tfidf.transform(X_original)

But when I do so, I get the following error:

`NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.`

SO has a similar question, "NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted", but it seems different from my issue.

8) How do I fix this error?


1 Answer

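In point (7) above you create a new TfidfVectorizer instance. A new instance has no vocabulary or IDF weights, and you never fit it, hence the error. You need to call fit() on it first, the same way you did in point (2).

Change point (7) to:

    tfidf = TfidfVectorizer()
    # fit_transform should be used here.
    X_original_transform = tfidf.fit_transform(X_original)

Keep in mind that refitting learns a new vocabulary from X_original. If you want to pass X_original_transform to the clf trained earlier, its columns must match the training features, so in that case skip the re-initialization and call only transform() on the tfidf that was already fitted.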
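To make the cause of the error concrete, here is a minimal sketch. The two toy corpora are made up for illustration and stand in for X_list and X_original; only the scikit-learn calls are real. It shows that a fresh instance raises NotFittedError on transform(), while an instance that was already fitted transforms new text into the same number of columns it learned during fitting.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpora standing in for X_list and X_original.
train_docs = ["the cat sat", "the dog barked", "a cat and a dog"]
new_docs = ["the cat barked", "an unseen bird"]

fitted = TfidfVectorizer()
X_train = fitted.fit_transform(train_docs)

# A fresh instance has no vocabulary, so transform() raises NotFittedError.
try:
    TfidfVectorizer().transform(new_docs)
    raised = False
except NotFittedError:
    raised = True

# The already fitted instance maps new text onto the training vocabulary.
X_new = fitted.transform(new_docs)
same_width = X_new.shape[1] == X_train.shape[1]

print(raised, same_width)
```

Words that never appeared during fitting are simply ignored by transform(), which is why the column count stays the same.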

Also, in point (2) you fit the TfidfVectorizer on the whole dataset and only then split it into train and test. This is not recommended because it leaks information about the test data (its vocabulary and document frequencies) into training. Consider how this works in a real-world situation: do you have the data you want to predict on in advance? No. You train the model on the available data and use it on unseen data. Your current code in point (2) breaks this.

Always split into train and test first, then fit (fit()) only on the training data and use what was learned to transform (transform()) the testing data.
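A small sketch of what "leaking" means here, using made-up documents (the names leaky and clean are only for illustration). A vectorizer fitted on everything already knows words that occur only in the test part, while one fitted on the training part does not.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy documents; test_docs plays the role of a test fold.
train_docs = ["good movie", "bad movie", "good plot"]
test_docs = ["good soundtrack", "terrible plot"]

# Fitted on everything: the vocabulary already contains test-only words.
leaky = TfidfVectorizer().fit(train_docs + test_docs)

# Fitted on the training part only: test-only words stay unknown.
clean = TfidfVectorizer().fit(train_docs)

leaked_words = sorted(set(leaky.vocabulary_) - set(clean.vocabulary_))
print(leaked_words)  # ['soundtrack', 'terrible']

# transform() on the test fold keeps the training column layout.
X_test = clean.transform(test_docs)
print(X_test.shape)  # (2, 4)
```

The IDF weights are affected in the same way, since they are computed from document frequencies over whatever data fit() was given.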

Change it like this:

1) First remove the code in point (2). We will do that step inside the loop over the folds.

2) Change point (5) to:

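    # A plain list cannot be indexed with an index array, so convert it first
    X_arr = np.array(X_list, dtype=object)

    for train_index, test_index in skf.split(X_arr, y):
        X_train = X_arr[train_index]
        X_test = X_arr[test_index]
        y_train = y[train_index]
        y_test = y[test_index]

        tfidf = TfidfVectorizer()

        # This is what I'm talking about
        X_train = tfidf.fit_transform(X_train)
        clf.fit(X_train, y_train)

        # Only call transform() here
        X_test = tfidf.transform(X_test)
        y_pred = clf.predict(X_test)

        # Returns (precision, recall, fscore, support); per-class arrays by default
        scores = precision_recall_fscore_support(y_test, y_pred)
        precision_scores.append(scores[0])
        recall_scores.append(scores[1])
        f_scores.append(scores[2])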
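As an alternative to the manual loop (this is a different technique from the code above, not a replacement the question asked for), scikit-learn's Pipeline can bundle the vectorizer and the classifier so that fit() and transform() are applied to the right part of each fold automatically. The sketch below assumes tiny made-up data in place of X_list, y and X_original, uses 2 folds instead of 10 only because the toy set is so small, and picks macro-averaged scoring as an arbitrary choice.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Hypothetical toy data standing in for X_list, y and X_original.
X_list = ["good movie", "great film", "nice plot", "fine acting",
          "bad movie", "awful film", "poor plot", "weak acting"]
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])
X_original = ["great plot", "awful acting"]

# The pipeline refits TfidfVectorizer on the training part of every fold.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", C=1))

skf = StratifiedKFold(n_splits=2)
results = cross_validate(
    model, X_list, y, cv=skf,
    scoring=["precision_macro", "recall_macro", "f1_macro"],
)
print(results["test_f1_macro"])

# After evaluation, fit once on all labelled data and predict on raw strings.
model.fit(X_list, y)
y_pred = model.predict(X_original)
print(y_pred)
```

Because the fitted vectorizer lives inside the pipeline, predict() can take the raw strings directly, and the NotFittedError from point (7) cannot occur.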
Vivek Kumar answered Jan 21 '26 08:01