Goal: Predict labels on my original data
Background: I constructed an SVM classifier
I am using the following code:
0) Import modules
import numpy as np
from sklearn import datasets
from sklearn import svm
from sklearn.model_selection import StratifiedKFold
from sklearn.metrics import precision_score, recall_score, accuracy_score
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import precision_recall_fscore_support
1) X_list and y
type(X_list) #list, strings
len(X_list) #2163
type(y) #numpy.ndarray
len(y) #2163
2) convert X_list from string to float, use tfidf
tfidf = TfidfVectorizer()
X_vec = tfidf.fit_transform(X_list)
X = X_vec.toarray()
3) X shape
X.shape #(2163, 8753)
4) 10 fold validation and SVM
skf = StratifiedKFold(n_splits=10)
clf = svm.SVC(kernel='linear', C=1)
5) loop through 10 folds
precision_scores = []
recall_scores = []
f_scores = []
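for train_index, test_index in skf.split(X, y):
    X_train = X[train_index]
    X_test = X[test_index]
    y_train = y[train_index]
    y_test = y[test_index]
    clf.fit(X_train, y_train)
    y_pred = clf.predict(X_test)
    # Assumed: the post does not show how scores is computed;
    # the averaging mode here is a guess.
    scores = precision_recall_fscore_support(y_test, y_pred, average='weighted')
    precision_scores.append(scores[0])
    recall_scores.append(scores[1])
    f_scores.append(scores[2])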
6) Predict on original dataset X_original
type(X_original) #list, strings
len(X_original) #2163
7) Convert X_original from string to float
tfidf = TfidfVectorizer()
X_original_transform = tfidf.transform(X_original)
But when I do so, I get the following error:
`NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted.`
SO has a similar question ("NotFittedError: TfidfVectorizer - Vocabulary wasn't fitted"), but it seems different from my issue.
8) How do I fix this error?
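In point (7) above, you initialize tfidf again, which creates a new TfidfVectorizer instance that has no fitted vocabulary. You then call transform() on it without fitting it first, hence the error.
You need to fit it before transforming, the same way you did in point (2).
Change point (7) to:
tfidf = TfidfVectorizer()
# fit_transform() learns the vocabulary and transforms in one step.
X_original_transform = tfidf.fit_transform(X_original)
Note that this only gives features compatible with clf when the learned vocabulary matches the one clf was trained on. To predict with the trained clf, it is safer to drop the re-initialization and call transform() on the vectorizer that was fitted on the training data.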
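To make the cause concrete, here is a minimal sketch on a made-up corpus (the strings and the names `fitted`, `fresh`, `train_docs`, and `new_docs` are illustrative, not from the question). It shows a fresh TfidfVectorizer raising NotFittedError on transform(), while an already fitted instance maps new text into the same feature space.

```python
from sklearn.exceptions import NotFittedError
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus, assumed purely for illustration.
train_docs = ["the cat sat", "the dog barked", "a cat and a dog"]
new_docs = ["the cat barked"]

fitted = TfidfVectorizer()
X_train = fitted.fit_transform(train_docs)  # learns the vocabulary

# A brand-new instance has no vocabulary, so transform() fails.
fresh = TfidfVectorizer()
try:
    fresh.transform(new_docs)
    raised = False
except NotFittedError:
    raised = True

# Reusing the fitted instance keeps the column layout used during fitting.
X_new = fitted.transform(new_docs)
same_width = X_new.shape[1] == X_train.shape[1]
```

Here `raised` ends up True for the fresh instance, and `X_new` has the same number of columns as `X_train`, which is what a classifier trained on `X_train` expects.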
Also, in point (2) you fit the TfidfVectorizer on the whole dataset and only then split it into train and test. This is not recommended because it leaks information about the test data into the model during training. Consider how this works in a real-world situation: do you have all the information about the data you want to predict in advance? No. You train the model on the available data and use it on unseen data. Your current code in point (2) breaks this.
Always split into train and test first, then fit (fit()) only on the training data and use what was learned to transform (transform()) the testing data.
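As a rough illustration of the leak, the sketch below fits one vectorizer on train and test together and another on the training documents only. The documents are made up, and the names `leaky` and `clean` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents; "zebra" appears only in the held-out text.
train_docs = ["red apple", "green apple", "red car"]
test_docs = ["green zebra"]

# Leaky: the vocabulary is learned from train and test together.
leaky = TfidfVectorizer().fit(train_docs + test_docs)

# Clean: the vocabulary is learned from the training documents only.
clean = TfidfVectorizer().fit(train_docs)

leaky_knows_zebra = "zebra" in leaky.vocabulary_
clean_knows_zebra = "zebra" in clean.vocabulary_

# Words unseen during fit are dropped when transforming the test documents.
X_test = clean.transform(test_docs)
```

The leaky vectorizer has already seen a word that exists only in the test text, and its IDF weights are also influenced by the test documents. The clean one behaves as it would on truly unseen data.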
Change it like this:
1) First, remove the code in point (2). We will do the vectorizing inside the loop over the folds.
2) Change point (5) to:
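for train_index, test_index in skf.split(X_list, y):
    # X_list is a plain Python list, so pick the rows with a comprehension
    X_train = [X_list[i] for i in train_index]
    X_test = [X_list[i] for i in test_index]
    y_train = y[train_index]
    y_test = y[test_index]
    tfidf = TfidfVectorizer()
    # This is what I'm talking about: fit on the training fold only
    X_train = tfidf.fit_transform(X_train)
    clf.fit(X_train, y_train)
    # Only call transform() here
    X_test = tfidf.transform(X_test)
    y_pred = clf.predict(X_test)
    # Assumed: the question does not show how scores is computed;
    # the averaging mode here is a guess.
    scores = precision_recall_fscore_support(y_test, y_pred, average='weighted')
    precision_scores.append(scores[0])
    recall_scores.append(scores[1])
    f_scores.append(scores[2])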
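One way to get the same fit-on-train, transform-on-test behavior without managing the vectorizer by hand is a scikit-learn pipeline. This is only a sketch under assumptions: the toy corpus, the 2-fold split, and the macro-averaged scorers are stand-ins chosen so the example runs on its own, not values from the question.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline
from sklearn.svm import SVC

# Toy stand-ins for X_list and y (assumed data, not from the question).
X_list = ["good movie", "great film", "nice plot", "fine acting",
          "bad movie", "awful film", "poor plot", "weak acting"]
y = np.array([1, 1, 1, 1, 0, 0, 0, 0])

# The pipeline refits TfidfVectorizer on each training fold only.
model = make_pipeline(TfidfVectorizer(), SVC(kernel="linear", C=1))

skf = StratifiedKFold(n_splits=2)  # 2 folds only because the toy set is tiny
cv_results = cross_validate(
    model, X_list, y, cv=skf,
    scoring=["precision_macro", "recall_macro", "f1_macro"],
)

# After evaluation, fit once on all labelled data and predict on raw strings.
model.fit(X_list, y)
predictions = model.predict(["great movie", "awful plot"])
```

Because the vectorizer and the classifier travel together, `model.predict()` accepts raw strings such as X_original directly, and the vocabulary used at prediction time is always the one the classifier was trained with.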