I have built a MultinomialNB classifier and saved it to a pickle file for later use (credit goes to this YouTube video: https://www.youtube.com/watch?v=0kPRaYSgblM&t=927s and a few others). Below is my code:
import sklearn.datasets as skd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
import pickle
categories = ['alt.atheism', 'soc.religion.christian', 'comp.graphics', 'sci.med']
train_data = skd.load_files('E:/Python/Datasets/train', categories=categories, encoding='ISO-8859-1')
test_data = skd.load_files('E:/Python/Datasets/test', categories=categories, encoding='ISO-8859-1')
tf_vect = TfidfVectorizer()
tfidf_train = tf_vect.fit_transform(train_data.data)
clf = MultinomialNB().fit(tfidf_train, train_data.target)
with open('classifier', 'wb') as picklefile:
    pickle.dump(clf, picklefile)
Now, in a separate code file, I can read it back into a new variable, new_clf, to use the classifier with new text data:
import pickle
# Raw string so the backslashes in the Windows path are not treated as escapes
with open(r'E:\Python\Text Classification\classifier', 'rb') as tm:
    new_clf = pickle.load(tm)
If I were still in the same session as the training code, I would have the tf_vect variable, which is already fitted on my training data. I could simply transform the new texts with it and pass the result to new_clf to get predictions.
But in my case, once the model is trained, I want to send it to another user, who will have a separate code file that reads the classifier and passes new text to it for prediction.
The problem shows up in the code below, which ends with a ValueError: dimension mismatch:
new_text = ['God is Love', 'OpenGL is fast on GPU']
new_clf.predict(new_text)
I understand that I am not transforming new_text into the features learned from the training data, but I cannot figure out how to solve this.
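To make the point concrete, here is a minimal sketch of what the classifier expects. The tiny corpus and labels are made up for illustration and stand in for train_data.data and train_data.target; everything else uses the same objects as the code above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for train_data.data / train_data.target.
train_texts = [
    "God is love and faith",
    "prayer and church on Sunday",
    "OpenGL renders graphics on the GPU",
    "fast image rendering with shaders",
]
train_labels = [0, 0, 1, 1]

tf_vect = TfidfVectorizer()
tfidf_train = tf_vect.fit_transform(train_texts)
clf = MultinomialNB().fit(tfidf_train, train_labels)

# The classifier stores per-class feature counts, one column per vocabulary term.
n_features = clf.feature_count_.shape[1]

new_text = ['God is Love', 'OpenGL is fast on GPU']

# transform (not fit_transform) reuses the training vocabulary, so the
# resulting matrix has exactly the column count the classifier was trained on.
tfidf_new = tf_vect.transform(new_text)
predictions = clf.predict(tfidf_new)
```

Passing raw strings, or a matrix built by a differently fitted vectorizer, gives the classifier input that does not match n_features, which is why the fitted tf_vect has to be available wherever predict is called.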
Should I create another pickle file containing tf_vect and share it with the user? Or is the vectorizer already stored in the classifier file, and I am just missing how to get it out?
You could indeed save two pickle files, one for the vectorizer and one for the classifier. However, the more convenient and recommended solution is to combine the vectorizer and the classifier into a single Pipeline object and pickle that. The pipeline applies the fitted vectorizer to the raw text before handing the result to the classifier, so the user only needs one file.
from sklearn.pipeline import Pipeline
tf_vect = TfidfVectorizer()
clf = MultinomialNB()
pipe = Pipeline([("vectorizer", tf_vect), ("classifier", clf)])
pipe.fit(train_data.data, train_data.target)
with open('classifier', 'wb') as picklefile:
    pickle.dump(pipe, picklefile)
Once you load that pickle file, you can call it directly on new raw text:
with open('/.../classifier', 'rb') as tm:
    new_pipe = pickle.load(tm)
new_pipe.predict(new_text)
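For comparison, the two-pickle alternative mentioned at the start of this answer could look roughly like the sketch below. The toy corpus is made up for illustration, and in-memory bytes (pickle.dumps / pickle.loads) stand in for the two files you would actually write and send.

```python
import pickle

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical toy corpus standing in for train_data.data / train_data.target.
train_texts = [
    "God is love and faith",
    "prayer and church on Sunday",
    "OpenGL renders graphics on the GPU",
    "fast image rendering with shaders",
]
train_labels = [0, 0, 1, 1]

tf_vect = TfidfVectorizer()
clf = MultinomialNB().fit(tf_vect.fit_transform(train_texts), train_labels)

# Sender side: serialize the fitted vectorizer and the classifier separately.
vect_bytes = pickle.dumps(tf_vect)
clf_bytes = pickle.dumps(clf)

# Receiver side: load both, then transform before predicting.
new_vect = pickle.loads(vect_bytes)
new_clf = pickle.loads(clf_bytes)

new_text = ['God is Love', 'OpenGL is fast on GPU']
predictions = new_clf.predict(new_vect.transform(new_text))
```

This works, but the receiver has to keep the two objects paired and remember to call transform first, which is the bookkeeping the Pipeline removes.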