
sklearn: vectorizing in cross validation for text classification

Tags:

scikit-learn

I have a question about cross-validation for text classification in sklearn. Vectorizing all the data before cross-validation is problematic, because the classifier would then have "seen" the vocabulary that occurs in the test data. Weka has a filtered classifier (FilteredClassifier) to solve this problem. What is the sklearn equivalent? That is, for each fold the feature set should be different, because the training data are different.

asked Mar 26 '14 at 20:03 by user3466018


1 Answer

The scikit-learn solution to this problem is to cross-validate a Pipeline of estimators, e.g.:

>>> from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in 0.20
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import LinearSVC
>>> clf = Pipeline([('vect', TfidfVectorizer()), ('svm', LinearSVC())])

clf is now a composite estimator that does both feature extraction and SVM model fitting. Given a list of documents (i.e. an ordinary Python list of strings) named documents and their labels y, calling

>>> cross_val_score(clf, documents, y)

will do feature extraction in each fold separately, so that each of the SVMs knows only the vocabulary of its own training set, i.e. the other (k-1) folds.
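To check this for yourself, here is a minimal sketch that keeps the fitted pipeline from every fold and compares its vocabulary with the vocabulary of the full corpus. The toy documents and labels are made up for illustration, and it assumes a scikit-learn version (0.20 or later) where cross_validate accepts return_estimator=True.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: every document contains one word found nowhere else
# (apple, banana, ...), so holding a document out always removes some vocabulary.
documents = [
    "cheap pills online apple",
    "cheap loans online banana",
    "win money online cherry",
    "cheap money prize date",
    "meeting agenda monday elder",
    "project meeting notes fig",
    "lunch agenda monday grape",
    "project notes review hazel",
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

clf = Pipeline([("vect", TfidfVectorizer()), ("svm", LinearSVC())])
cv = StratifiedKFold(n_splits=4)  # each test fold holds one document per class

# return_estimator=True keeps the pipeline fitted on each training split.
results = cross_validate(clf, documents, y, cv=cv, return_estimator=True)

# Vocabulary you would get by (wrongly) vectorizing everything up front.
full_vocab = set(TfidfVectorizer().fit(documents).vocabulary_)

# Vocabulary actually learned inside each fold's pipeline.
fold_vocabs = [
    set(est.named_steps["vect"].vocabulary_) for est in results["estimator"]
]

for i, vocab in enumerate(fold_vocabs):
    missing = sorted(full_vocab - vocab)
    print(f"fold {i}: {len(vocab)} terms, unseen test-only terms: {missing}")
```

Each fold's vocabulary should come out strictly smaller than the full one, which is exactly the separation the question asks for. Words that appear only in the held-out documents are simply ignored by that fold's vectorizer at prediction time.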

answered Jan 02 '23 at 06:01 by Fred Foo


