sklearn: vectorizing in cross validation for text classification

Question

I have a question about using cross-validation for text classification in sklearn. Vectorizing all the data before cross-validation is problematic, because the classifier would have "seen" the vocabulary that occurs in the test data. Weka has a filtered classifier to solve this problem. What is the sklearn equivalent of this function? What I mean is that for each fold the feature set would be different, because the training data are different.

Fred Foo · Accepted Answer

The scikit-learn solution to this problem is to cross-validate a Pipeline of estimators, e.g.:

>>> from sklearn.model_selection import cross_val_score  # sklearn.cross_validation was removed in 0.20
>>> from sklearn.feature_extraction.text import TfidfVectorizer
>>> from sklearn.pipeline import Pipeline
>>> from sklearn.svm import LinearSVC
>>> clf = Pipeline([('vect', TfidfVectorizer()), ('svm', LinearSVC())])

clf is now a composite estimator that does both feature extraction and SVM model fitting. Given a list of documents (i.e. an ordinary Python list of strings) named documents and their labels y, calling

>>> cross_val_score(clf, documents, y)

will do feature extraction in each fold separately, so that each SVM knows only the vocabulary of the (k-1) folds that make up its training set.
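To check that the vocabulary really is rebuilt per fold, here is a minimal sketch. It assumes a recent scikit-learn (0.20 or later, where cross_validate accepts return_estimator=True); the toy corpus and labels are made up for illustration, and every word in it is deliberately unique, so each training split must miss the words of its held-out documents.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus (not from the question): no word appears twice.
documents = [
    "cheap pills buy now",
    "win money fast today",
    "free prize claim here",
    "limited offer click link",
    "meeting agenda for monday",
    "please review attached report",
    "lunch plans this friday",
    "project deadline moved again",
]
y = [1, 1, 1, 1, 0, 0, 0, 0]

clf = Pipeline([("vect", TfidfVectorizer()), ("svm", LinearSVC())])
cv = StratifiedKFold(n_splits=4)

# return_estimator=True keeps the pipeline fitted on each training split.
results = cross_validate(clf, documents, y, cv=cv, return_estimator=True)

# Vocabulary you would get by (wrongly) vectorizing everything up front.
full_vocab = set(TfidfVectorizer().fit(documents).vocabulary_)

# Vocabulary each fold's vectorizer actually learned from its training split.
fold_vocabs = [
    set(est.named_steps["vect"].vocabulary_) for est in results["estimator"]
]
for i, vocab in enumerate(fold_vocabs):
    print(f"fold {i}: {len(vocab)} of {len(full_vocab)} terms seen in training")
```

Each fold's vocabulary should come out as a strict subset of the full one, which is the behaviour the question asks for. The scores on a corpus this small mean nothing; the point is only which terms each fitted vectorizer contains.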

Tags:

scikit-learn

user3466018