I am using scikit-learn to classify text into categories. I use CountVectorizer and TfidfTransformer to build the sparse feature matrix.
I perform a couple of pre-processing steps on each string in my custom tokenize_and_stem function, which is passed to CountVectorizer as its tokenizer.
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
# tokenize_and_stem is my custom tokenizer, defined elsewhere
SVM = Pipeline([
    ('vect', CountVectorizer(max_features=100000, ngram_range=(1, 2),
                             stop_words='english', tokenizer=tokenize_and_stem)),
    ('tfidf', TfidfTransformer(use_idf=True)),
    ('clf-svm', LinearSVC(C=1)),
])
My question is: is there an easy way to view or store the output of step 1 or 2 of the Pipeline, so that I can analyse what kind of array is going into the SVM?
You can get the output of the intermediate steps with something like the following, which is based on the scikit-learn Pipeline source code:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.pipeline import Pipeline
pipeline = Pipeline([
    ('vect', TfidfVectorizer(ngram_range=(1, 2), stop_words='english')),
    ('clf-svm', LinearSVC(C=1)),
])
X = ["I want to test this document", "let us see how it works", "I am okay and you ?"]
pipeline.fit(X, [0, 1, 1])
# get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() replaces it
print(pipeline.named_steps['vect'].get_feature_names_out().tolist())
['document', 'let', 'let works', 'okay', 'test', 'test document', 'want', 'want test', 'works']
# Here is where you can get the output of the intermediate steps
Xt = X
for name, transform in pipeline.steps[:-1]:
    if transform is not None:
        Xt = transform.transform(Xt)
print(Xt)
(0, 7) 0.4472135954999579
(0, 6) 0.4472135954999579
(0, 5) 0.4472135954999579
(0, 4) 0.4472135954999579
(0, 0) 0.4472135954999579
(1, 8) 0.5773502691896257
(1, 2) 0.5773502691896257
(1, 1) 0.5773502691896257
(2, 3) 1.0
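The question also asks about storing the intermediate output, so here is a minimal sketch of a shorter route. It assumes scikit-learn 0.21 or later, where a Pipeline can be sliced, and it uses SciPy's save_npz/load_npz to write the sparse matrix to disk. The names docs, labels, features and the file name svm_input.npz are only illustrative, and the temporary directory stands in for wherever you actually want to keep the file.

```python
import os
import tempfile

from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

docs = ["I want to test this document", "let us see how it works", "I am okay and you ?"]
labels = [0, 1, 1]

pipeline = Pipeline([
    ("vect", TfidfVectorizer(ngram_range=(1, 2), stop_words="english")),
    ("clf-svm", LinearSVC(C=1)),
])
pipeline.fit(docs, labels)

# pipeline[:-1] is a Pipeline made of every fitted step except the final
# classifier, so transform() returns exactly what LinearSVC receives.
features = pipeline[:-1].transform(docs)
print(features.shape)

# For a small matrix, features.toarray() gives a dense view for inspection.

# Store the sparse matrix and read it back (hypothetical file name).
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "svm_input.npz")
    sparse.save_npz(path, features)
    restored = sparse.load_npz(path)
```

With the three documents above, features should have one row per document and one column per entry in the feature-name list printed earlier. For the pipeline in the question, the same slice would run both the 'vect' and 'tfidf' steps before returning.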