I am working on a classification task with Scikit-learn. I have a data set in which each observation comprises two separate text fields. I want to set up a Pipeline in which each text field is passed in parallel through its own TfidfVectorizer and the outputs of the TfidfVectorizer objects are passed to a classifier. My aim is to be able to optimize the parameters of the two TfidfVectorizer objects along with those of the classifier, using GridSearchCV.
The Pipeline might be depicted as follows:
Text 1 -> TfidfVectorizer 1 --------|
                                    +---> Classifier
Text 2 -> TfidfVectorizer 2 --------|
I understand how to do this without using a Pipeline (by just creating two TfidfVectorizer objects and working from there), but how do I set this up inside a Pipeline?
Thanks for any help,
Rob.
Use the Pipeline and FeatureUnion classes. The code for your case would look something like:
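from sklearn.pipeline import Pipeline, FeatureUnion
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

# ExtractText1 and ExtractText2 are placeholders for your own transformers
# that each select one text field; they are not part of scikit-learn.
pipeline = Pipeline([
  ('features', FeatureUnion([
    ('c1', Pipeline([
      ('text1', ExtractText1()),
      ('tf_idf1', TfidfVectorizer())
    ])),
    ('c2', Pipeline([
      ('text2', ExtractText2()),
      ('tf_idf2', TfidfVectorizer())
    ]))
  ])),
  ('classifier', MultinomialNB())
])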
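The extraction steps in the pipeline above are not scikit-learn classes, so you have to write them yourself. Here is a minimal sketch of one way to do it, assuming each observation is a dict with one key per text field. The FieldExtractor name, the "title" and "body" keys, and the sample records are all made up for illustration; adapt them to however your data is actually stored.

```python
from sklearn.base import BaseEstimator, TransformerMixin


class FieldExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical helper: pick one text field out of each record."""

    def __init__(self, key):
        # Stored under the same name as the argument so clone() and
        # get_params() work as scikit-learn expects.
        self.key = key

    def fit(self, X, y=None):
        # Nothing to learn; the transformer is stateless.
        return self

    def transform(self, X):
        # X is assumed to be a sequence of dicts; return a list of strings
        # that a TfidfVectorizer can consume directly.
        return [record[self.key] for record in X]


# Made-up sample data with two text fields per observation.
records = [
    {"title": "cheap pills offer", "body": "buy cheap pills now"},
    {"title": "meeting agenda notes", "body": "see the meeting agenda attached"},
]

titles = FieldExtractor("title").fit_transform(records)
bodies = FieldExtractor("body").fit_transform(records)
```

With a helper like this, ExtractText1() and ExtractText2() could be replaced by two instances of the same class with different keys.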
You can run a grid search over the entire structure by referring to parameters with the <estimator1>__<estimator2>__<parameter> syntax. For example, features__c1__tf_idf1__min_df refers to the min_df parameter of TfidfVectorizer 1 in your diagram.
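To make the double-underscore naming concrete, here is a rough end-to-end sketch that tunes both vectorizers and the classifier in one GridSearchCV. It is only an illustration under some assumptions: the FieldExtractor class, the "title"/"body" keys, the toy data, and the particular parameter values are invented here, and the step names are kept the same as in the pipeline above so the parameter paths line up.

```python
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import FeatureUnion, Pipeline


class FieldExtractor(BaseEstimator, TransformerMixin):
    """Hypothetical helper: select one text field from each dict record."""

    def __init__(self, key):
        self.key = key

    def fit(self, X, y=None):
        return self

    def transform(self, X):
        return [record[self.key] for record in X]


pipeline = Pipeline([
    ("features", FeatureUnion([
        ("c1", Pipeline([
            ("text1", FieldExtractor("title")),
            ("tf_idf1", TfidfVectorizer()),
        ])),
        ("c2", Pipeline([
            ("text2", FieldExtractor("body")),
            ("tf_idf2", TfidfVectorizer()),
        ])),
    ])),
    ("classifier", MultinomialNB()),
])

# Each key is the path of step names joined by "__", ending in the
# parameter name on the final estimator.
param_grid = {
    "features__c1__tf_idf1__min_df": [1, 2],
    "features__c2__tf_idf2__ngram_range": [(1, 1), (1, 2)],
    "classifier__alpha": [0.1, 1.0],
}

# Toy data (made up): two trivially separable classes, repeated so that
# every cross-validation fold contains both classes.
spam = {"title": "cheap pills offer", "body": "buy cheap pills now"}
ham = {"title": "meeting agenda notes", "body": "see the meeting agenda attached"}
X = [spam, ham] * 4
y = [1, 0] * 4

search = GridSearchCV(pipeline, param_grid, cv=2)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
```

If you are unsure what a parameter path should be, pipeline.get_params().keys() lists every name the grid search will accept.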