In the scikit-learn docs on Pipeline, all the examples apply the transformers (e.g. StandardScaler, PCA) to the entire dataset.
Is it possible to, say, scale only a specific variable in the dataset? If so, I could put my entire feature engineering process into a Pipeline and apply it to both my train and test sets.
You can use a combination of FeatureUnion and custom transformers that take only the variable you're interested in.
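A minimal sketch of that idea, assuming the data is a plain NumPy array and that only column 0 should be scaled. `ColumnSelector` is a hypothetical helper written for this example, not part of scikit-learn, and the column indices and sample values are made up for illustration.

```python
import numpy as np
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import FeatureUnion, Pipeline
from sklearn.preprocessing import StandardScaler


class ColumnSelector(BaseEstimator, TransformerMixin):
    """Hypothetical helper: keep only the given column indices of a 2-D array."""

    def __init__(self, columns):
        self.columns = columns

    def fit(self, X, y=None):
        # Nothing to learn; selection is stateless.
        return self

    def transform(self, X):
        return np.asarray(X)[:, self.columns]


# Branch 1 selects column 0 and scales it; branch 2 passes columns 1 and 2
# through unchanged. FeatureUnion concatenates the branch outputs side by side.
features = FeatureUnion([
    ("scaled", Pipeline([
        ("select", ColumnSelector([0])),
        ("scale", StandardScaler()),
    ])),
    ("unscaled", ColumnSelector([1, 2])),
])

# Made-up data for illustration only.
X_train = np.array([[1.0, 10.0, 0.0],
                    [2.0, 20.0, 1.0],
                    [3.0, 30.0, 0.0]])
X_test = np.array([[4.0, 40.0, 1.0]])

# Fit on the training set only, then reuse the fitted scaler on the test set.
features.fit(X_train)
train_out = features.transform(X_train)
test_out = features.transform(X_test)
```

Because the scaler is fitted once on the training data, the test set is scaled with the training mean and standard deviation, which is what makes it safe to reuse the same object on both sets. The `features` object can also be used as the first step of a larger Pipeline that ends in an estimator.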
However, you're right that sklearn does not handle heterogeneous feature sets particularly well. There is a library, sklearn-pandas, that makes this a lot easier by letting you define separate pipelines for specific columns of a pandas DataFrame.