I want to create a Pipeline in Scikit-Learn in which one step detects and removes outliers, so that the filtered data is passed on to the other transformers and the final estimator.
I have searched SE but can't find this answer anywhere. Is this possible?
The imblearn package provides its own set of samplers, but we can also plug in a custom sampler with imblearn.FunctionSampler. We can take advantage of this to write our own function that removes outliers and call it within the pipeline as a sampler.
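A minimal sketch of that idea, assuming imbalanced-learn is installed. The helper names `reject_outliers` and `build_pipeline`, the choice of LocalOutlierFactor as the detector, and the LogisticRegression final step are illustrative assumptions, not something the answer prescribes:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor


def reject_outliers(X, y, n_neighbors=20):
    """Drop the rows that LocalOutlierFactor labels as outliers (-1)."""
    X = np.asarray(X)
    y = np.asarray(y)
    labels = LocalOutlierFactor(n_neighbors=n_neighbors).fit_predict(X)
    keep = labels == 1  # fit_predict returns 1 for inliers, -1 for outliers
    return X[keep], y[keep]


def build_pipeline():
    """Wire the function into an imbalanced-learn pipeline as a sampler step."""
    # Imported here so reject_outliers can be used without imbalanced-learn.
    from imblearn import FunctionSampler
    from imblearn.pipeline import Pipeline
    from sklearn.linear_model import LogisticRegression

    return Pipeline([
        ("outliers", FunctionSampler(func=reject_outliers)),
        ("clf", LogisticRegression()),  # hypothetical final estimator
    ])
```

Because the step is a sampler, the imblearn pipeline applies it to both X and y while fitting and skips it at prediction time, so no rows are dropped from the data you predict on.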
In simple terms, we can think of anomalies as unusual or unexpected data instances within a dataset. The term is often used interchangeably with outliers. Similarly, novelties are also anomalies in data, but they only exist in new instances.
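The difference between the two can be sketched with LocalOutlierFactor, which supports both modes. The synthetic data and variable names below are assumptions made for illustration only:

```python
import numpy as np
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.RandomState(0)
X_train = rng.normal(size=(200, 2))

# Outlier detection: label the same data the model is fitted on.
X_with_outlier = np.vstack([X_train, [[8.0, 8.0]]])
outlier_labels = LocalOutlierFactor(n_neighbors=20).fit_predict(X_with_outlier)

# Novelty detection: fit on clean data, then judge previously unseen points.
novelty_model = LocalOutlierFactor(n_neighbors=20, novelty=True).fit(X_train)
X_new = np.array([[0.0, 0.0], [8.0, 8.0]])
novelty_labels = novelty_model.predict(X_new)
```

In both cases a label of -1 marks an anomalous point and 1 marks a normal one.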
The Scikit-learn pipeline is a tool that chains all steps of the workflow together for a more streamlined procedure. The key benefit of building a pipeline is improved readability. Pipelines are able to execute a series of transformations with one call, allowing users to attain results with less code.
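As a small illustration of chaining steps, here is a sketch with a scaler followed by a classifier. The dataset, the step names and the choice of estimators are assumptions picked only to keep the example short:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data, used only so the example runs on its own.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

pipe = Pipeline([
    ("scale", StandardScaler()),    # transformer step
    ("clf", LogisticRegression()),  # final estimator
])

# One call fits the scaler, transforms X, then fits the classifier.
pipe.fit(X, y)
predictions = pipe.predict(X)
```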
Yes. Subclass the TransformerMixin and build a custom transformer. Here is an extension to one of the existing outlier detection methods:
import numpy as np
from sklearn.base import TransformerMixin
from sklearn.neighbors import LocalOutlierFactor
from sklearn.pipeline import Pipeline
class OutlierExtractor(TransformerMixin):
    def __init__(self, **kwargs):
        """
        Create a transformer to remove outliers. A threshold is set for selection
        criteria, and further arguments are passed to the LocalOutlierFactor class
        Keyword Args:
            neg_conf_val (float): The threshold for excluding samples with a lower
               negative outlier factor.
        Returns:
            object: to be used as a transformer method as part of Pipeline()
        """
        self.threshold = kwargs.pop('neg_conf_val', -10.0)
        self.kwargs = kwargs
    def transform(self, X, y=None):
        """
        Uses LocalOutlierFactor class to subselect data based on some threshold
        Returns:
            ndarray: subsampled data, or a tuple (X, y) of subsampled data
                and labels when y is given
        Notes:
            X should be of shape (n_samples, n_features)
        """
        X = np.asarray(X)
        lcf = LocalOutlierFactor(**self.kwargs)
        lcf.fit(X)
        # Keep samples whose negative outlier factor is above the threshold
        mask = lcf.negative_outlier_factor_ > self.threshold
        if y is None:
            # Pipeline and fit_transform call transform(X) without y
            return X[mask, :]
        return X[mask, :], np.asarray(y)[mask]
    def fit(self, *args, **kwargs):
        return self
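Then create a pipeline as:
pipe = Pipeline([('outliers', OutlierExtractor()), ...])
Note that scikit-learn's own Pipeline passes only X to transform and does not resample y, so this works when the later steps do not need y. To drop the matching labels as well, use the imblearn pipeline with a FunctionSampler as described above.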