Is there a pythonic way to chain together sklearn's StandardScaler instances to independently scale data with groups? I.e., if I wanted to find independently scale the features of the iris dataset; I could use the following code:
from sklearn.datasets import load_iris
data = load_iris()
df = pd.DataFrame(data['data'], columns=data['feature_names'])
df['class'] = data['target']
means = df.groupby('class').mean()
stds = df.groupby('class').std()
df_rescaled = (
    (df.drop(['class'], 1) - means.reindex(df['class']).values) / 
     stds.reindex(df['class']).values)
Here, I'm subtracting by the mean and dividing by the stdev of each group independently. But Its somewhat hard to carry around these means and stdev's, and essentially, replicate the behavior of StandardScaler when I have a categorical variable I'd like to control for.
Is there a more pythonic / sklearn-friendly way to implement this type of scaling?
Sure, you can use any sklearn operation and apply it to a groupby object.
First, a little convenience wrapper:
import typing
import pandas as pd
class SklearnWrapper:
    def __init__(self, transform: typing.Callable):
        self.transform = transform
    def __call__(self, df):
        transformed = self.transform.fit_transform(df.values)
        return pd.DataFrame(transformed, columns=df.columns, index=df.index)
This one will apply any sklearn transform you pass into it to a group. 
And finally simple usage:
from sklearn.datasets import load_iris
from sklearn.preprocessing import StandardScaler
data = load_iris()
df = pd.DataFrame(data["data"], columns=data["feature_names"])
df["class"] = data["target"]
df_rescaled = (
    df.groupby("class")
    .apply(SklearnWrapper(StandardScaler()))
    .drop("class", axis="columns")
)
EDIT: You can pretty much do anything with SklearnWrapper.
Here is an example of transforming and reversing this operation for each group (e.g. do not overwrite the transformation object) - just fit the object anew each time a new group is seen (and add it to list). 
I have kinda replicated a bit of sklearn's functionality for easier usage (you can extend it with any function you want by passing appropriate string to _call_with_function internal method):
class SklearnWrapper:
    def __init__(self, transformation: typing.Callable):
        self.transformation = transformation
        self._group_transforms = []
        # Start with -1 and for each group up the pointer by one
        self._pointer = -1
    def _call_with_function(self, df: pd.DataFrame, function: str):
        # If pointer >= len we are making a new apply, reset _pointer
        if self._pointer >= len(self._group_transforms):
            self._pointer = -1
        self._pointer += 1
        return pd.DataFrame(
            getattr(self._group_transforms[self._pointer], function)(df.values),
            columns=df.columns,
            index=df.index,
        )
    def fit(self, df):
        self._group_transforms.append(self.transformation.fit(df.values))
        return self
    def transform(self, df):
        return self._call_with_function(df, "transform")
    def fit_transform(self, df):
        self.fit(df)
        return self.transform(df)
    def inverse_transform(self, df):
        return self._call_with_function(df, "inverse_transform")
Usage (group transform, inverse operation and apply it again):
data = load_iris()
df = pd.DataFrame(data["data"], columns=data["feature_names"])
df["class"] = data["target"]
# Create scaler outside the class
scaler = SklearnWrapper(StandardScaler())
# Fit and transform data (holding state)
df_rescaled = df.groupby("class").apply(scaler.fit_transform)
# Inverse the operation
df_inverted = df_rescaled.groupby("class").apply(scaler.inverse_transform)
# Apply transformation once again
df_transformed = (
    df_inverted.groupby("class")
    .apply(scaler.transform)
    .drop("class", axis="columns")
)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With