I would like to perform a regression analysis and test different transformations of the input variables for the same model. To accomplish this, I created a dictionary with the different pipelines, which I loop through:
import numpy as np
import pandas as pd
from sklearn.preprocessing import FunctionTransformer
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.metrics import mean_squared_error
from sklearn.preprocessing import PowerTransformer
from sklearn.compose import TransformedTargetRegressor
# Define transformations and models
models = {
    'linear': LinearRegression(),
    'power': make_pipeline(PowerTransformer(), LinearRegression()),
    'log': make_pipeline(FunctionTransformer(np.log, np.exp),
                         LinearRegression()),
    'log-sqrt': TransformedTargetRegressor(
        regressor=make_pipeline(
            FunctionTransformer(np.log, np.exp),
            LinearRegression()),
        func=np.sqrt,
        inverse_func=np.square
    )
}

parameters = pd.DataFrame()
for name, model in models.items():
    model.fit(x_train, y_train)
    y_hat = model.predict(x_hat)
    y_hat_train = model.predict(x_train)
    r2 = model.score(x_train, y_train)
    parameters.at[name, 'MSE'] = mean_squared_error(y_train, y_hat_train)
    parameters.at[name, 'R2'] = r2

best_model = parameters['R2'].idxmax()
This works. However, there is probably a more elegant solution similar to GridSearchCV for evaluating models. Can anyone give me some advice on what I should be looking for?
The first thing that comes to my mind is this:
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV

pipeline = Pipeline([
    ("preproc", None),
    ("model", LinearRegression()),
])
params = {
    "preproc": [
        None,
        PowerTransformer(),
        FunctionTransformer(np.log, np.exp),
    ],
    "model": [
        LinearRegression(),
        TransformedTargetRegressor(
            regressor=LinearRegression(),
            func=np.sqrt,
            inverse_func=np.square,
        ),
    ],
}
search = GridSearchCV(
    pipeline,
    params,
    ...
)
This doesn't net you cool names for the resultant models, and it also produces 6 models instead of your 4 (having additionally power-sqrt and plain sqrt). You also won't retain the actual trained models (aside from an optional-but-on-by-default final "best" estimator), if that's something you need. You'll get some automatic cross-validation, which is probably good. And you'll get parallelization for free.
This relies heavily on the fact that entire pipeline steps can be set as a "hyperparameter" in the sklearn search API, and on its recognition of None as a no-op step.
I believe Pipeline(preproc, TransformedTarget(model)) [that my approach produces] is operationally the same as TransformedTarget(Pipeline(preproc, model)) [that you've coded].
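To fill in the elided arguments, here is a minimal sketch of running the search on the pipeline and params defined above and inspecting the results; the scoring, cv, and n_jobs values are illustrative choices, not anything prescribed by sklearn:
search = GridSearchCV(
    pipeline,
    params,
    scoring="r2",  # illustrative metric choice
    cv=5,          # illustrative number of folds
    n_jobs=-1,     # evaluate candidates in parallel
)
search.fit(x_train, y_train)

# cross-validated score of every preproc/model combination
results = pd.DataFrame(search.cv_results_)[
    ["params", "mean_test_score", "rank_test_score"]
]
print(search.best_params_)
best_pipeline = search.best_estimator_  # refit on the full training data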
If the aim is to obtain the best performing pipeline in a hyperparameter search framework, then you could use Optuna as follows:
import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    # the preprocessing method is the hyperparameter to optimize:
    preproc = trial.suggest_categorical(
        "preproc", [None, "power", "log", "log-sqrt"]
    )
    if preproc is None:
        model = LinearRegression()
    elif preproc == "power":
        model = make_pipeline(PowerTransformer(), LinearRegression())
    elif preproc == "log":
        model = make_pipeline(
            FunctionTransformer(np.log, np.exp), LinearRegression())
    elif preproc == "log-sqrt":
        model = TransformedTargetRegressor(
            regressor=make_pipeline(
                FunctionTransformer(np.log, np.exp),
                LinearRegression()),
            func=np.sqrt,
            inverse_func=np.square
        )
    # use a regression metric (roc_auc only applies to classification)
    score = cross_val_score(model, X_train, y_train, scoring='r2', cv=3)
    return score.mean()

# set up the study; GridSampler needs the search space up front
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.GridSampler(
        {"preproc": [None, "power", "log", "log-sqrt"]}
    )
)
study.optimize(objective, n_trials=4)  # one trial per grid point
I haven't tested the code, so take it as a guideline. Adapted from https://www.kaggle.com/code/solegalli/nested-hyperparameter-spaces-with-optuna
When you create a study in Optuna, the sampler parameter lets you choose grid search, random search, or another strategy. For this case, with just 4 alternatives, grid search should be enough.
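For example, if the search space were larger you could swap in one of Optuna's other built-in samplers; the seed value here is just an illustrative choice:
study = optuna.create_study(
    direction="maximize",
    sampler=optuna.samplers.RandomSampler(seed=0),   # random search
    # sampler=optuna.samplers.TPESampler(seed=0),    # Optuna's default TPE sampler
)
study.optimize(objective, n_trials=15)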
You can obtain the best parameters (here, the name of the winning preprocessing option) like this:
study.best_params
And the results for each trial like this:
study.trials_dataframe()
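Note that Optuna only stores the parameters, not a fitted model, so if you need the trained pipeline you have to rebuild it from study.best_params and refit. A sketch, using a hypothetical helper build_model that mirrors the branches in objective():
def build_model(preproc):
    # hypothetical helper mirroring the branches in objective()
    if preproc is None:
        return LinearRegression()
    if preproc == "power":
        return make_pipeline(PowerTransformer(), LinearRegression())
    if preproc == "log":
        return make_pipeline(FunctionTransformer(np.log, np.exp),
                             LinearRegression())
    return TransformedTargetRegressor(
        regressor=make_pipeline(FunctionTransformer(np.log, np.exp),
                                LinearRegression()),
        func=np.sqrt,
        inverse_func=np.square,
    )

best_model = build_model(study.best_params["preproc"])
best_model.fit(X_train, y_train)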