
Nonlinear feature transformation in Python

In order to fit a linear regression model to some given training data X and labels y, I want to augment my training data X with nonlinear transformations of the given features. Let's say we have the features x1, x2 and x3, and we want to use the following additional transformed features:

x4 = x1^2, x5 = x2^2 and x6 = x3^2

x7 = exp(x1), x8 = exp(x2) and x9 = exp(x3)

x10 = cos(x1), x11 = cos(x2) and x12 = cos(x3)

I tried the following approach, which, however, led to a model that performed very poorly in terms of Root Mean Squared Error (the evaluation criterion):

import pandas as pd
import numpy as np
from sklearn import linear_model
# Import the training data and extract the features and labels
DATAPATH = 'train.csv'
data = pd.read_csv(DATAPATH)
features = data.drop(['Id', 'y'], axis=1)
labels = data[['y']]

# Squared features
features['x6'] = features['x1']**2
features['x7'] = features['x2']**2
features['x8'] = features['x3']**2


# Exponential features
features['x9'] = np.exp(features['x1'])
features['x10'] = np.exp(features['x2'])
features['x11'] = np.exp(features['x3'])


# Cosine features
features['x12'] = np.cos(features['x1'])
features['x13'] = np.cos(features['x2'])
features['x14'] = np.cos(features['x3'])

# Fit an ordinary least squares regression on the augmented features
regr = linear_model.LinearRegression()

regr.fit(features, labels)

I'm quite new to ML, and there is surely a better way to do these nonlinear feature transformations. I'd be very grateful for your help.

Cheers Lukas

1 Answer

As an initial remark, I think there is a cleaner way to transform all the columns at once. One option would be something like:

# Define the list of transformations (identity, square, exp, cos)
trans = [lambda a: a, np.square, np.exp, np.cos]

# Apply and concatenate transformations
features = pd.concat([t(features) for t in trans], axis=1)

# Rename column names
features.columns = [f'x{i}' for i in range(1, len(list(features))+1)]
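Just to illustrate what that snippet produces, here is a tiny toy DataFrame (made-up numbers, only to show the shapes): three original columns become twelve after the four transformations, named x1 through x12.

import numpy as np
import pandas as pd

# Toy data standing in for the real features
demo = pd.DataFrame({'x1': [0.0, 1.0], 'x2': [2.0, 3.0], 'x3': [4.0, 5.0]})

# Apply identity, square, exp and cos, then concatenate column-wise
trans = [lambda a: a, np.square, np.exp, np.cos]
demo = pd.concat([t(demo) for t in trans], axis=1)

# Rename the resulting columns x1 ... x12
demo.columns = [f'x{i}' for i in range(1, len(list(demo)) + 1)]
print(demo.shape)  # (2, 12)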

Regarding the performance of the model, as @warped said in the comments, it is common practice to scale all of your data. Depending on your data distribution, you can use different types of scalers (see the discussions comparing StandardScaler and MinMaxScaler).

Since you are using nonlinear transformations, even if your initial data is normally distributed, it will lose that property after the transformations. Therefore it may be better to use MinMaxScaler.

from sklearn.preprocessing import MinMaxScaler

# Fit the scaler on the features and rescale each column to [0, 1]
scaler = MinMaxScaler()
scaler.fit(features.to_numpy())
scaled_features = scaler.transform(features.to_numpy())

Now each column of scaled_features will range from 0 to 1.

Note that if you fit the scaler before splitting your data with something like train_test_split, data leakage can occur, which also hurts the model.
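One way to avoid that is to split first and let a Pipeline fit the scaler on the training portion only, then evaluate with RMSE. A rough sketch (the 80/20 split and random_state are arbitrary choices; features and labels are the variables built above):

import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import MinMaxScaler
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Split first, so the scaler never sees the test fold
X_train, X_test, y_train, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0)

# The pipeline fits MinMaxScaler on X_train only, then the regression
model = make_pipeline(MinMaxScaler(), LinearRegression())
model.fit(X_train, y_train)

# Root Mean Squared Error on the held-out fold
rmse = np.sqrt(mean_squared_error(y_test, model.predict(X_test)))
print(f'RMSE: {rmse:.4f}')

This way the scaling parameters are learned from the training data alone and simply applied to the test data.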
