Feature scaling for a big dataset [closed]

I am trying to use a deep learning model for time series prediction, and before passing the data to the model I want to scale the different variables as they have widely different ranges.

I have normally done this "on the fly": load the training subset of the data set, obtain the scaler from the whole subset, store it and then load it when I want to use it for testing.

Now the data is pretty big and I will not load all the training data at once for training.

How should I go about obtaining the scaler? A priori I thought of a one-time operation: load all the data once just to calculate the scaler (normally I use the sklearn scalers, like StandardScaler), store it, and then load it during my training process.

Is this a common practice? If it is, what would you do when data is added to the training dataset? Can scalers be combined to avoid that one-time operation and instead just "update" the scaler?

asked Jan 29 '26 13:01 by rpicatoste
1 Answer

StandardScaler in scikit-learn can calculate the mean and std of the data incrementally (over small chunks of data) using partial_fit():

partial_fit(X, y=None)

Online computation of mean and std on X for later scaling. All of X is processed as a single batch. This is intended for cases when fit is not feasible due to very large number of n_samples or because X is read from a continuous stream.
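To see what this means in practice, here is a minimal sketch (using a small NumPy array for illustration) showing that fitting incrementally on chunks ends up with the same statistics as fitting on the full array at once:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

X = np.random.RandomState(0).rand(100, 3)

# Fit on the full array in one go
full = StandardScaler().fit(X)

# Fit incrementally on four chunks of 25 rows each
inc = StandardScaler()
for chunk in np.array_split(X, 4):
    inc.partial_fit(chunk)

# The accumulated statistics match the single-batch fit
print(np.allclose(full.mean_, inc.mean_))  # True
print(np.allclose(full.var_, inc.var_))    # True
```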

You will need two passes over the data:

  • One complete pass (which can be done in batches), calling partial_fit() on each batch to accumulate the mean and std.
  • A second pass in which you transform() each batch on the fly before feeding it to the deep learning framework.

Example:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()

# First pass
# make_generator() should return a fresh generator that yields the data
# in batches; note that a single generator object would be exhausted
# after the first pass, so it must be re-created for the second one
for data in make_generator():
    scaler.partial_fit(data)

    # Inspect the running mean and variance after each batch
    print(scaler.mean_)
    print(scaler.var_)


# Second pass
for data in make_generator():
    scaled_data = scaler.transform(data)

    # Do whatever you want with the scaled_data
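This also answers the follow-up about adding data later: since partial_fit() keeps running statistics, you can persist the fitted scaler and update it when new batches arrive, without re-reading the old data. A minimal sketch (joblib is scikit-learn's recommended persistence tool; the random arrays stand in for your batches):

```python
import joblib
import numpy as np
from sklearn.preprocessing import StandardScaler

rng = np.random.RandomState(0)

# Initial fit on the data you have today
scaler = StandardScaler()
scaler.partial_fit(rng.rand(50, 3))
joblib.dump(scaler, "scaler.joblib")

# Later: load the stored scaler and update it with a new batch;
# the old batches do not need to be read again
scaler = joblib.load("scaler.joblib")
scaler.partial_fit(rng.rand(20, 3))
print(scaler.n_samples_seen_)  # 70
```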
answered Jan 31 '26 02:01 by Vivek Kumar
