How to run model on new data that requires pd.get_dummies?

Question

I have a model that runs the following:

import pandas as pd
import numpy as np

# initialize list of lists 
data = [['tom', 10,1,'a'], ['tom', 15,5,'a'], ['tom', 14,1,'a'], ['tom', 15,4,'b'], ['tom', 18,1,'b'], ['tom', 15,6,'a'], ['tom', 17,3,'a']
       , ['tom', 14,7,'b'], ['tom',16 ,6,'a'], ['tom', 22,2,'a'],['matt', 10,1,'c'], ['matt', 15,5,'b'], ['matt', 14,1,'b'], ['matt', 15,4,'a'], ['matt', 18,1,'a'], ['matt', 15,6,'a'], ['matt', 17,3,'a']
       , ['matt', 14,7,'c'], ['matt',16 ,6,'b'], ['matt', 10,2,'b']]

# Create the pandas DataFrame 
df = pd.DataFrame(data, columns = ['Name', 'Attempts','Score','Category']) 

print(df.head(2))
  Name  Attempts  Score Category
0  tom        10      1        a
1  tom        15      5        a

Then I have created a dummy df to use in the model using the following code:

from sklearn.linear_model import LogisticRegression

df_dum = pd.get_dummies(df)
print(df_dum.head(2))
  Attempts  Score  Name_matt  Name_tom  Category_a  Category_b  Category_c
0        10      1          0         1           1           0           0
1        15      5          0         1           1           0           0

Then I have created the following model:

#Model

X = df_dum.drop(('Score'),axis=1)
y = df_dum['Score'].values

#Training Size
train_size = int(X.shape[0]*.7)
X_train = X[:train_size]
X_test = X[train_size:]
y_train = y[:train_size]
y_test = y[train_size:]


#Fit Model
model = LogisticRegression(max_iter=1000)
model.fit(X_train,y_train)


#Send predictions back to dataframe
Z = model.predict(X_test)
zz = model.predict_proba(X_test)

df.loc[train_size:,'predictions']=Z
dfpredictions = df.dropna(subset=['predictions'])

print(dfpredictions)
    Name  Attempts  Score Category  predictions
14  matt        18      1        a          1.0
15  matt        15      6        a          1.0
16  matt        17      3        a          1.0
17  matt        14      7        c          1.0
18  matt        16      6        b          1.0
19  matt        10      2        b          1.0

Now I have new data that I would like to predict on:

newdata = [['tom', 10,'a'], ['tom', 15,'a'], ['tom', 14,'a']]

newdf = pd.DataFrame(newdata, columns = ['Name', 'Attempts','Category']) 

print(newdf)

 Name  Attempts Category
0  tom        10        a
1  tom        15        a
2  tom        14        a

Then I create dummies and run the prediction:

newpredict = pd.get_dummies(newdf)

predict = model.predict(newpredict)

Output:

ValueError: X has 3 features per sample; expecting 6

This makes sense, because the new data contains no categories b and c and no name matt.

My question is: what is the best way to set this model up, given that my new data won't always have the full set of columns used in the original data? I get new data each day, so I'm not sure of the most efficient and error-free way to do this.

This is example data; my real dataset has 2000 columns after running pd.get_dummies.

finlytics-hub · Accepted Answer

Let me explain Nicolas and BlueSkyz's recommendation in a bit more detail.

pd.get_dummies is useful when you are sure that there will not be any new categories for a specific categorical variable in production/new data set, e.g. Gender, Products, etc. based on your Company or Database's internal data classification/consistency rules.

However, for most machine learning tasks, where you can expect new categories to appear that were not seen during model training, sklearn.preprocessing.OneHotEncoder should be the standard choice. Its handle_unknown parameter can be set to 'ignore' to do just that: ignore new categories when the fitted encoder is applied to future data. From the documentation:

Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None
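
To make the quoted behaviour concrete, here is a minimal sketch on a made-up single-column frame (the `train` and `new` frames and the category "d" are assumptions for illustration, not part of your data). The row holding the unseen category should encode as all zeros:

```python
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy data: "d" never appears in the training frame
train = pd.DataFrame({"Category": ["a", "b", "c"]})
new = pd.DataFrame({"Category": ["a", "d"]})

enc = OneHotEncoder(handle_unknown="ignore")
enc.fit(train)  # learns the categories a, b, c

# The default output is a sparse matrix; .toarray() makes it dense for printing
encoded = enc.transform(new).toarray()
print(enc.categories_)
print(encoded)
```

The first row has a 1 in the column for "a"; the second row, which holds the unknown "d", is zero in every column instead of raising an error.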

The full flow based on OneHotEncoder for your example is as below:

# Create a boolean mask that marks the categorical (object dtype) columns
categorical_feature_mask = df.dtypes == object
# Keep the categorical column names in a list for easy reference later on, in case you have more than a couple of them
categorical_cols = df.columns[categorical_feature_mask].tolist()

# Instantiate the OneHotEncoder object
from sklearn.preprocessing import OneHotEncoder
# sparse_output needs scikit-learn >= 1.2; on older versions pass sparse=False instead
ohe = OneHotEncoder(handle_unknown='ignore', sparse_output=False)
# Fit the encoder on the training data, then encode it
ohe.fit(df[categorical_cols])
cat_ohe = ohe.transform(df[categorical_cols])

# Create a pandas DataFrame of the one-hot encoded columns
# (get_feature_names_out replaces get_feature_names, which was removed in scikit-learn 1.2)
ohe_df = pd.DataFrame(cat_ohe, columns=ohe.get_feature_names_out(categorical_cols), index=df.index)
# Concatenate with the original data and drop the original categorical columns
df_ohe = pd.concat([df, ohe_df], axis=1).drop(columns=categorical_cols)
# Note: df_ohe still contains the target 'Score'; drop it from X before fitting the model

# The following code is for your newdf after training and testing on the original df
# Apply the already-fitted encoder to newdf (transform only, do not refit)
cat_ohe_new = ohe.transform(newdf[categorical_cols])
# Create a pandas DataFrame of the one-hot encoded columns
ohe_df_new = pd.DataFrame(cat_ohe_new, columns=ohe.get_feature_names_out(categorical_cols), index=newdf.index)
# Concatenate with the new data and drop the original categorical columns
df_ohe_new = pd.concat([newdf, ohe_df_new], axis=1).drop(columns=categorical_cols)

# Predict on df_ohe_new
predict = model.predict(df_ohe_new)

Output (that you can assign back to newdf):

array([1, 1, 1])
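
If you would rather not carry the encoder and the concat steps around by hand, one option is to bundle the encoder and the model into a single scikit-learn Pipeline. The following is only a sketch under assumptions: the small `train` frame, the step names "ohe", "prep" and "clf", and the unseen values "zoe" and "z" are made up for illustration, and the data is too small to give meaningful predictions.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy training data shaped like the question's df
train = pd.DataFrame({
    "Name": ["tom", "tom", "matt", "matt", "tom", "matt"],
    "Attempts": [10, 15, 14, 18, 17, 16],
    "Category": ["a", "b", "c", "a", "b", "c"],
    "Score": [1, 5, 1, 4, 1, 5],
})
categorical_cols = ["Name", "Category"]

# One-hot encode the categorical columns, pass the numeric ones through unchanged
preprocess = ColumnTransformer(
    [("ohe", OneHotEncoder(handle_unknown="ignore"), categorical_cols)],
    remainder="passthrough",
)
pipe = Pipeline([
    ("prep", preprocess),
    ("clf", LogisticRegression(max_iter=1000)),
])
pipe.fit(train.drop(columns="Score"), train["Score"])

# New data with a name and a category that were never seen during fit
new = pd.DataFrame({"Name": ["tom", "zoe"], "Attempts": [10, 15], "Category": ["a", "z"]})
pred = pipe.predict(new)
print(pred)
```

Because the encoder is fitted inside the pipeline, each day's raw frame can be passed straight to pipe.predict, and the whole pipeline can be saved and reloaded as a single object.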

However, if you really want to use pd.get_dummies only, then the following can work as well:

newpredict = newpredict.reindex(labels = df_dum.columns, axis = 1, fill_value = 0).drop(columns = ['Score'])
predict = model.predict(newpredict)

The above code snippet makes sure that your new dummies df (newpredict) has the same columns as the original df_dum, filling any column missing from the new data with 0, and then drops the 'Score' column. The output here is the same as above. It also ensures that any categorical values present in the new data set but not in the original training data are removed, while keeping the column order the same as in the original df_dum.
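
As a small illustration of what reindex does here, the sketch below uses made-up frames (`train`, `new` and the unseen category "d" are assumptions for illustration). Columns missing from the new data come back filled with 0, and the dummy column for the unseen category is dropped:

```python
import pandas as pd

# Hypothetical toy frames: "d" appears only in the new data
train = pd.DataFrame({"Attempts": [10, 15, 14], "Category": ["a", "b", "c"]})
new = pd.DataFrame({"Attempts": [12, 11], "Category": ["a", "d"]})

# Column layout the model was trained on
train_cols = pd.get_dummies(train, dtype=int).columns

# Align the new dummies to the training layout
aligned = pd.get_dummies(new, dtype=int).reindex(columns=train_cols, fill_value=0)
print(aligned.columns.tolist())
print(aligned)
```

In a daily job you would probably want to store the training column list alongside the model, so that each new batch can be reindexed without rebuilding df_dum.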

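Keep in mind that pd.get_dummies is usually much faster to execute than OneHotEncoder, but unlike a fitted encoder it does not remember the training categories, so you have to keep the training column list yourself.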
