I have a model that runs the following:
import pandas as pd
import numpy as np
# initialize list of lists
data = [['tom', 10,1,'a'], ['tom', 15,5,'a'], ['tom', 14,1,'a'], ['tom', 15,4,'b'], ['tom', 18,1,'b'], ['tom', 15,6,'a'], ['tom', 17,3,'a']
, ['tom', 14,7,'b'], ['tom',16 ,6,'a'], ['tom', 22,2,'a'],['matt', 10,1,'c'], ['matt', 15,5,'b'], ['matt', 14,1,'b'], ['matt', 15,4,'a'], ['matt', 18,1,'a'], ['matt', 15,6,'a'], ['matt', 17,3,'a']
, ['matt', 14,7,'c'], ['matt',16 ,6,'b'], ['matt', 10,2,'b']]
# Create the pandas DataFrame
df = pd.DataFrame(data, columns = ['Name', 'Attempts','Score','Category'])
print(df.head(2))
Name Attempts Score Category
0 tom 10 1 a
1 tom 15 5 a
Then I have created a dummy df to use in the model using the following code:
from sklearn.linear_model import LogisticRegression
df_dum = pd.get_dummies(df)
print(df_dum.head(2))
Attempts Score Name_matt Name_tom Category_a Category_b Category_c
0 10 1 0 1 1 0 0
1 15 5 0 1 1 0 0
Then I have created the following model:
#Model
X = df_dum.drop(('Score'),axis=1)
y = df_dum['Score'].values
#Training Size
train_size = int(X.shape[0]*.7)
X_train = X[:train_size]
X_test = X[train_size:]
y_train = y[:train_size]
y_test = y[train_size:]
#Fit Model
model = LogisticRegression(max_iter=1000)
model.fit(X_train,y_train)
#Send predictions back to dataframe
Z = model.predict(X_test)
zz = model.predict_proba(X_test)
df.loc[train_size:,'predictions']=Z
dfpredictions = df.dropna(subset=['predictions'])
print(dfpredictions)
Name Attempts Score Category predictions
14 matt 18 1 a 1.0
15 matt 15 6 a 1.0
16 matt 17 3 a 1.0
17 matt 14 7 c 1.0
18 matt 16 6 b 1.0
19 matt 10 2 b 1.0
Now I have new data which i would like to predict:
newdata = [['tom', 10,'a'], ['tom', 15,'a'], ['tom', 14,'a']]
newdf = pd.DataFrame(newdata, columns = ['Name', 'Attempts','Category'])
print(newdf)
Name Attempts Category
0 tom 10 a
1 tom 15 a
2 tom 14 a
Then create dummies and run prediction
newpredict = pd.get_dummies(newdf)
predict = model.predict(newpredict)
Output:
ValueError: X has 3 features per sample; expecting 6
Which makes sense because there are no categories b and c and no name called matt.
My question is how is the best way to set this model up given my new data wont always have the full set of columns used in the original data. Each day i have new data so I'm not quite sure of the most efficient and error free way.
This is an example data - my dataset has 2000 columns when running pd.get_dummies.
Let me explain Nicolas and BlueSkyz's recommendation a bit more in detail.
pd.get_dummies is useful when you are sure that there will not be any new categories for a specific categorical variable in production/new data set, e.g. Gender, Products, etc. based on your Company or Database's internal data classification/consistency rules.
However, for the majority of machine learning tasks where you can expect to have new categories in the future which were not used in model training, sklearn.OneHotEncoder should be the standard choice. handle_unknown parameter of sklearn.OneHotEncoder can be set to 'ignore' to do just that: ignore new categories when applying the encoder in future. From the documentation:
Whether to raise an error or ignore if an unknown categorical feature is present during transform (default is to raise). When this parameter is set to ‘ignore’ and an unknown category is encountered during transform, the resulting one-hot encoded columns for this feature will be all zeros. In the inverse transform, an unknown category will be denoted as None
The full flow based on LabelEncoding and OneHotEncoding for your example is as below:
# Create a categorical boolean mask
categorical_feature_mask = df.dtypes == object
# Filter out the categorical columns into a list for easy reference later on in case you have more than a couple categorical columns
categorical_cols = df.columns[categorical_feature_mask].tolist()
# Instantiate the OneHotEncoder Object
from sklearn.preprocessing import OneHotEncoder
ohe = OneHotEncoder(handle_unknown='ignore', sparse = False)
# Apply ohe on data
ohe.fit(df[categorical_cols])
cat_ohe = ohe.transform(df[categorical_cols])
#Create a Pandas DataFrame of the hot encoded column
ohe_df = pd.DataFrame(cat_ohe, columns = ohe.get_feature_names(input_features = categorical_cols))
#concat with original data and drop original columns
df_ohe = pd.concat([df, ohe_df], axis=1).drop(columns = categorical_cols, axis=1)
# The following code is for your newdf after training and testing on original df
# Apply ohe on newdf
cat_ohe_new = ohe.transform(newdf[categorical_cols])
#Create a Pandas DataFrame of the hot encoded column
ohe_df_new = pd.DataFrame(cat_ohe_new, columns = ohe.get_feature_names(input_features = categorical_cols))
#concat with original data and drop original columns
df_ohe_new = pd.concat([newdf, ohe_df_new], axis=1).drop(columns = categorical_cols, axis=1)
# predict on df_ohe_new
predict = model.predict(df_ohe_new)
Output (that you can assign back to newdf):
array([1, 1, 1])
However, if you really want to use pd.get_dummies only, then the following can work as well:
newpredict = newpredict.reindex(labels = df_dum.columns, axis = 1, fill_value = 0).drop(columns = ['Score'])
predict = model.predict(newpredict)
The above code snippet will make sure that you have the same columns in your new dummies df (newpredict) as the original df_dum (with 0 values) and drop the 'Score' column. Output here is same as above. This code will ensure that any categorical values present in the new data set but now in the original trained data will be removed while keeping the order of the columns same as that in the original df.
Keep in mind that pd.get_dummies is usually much faster to execute than sklearn.OneHotEncoder.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With