Hi guys. I was developing an ML model and ran into a question. Let's assume that my training data looks like this:
If I apply One-hot Encoding, it will generate the following matrix:
That's beautiful and works in most cases. But what if my test set contains fewer (or more) categories than the train set? What if my test set doesn't contain "Fish"? The encoding will then generate one fewer column.
Can you guys help me figure out how to handle this kind of problem?
Thank you
It sounds like you have your train and test sets completely separate. Here's a minimal example of how you might automatically add "missing" features to a given dataset:
import numpy as np
import pandas as pd
# Made-up training dataset
train = pd.DataFrame({'animal': ['cat', 'cat', 'dog', 'dog', 'fish', 'fish', 'bear'],
                      'age': [12, 13, 31, 12, 12, 32, 90]})
# Made-up test dataset (notice how two classes from train are missing entirely)
test = pd.DataFrame({'animal': ['fish', 'fish', 'dog'],
                     'age': [15, 62, 1]})
# Discrete column to be one-hot-encoded
col = 'animal'
# Create dummy variables for each level of `col`
train_animal_dummies = pd.get_dummies(train[col], prefix=col)
train = train.join(train_animal_dummies)
test_animal_dummies = pd.get_dummies(test[col], prefix=col)
test = test.join(test_animal_dummies)
# Find the difference in columns between the two datasets.
# This works in the trivial case; to limit it to just one feature, use:
# f = lambda c: col in c; feature_difference = set(filter(f, train)) - set(filter(f, test))
feature_difference = set(train) - set(test)
# Create a zero-filled matrix with one row per row of `test` and one column
# per category missing from `test` (i.e. the set difference between the
# relevant `train` and `test` columns)
feature_difference_df = pd.DataFrame(data=np.zeros((test.shape[0], len(feature_difference))),
                                     columns=list(feature_difference))
# Add the "missing" features back to `test`
test = test.join(feature_difference_df)
test goes from this:
   age animal  animal_dog  animal_fish
0   15   fish         0.0          1.0
1   62   fish         0.0          1.0
2    1    dog         1.0          0.0
To this:
   age animal  animal_dog  animal_fish  animal_cat  animal_bear
0   15   fish         0.0          1.0         0.0          0.0
1   62   fish         0.0          1.0         0.0          0.0
2    1    dog         1.0          0.0         0.0          0.0
Assuming each row can only be one animal, it's fine for us to add an all-zero animal_bear feature (a sort of "is-a-bear" test) to test: if there were any bears in test, they would have appeared in the animal column and produced that dummy column on their own.
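The same alignment can also be sketched more compactly with `DataFrame.reindex`, which forces `test` onto the training columns in a single step. This is only a minimal sketch on the made-up data above, not a replacement for the approach shown earlier. The names `train_encoded`, `test_encoded` and `test_aligned` are made up for illustration, and `dtype=int` is passed only so that all dummy columns share one dtype.

```python
import pandas as pd

# Same made-up datasets as above
train = pd.DataFrame({'animal': ['cat', 'cat', 'dog', 'dog', 'fish', 'fish', 'bear'],
                      'age': [12, 13, 31, 12, 12, 32, 90]})
test = pd.DataFrame({'animal': ['fish', 'fish', 'dog'],
                     'age': [15, 62, 1]})

# One-hot encode each dataset independently; the original `animal`
# column is replaced by its dummy columns
train_encoded = pd.get_dummies(train, columns=['animal'], prefix='animal', dtype=int)
test_encoded = pd.get_dummies(test, columns=['animal'], prefix='animal', dtype=int)

# Reindex `test` onto the training columns: columns missing from `test`
# are created and filled with 0, and columns that never appeared in
# `train` are dropped
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)

print(test_aligned)
```

Note that this also handles the opposite case from the question: a category that appears only in `test` is silently dropped, so the model always sees exactly the columns it was trained on.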
As a rule of thumb, it's a good idea to account for all possible features (i.e. all possible values of animal, for example) when building/training a model. As mentioned in the comments, some methods are better at handling missing data than others, but if you can enumerate every category from the outset, that's probably the safest option. That would be tough to do if you're accepting free-text input, since the number of possible inputs is unbounded.
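If the full set of categories is known ahead of time, one way to sketch this "account for everything from the outset" idea is to declare the levels on the column itself with a pandas categorical dtype. The list of levels below is an assumption made for illustration (it simply mirrors the animals in the made-up training data), and `animal_dtype` and `test_encoded` are hypothetical names.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Assumed to be known before training: every level `animal` may take
animal_dtype = CategoricalDtype(categories=['bear', 'cat', 'dog', 'fish'])

# Same made-up test dataset as above
test = pd.DataFrame({'animal': ['fish', 'fish', 'dog'],
                     'age': [15, 62, 1]})

# Casting to the categorical dtype records all four levels on the column,
# even though only two of them occur in this dataset
test['animal'] = test['animal'].astype(animal_dtype)

# get_dummies emits one column per declared level, so the unused
# `bear` and `cat` levels still get (all-zero) columns
test_encoded = pd.get_dummies(test, columns=['animal'], prefix='animal', dtype=int)

print(test_encoded)
```

Applying the same dtype to both train and test yields identical columns in both without any after-the-fact patching. A value outside the declared list becomes NaN when cast and is encoded as an all-zero row, which is worth keeping in mind for free-text input.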