Machine Learning - test set with fewer features than the train set

Question

Hi guys. I was developing an ML model and ran into a question. Let's assume my training data looks like this:

ID | Animal | Age | Habitat
0 | Fish | 2 | Sea
1 | Hawk | 1 | Mountain
2 | Fish | 3 | Sea
3 | Snake | 4 | Forest

If I apply One-hot Encoding, it will generate the following matrix:

ID | Animal_Fish | Animal_Hawk | Animal_Snake | Age | ...
0 | 1 | 0 | 0 | 2 | ...
1 | 0 | 1 | 0 | 1 | ...
2 | 1 | 0 | 0 | 3 | ...
3 | 0 | 0 | 1 | 4 | ...

That's beautiful and works in most cases. But what if my test set contains fewer (or more) features than the train set? What if my test set doesn't contain "Fish"? The encoding will then generate one fewer column.

Can you guys help me figure out how to handle this kind of problem?

Thank you

blacksite · Accepted Answer

It sounds like you have your train and test sets completely separate. Here's a minimal example of how you might automatically add "missing" features to a given dataset:

import numpy as np
import pandas as pd

# Made-up training dataset
train = pd.DataFrame({'animal': ['cat', 'cat', 'dog', 'dog', 'fish', 'fish', 'bear'],
                      'age': [12, 13, 31, 12, 12, 32, 90]})

# Made-up test dataset (notice how two classes from train are missing entirely)
test = pd.DataFrame({'animal': ['fish', 'fish', 'dog'],
                     'age': [15, 62, 1]})

# Discrete column to be one-hot-encoded
col = 'animal'

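# Create dummy variables for each level of `col`
# (dtype=float keeps the 0.0/1.0 values shown in the output below;
# recent pandas versions return booleans by default)
train_animal_dummies = pd.get_dummies(train[col], prefix=col, dtype=float)
train = train.join(train_animal_dummies)

test_animal_dummies = pd.get_dummies(test[col], prefix=col, dtype=float)
test = test.join(test_animal_dummies)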

# Find the columns that are present in train but missing from test
# This works in the trivial case; to restrict the comparison to just one feature, use:
# f = lambda c: col in c; feature_difference = set(filter(f, train)) - set(filter(f, test))
feature_difference = set(train) - set(test)

# Create a zero-filled matrix with one row per row of `test` and one column
# per category missing from `test` (i.e. the set difference between the
# relevant `train` and `test` columns)
feature_difference_df = pd.DataFrame(data=np.zeros((test.shape[0], len(feature_difference))),
                                     index=test.index,
                                     columns=list(feature_difference))

# Add the "missing" features back to `test`
test = test.join(feature_difference_df)

test goes from this:

   age animal  animal_dog  animal_fish
0   15   fish         0.0          1.0
1   62   fish         0.0          1.0
2    1    dog         1.0          0.0

To this:

   age animal  animal_dog  animal_fish  animal_cat  animal_bear
0   15   fish         0.0          1.0         0.0          0.0
1   62   fish         0.0          1.0         0.0          0.0
2    1    dog         1.0          0.0         0.0          0.0
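The question also asks about the opposite case, where the test set has a category that train never saw. The code above only adds missing columns, so here is a minimal sketch of one way to cover both directions, assuming you still have the encoded training frame (or at least its column list) at prediction time. It uses `DataFrame.reindex` instead of building the zero matrix by hand; the `hawk` row and the `*_encoded` / `test_aligned` names are made up for illustration.

```python
import pandas as pd

train = pd.DataFrame({'animal': ['cat', 'cat', 'dog', 'dog', 'fish', 'fish', 'bear'],
                      'age': [12, 13, 31, 12, 12, 32, 90]})

# Made-up test set: 'cat' and 'bear' are missing, and 'hawk' was never seen in train
test = pd.DataFrame({'animal': ['fish', 'hawk', 'dog'],
                     'age': [15, 62, 1]})

# One-hot encode `animal` in both frames (the original column is replaced by the dummies)
train_encoded = pd.get_dummies(train, columns=['animal'], dtype=float)
test_encoded = pd.get_dummies(test, columns=['animal'], dtype=float)

# Force test to have exactly train's columns, in train's order:
# columns missing from test are added and filled with 0.0,
# columns that exist only in test (animal_hawk) are dropped
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0.0)

print(test_aligned)
```

Note that the `hawk` row ends up with zeros in every `animal_*` column, so the model sees it as "none of the known animals". Whether that is acceptable depends on your model and data.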

Assuming each row (each animal) can only be one animal, it's fine for us to add an animal_bear feature (a sort-of "is-a-bear" test/feature) because of the assumption that if there were any bears in test, that information would have been accounted for in the animal column.

As a rule of thumb, it's a good idea to try to account for all possible features (i.e. all possible values of animal, for example) when building/training a model. As mentioned in the comments, some methods are better at handling missing data than others, but if you can do it all from the outset, that's probably a good idea. Now, that would be tough to do if you're accepting free-text input (as the number of possible inputs is never-ending).
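If you do know the full set of values up front, one possible way to "account for all possible features" is to declare them as a pandas categorical before encoding, so `get_dummies` emits a column for every declared category whether or not it appears in the data. This is only a sketch: the `ANIMAL_CATEGORIES` list and the `encode` helper are hypothetical names, and the list of animals is assumed rather than taken from any real dataset.

```python
import pandas as pd

# Hypothetical: the complete list of values `animal` is allowed to take
ANIMAL_CATEGORIES = ['bear', 'cat', 'dog', 'fish']
animal_dtype = pd.CategoricalDtype(categories=ANIMAL_CATEGORIES)


def encode(df):
    """One-hot encode `animal` against the fixed category list (hypothetical helper)."""
    out = df.copy()
    # Values outside ANIMAL_CATEGORIES become NaN and get all-zero dummy columns
    out['animal'] = out['animal'].astype(animal_dtype)
    return pd.get_dummies(out, columns=['animal'], dtype=float)


train = pd.DataFrame({'animal': ['cat', 'cat', 'dog', 'dog', 'fish', 'fish', 'bear'],
                      'age': [12, 13, 31, 12, 12, 32, 90]})
test = pd.DataFrame({'animal': ['fish', 'fish', 'dog'],
                     'age': [15, 62, 1]})

train_encoded = encode(train)
test_encoded = encode(test)

# Both frames now share the same columns, even though test has no cats or bears
print(list(train_encoded.columns) == list(test_encoded.columns))
```

The upside of this design is that train and test never need to see each other: the category list is the single source of truth for the column layout.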

Tags: python, machine-learning

Paulo Henrique Vasconcellos
