 

How to encode a categorical feature with high cardinality?

I'm stuck with a dataset that contains some categorical features with high cardinality, like 'item_description'. I read about a trick called hashing, but its main idea is still blurry to me. I also read about a library called 'Feature-engine', but I didn't really find anything there that solves my issue. Any suggestions, please?

MBA asked Oct 27 '25


2 Answers

Options:

i) Use Target encoding.

  • More on target encoding: https://maxhalford.github.io/blog/target-encoding/

  • Good tutorial on categorical variables here: https://www.coursera.org/learn/competitive-data-science#syllabus [Section: Feature Preprocessing and Generation with Respect to Models , 3rd Video]
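The target-encoding idea in option (i) can be sketched in a few lines of pandas. This is a minimal illustration with additive smoothing toward the global mean (the toy data and the smoothing weight `m` are my own, not from the linked posts):

```python
import pandas as pd

# Toy data; 'item_description' stands in for a high-cardinality column.
df = pd.DataFrame({
    "item_description": ["mug", "mug", "pen", "pen", "pen", "lamp"],
    "target":           [1,     0,     1,     1,     0,     1],
})

global_mean = df["target"].mean()
stats = df.groupby("item_description")["target"].agg(["mean", "count"])

# Smooth rare categories toward the global mean (m controls the strength).
m = 2.0
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["item_te"] = df["item_description"].map(smoothed)
```

With smoothing, a category seen only once (like "lamp") is pulled toward the global mean instead of being replaced by its raw target value.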

ii) Use entity embeddings: in short, this technique represents each category as a vector whose values are learned during training so that they capture the characteristics of that category.

  • Tutorial: https://towardsdatascience.com/deep-learning-structured-data-8d6a278f3088

  • Notebook implementations:

    1. https://www.kaggle.com/aquatic/entity-embedding-neural-net
    2. https://www.kaggle.com/abhishek/same-old-entity-embeddings
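The core mechanic behind entity embeddings, stripped of any deep-learning framework, is a lookup table of per-category vectors trained by gradient descent. Below is a minimal NumPy sketch (my own illustration, not taken from the linked notebooks) that learns a 3-dimensional vector per category jointly with a logistic output on a synthetic target:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cat, emb_dim, n = 10, 3, 1000
codes = rng.integers(0, n_cat, n)          # integer-encoded category per sample
y = (codes % 2 == 0).astype(float)          # synthetic binary target

E = rng.normal(0.0, 0.1, (n_cat, emb_dim))  # embedding table: one row per category
w = rng.normal(0.0, 0.1, emb_dim)           # logistic output weights
b = 0.0
lr = 1.0

def forward():
    z = np.clip(E[codes] @ w + b, -30, 30)  # clip for numerical stability
    p = 1.0 / (1.0 + np.exp(-z))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return p, loss

p, loss0 = forward()
for _ in range(3000):
    p, _ = forward()
    g = (p - y) / n                          # dLoss/dz for log loss
    gE = np.zeros_like(E)
    np.add.at(gE, codes, np.outer(g, w))     # scatter gradients to used rows
    gw = E[codes].T @ g
    gb = g.sum()
    E -= lr * gE
    w -= lr * gw
    b -= lr * gb

p, loss1 = forward()
acc = float(((p > 0.5) == (y > 0.5)).mean())
```

After training, `E` itself can be reused: rows of the table serve as dense features for any downstream model.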

iii) Use CatBoost, which handles categorical features natively:

  • Tutorial: https://www.kaggle.com/mitribunskiy/tutorial-catboost-overview/notebook

Extra: there is also the hashing trick, which might be helpful: https://booking.ai/dont-be-tricked-by-the-hashing-trick-192a6aae3087?gi=3045c6e13ee5
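The hashing trick is simpler than it sounds: each category string is hashed into one of a fixed number of buckets, so no dictionary of seen categories is needed and unseen values are handled for free. A minimal sketch (the md5-based bucketing is my own illustration; in practice `sklearn.feature_extraction.FeatureHasher` does this for you):

```python
import hashlib

import numpy as np

def bucket(value: str, n_buckets: int = 32) -> int:
    # md5 gives a stable hash across runs (Python's built-in hash() is salted)
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16) % n_buckets

def hash_encode(values, n_buckets=32):
    # One-hot into a fixed number of buckets; no vocabulary is stored
    X = np.zeros((len(values), n_buckets))
    for i, v in enumerate(values):
        X[i, bucket(v, n_buckets)] += 1.0
    return X

X = hash_encode(["mug", "pen", "mug", "never seen before"])
```

Note that a value never seen during training still maps to a valid column, which is the main advantage over dictionary-based encoders.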

avvinci answered Oct 30 '25


This Medium article I wrote might help as well: 4 ways to encode categorical features with high cardinality. It explores four encoding methods applied to a dataset with 26 categorical features with cardinalities up to 40k (includes code):

Target encoding

  • PROS: parameter free; no increase in feature space
  • CONS: risk of target leakage (using information from the target to predict the target itself); when a category has few samples, the encoder replaces it with a value very close to its target, which makes the model prone to overfitting the training set; does not accept new values in the test set
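A common mitigation for the leakage risk is out-of-fold target encoding: each row is encoded using category means computed only on the other folds. A sketch assuming scikit-learn is available (the toy data is my own illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "cat": ["a", "a", "b", "b", "a", "b", "a", "b"],
    "y":   [1,   0,   1,   1,   0,   1,   1,   0],
})

global_mean = df["y"].mean()
df["cat_te"] = np.nan
for train_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
    # Means computed on the training folds only, never on the row being encoded
    fold_means = df.iloc[train_idx].groupby("cat")["y"].mean()
    df.loc[df.index[val_idx], "cat_te"] = (
        df.iloc[val_idx]["cat"].map(fold_means).fillna(global_mean).values
    )
```

Unseen-category rows in a fold fall back to the global mean via `fillna`, which also suggests how to handle new values at test time.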

Count encoding

  • PROS: easy to understand and implement; parameter free; no increase in feature space
  • CONS: risk of information loss when collisions happen (two categories with the same frequency become indistinguishable); can be too simplistic (the only information we keep from the categorical feature is its frequency); does not accept new values in the test set
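Count encoding is essentially a two-liner in pandas; a minimal illustration (toy data my own):

```python
import pandas as pd

df = pd.DataFrame({"cat": ["a", "a", "b", "c", "c", "c"]})
counts = df["cat"].value_counts()        # frequency of each category
df["cat_count"] = df["cat"].map(counts)  # replace each value by its frequency
```

The "collision" con is visible here: if two categories happened to appear the same number of times, they would receive identical encodings.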

Feature hashing

  • PROS: limited increase of feature space (compared to one-hot encoding); does not grow in size and accepts new values during inference, as it does not maintain a dictionary of observed categories; captures interactions between features when feature hashing is applied to all categorical features combined to create a single hash
  • CONS: need to tune the parameter of hashing space dimension; risk of collision when the dimension of hashing space is not big enough
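The collision risk mentioned above can be checked empirically: hash a set of distinct categories into buckets of various sizes and count how many fail to get their own bucket. A small sketch (md5-based bucketing and the category names are my own illustration):

```python
import hashlib

def bucket(value: str, n_buckets: int) -> int:
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16) % n_buckets

categories = [f"item_{i}" for i in range(1000)]
results = {}
for n_buckets in (256, 4096, 65536):
    used = len({bucket(c, n_buckets) for c in categories})
    # Number of categories that had to share a bucket with another one
    results[n_buckets] = len(categories) - used
```

As the hashing dimension grows past the number of distinct categories, collisions drop off quickly, which is the trade-off you tune against memory and model size.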

Embedding

  • PROS: limited increase of feature space (as compared to one hot encoding); accepts new values during inference; captures interactions between features and learns the similarities between categories
  • CONS: need to tune the embedding size parameter; the embeddings cannot be trained jointly with a model that does not learn by backpropagation (such as a decision forest); in that case the embeddings have to be trained in an initial phase (typically inside a neural network) and then used as static inputs to the downstream model.
Aicha BOKBOT answered Oct 29 '25


