 

How to encode a categorical feature with high cardinality?

I'm stuck with a dataset that contains some categorical features with high cardinality, like 'item_description'. I read about a trick called hashing, but its main idea is still blurry to me. I also read about a library called 'Feature-engine', but I didn't really find anything there that solves my issue. Any suggestions, please?

MBA asked Oct 27 '25


2 Answers

Options:

i) Use Target encoding.

  • More on target encoding: https://maxhalford.github.io/blog/target-encoding/

  • Good tutorial on categorical variables here: https://www.coursera.org/learn/competitive-data-science#syllabus [Section: Feature Preprocessing and Generation with Respect to Models , 3rd Video]
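The target-encoding idea in option (i) can be sketched in a few lines of pandas. This is a minimal illustration with additive smoothing toward the global mean (the toy data and the smoothing weight `m` are my own, not from the linked posts):

```python
import pandas as pd

# Toy data; 'item_description' stands in for a high-cardinality column.
df = pd.DataFrame({
    "item_description": ["mug", "mug", "pen", "pen", "pen", "lamp"],
    "target":           [1,     0,     1,     1,     0,     1],
})

global_mean = df["target"].mean()
stats = df.groupby("item_description")["target"].agg(["mean", "count"])

# Smooth rare categories toward the global mean (m controls the strength).
m = 2.0
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)
df["item_te"] = df["item_description"].map(smoothed)
```

With smoothing, a category seen only once (like "lamp") is pulled toward the global mean instead of being replaced by its raw target value.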

ii) Use entity embeddings: in short, this technique represents each category as a vector whose values are learned during training so that they capture the characteristics of that category.

  • Tutorial: https://towardsdatascience.com/deep-learning-structured-data-8d6a278f3088

  • Notebook implementations:

    1. https://www.kaggle.com/aquatic/entity-embedding-neural-net
    2. https://www.kaggle.com/abhishek/same-old-entity-embeddings
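The core mechanic behind entity embeddings, stripped of any deep-learning framework, is a lookup table of per-category vectors trained by gradient descent. Below is a minimal NumPy sketch (my own illustration, not taken from the linked notebooks) that learns a 3-dimensional vector per category jointly with a logistic output on a synthetic target:

```python
import numpy as np

rng = np.random.default_rng(0)
n_cat, emb_dim, n = 10, 3, 1000
codes = rng.integers(0, n_cat, n)          # integer-encoded category per sample
y = (codes % 2 == 0).astype(float)          # synthetic binary target

E = rng.normal(0.0, 0.1, (n_cat, emb_dim))  # embedding table: one row per category
w = rng.normal(0.0, 0.1, emb_dim)           # logistic output weights
b = 0.0
lr = 1.0

def forward():
    z = np.clip(E[codes] @ w + b, -30, 30)  # clip for numerical stability
    p = 1.0 / (1.0 + np.exp(-z))
    loss = -np.mean(y * np.log(p + 1e-12) + (1 - y) * np.log(1 - p + 1e-12))
    return p, loss

p, loss0 = forward()
for _ in range(3000):
    p, _ = forward()
    g = (p - y) / n                          # dLoss/dz for log loss
    gE = np.zeros_like(E)
    np.add.at(gE, codes, np.outer(g, w))     # scatter gradients to used rows
    gw = E[codes].T @ g
    gb = g.sum()
    E -= lr * gE
    w -= lr * gw
    b -= lr * gb

p, loss1 = forward()
acc = float(((p > 0.5) == (y > 0.5)).mean())
```

After training, `E` itself can be reused: rows of the table serve as dense features for any downstream model.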

iii) Use CatBoost, which handles categorical features natively:

  • Tutorial: https://www.kaggle.com/mitribunskiy/tutorial-catboost-overview/notebook

Extra: there is also the hashing trick, which might be helpful: https://booking.ai/dont-be-tricked-by-the-hashing-trick-192a6aae3087?gi=3045c6e13ee5
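The hashing trick is simpler than it sounds: each category string is hashed into one of a fixed number of buckets, so no dictionary of seen categories is needed and unseen values are handled for free. A minimal sketch (the md5-based bucketing is my own illustration; in practice `sklearn.feature_extraction.FeatureHasher` does this for you):

```python
import hashlib

import numpy as np

def bucket(value: str, n_buckets: int = 32) -> int:
    # md5 gives a stable hash across runs (Python's built-in hash() is salted)
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16) % n_buckets

def hash_encode(values, n_buckets=32):
    # One-hot into a fixed number of buckets; no vocabulary is stored
    X = np.zeros((len(values), n_buckets))
    for i, v in enumerate(values):
        X[i, bucket(v, n_buckets)] += 1.0
    return X

X = hash_encode(["mug", "pen", "mug", "never seen before"])
```

Note that a value never seen during training still maps to a valid column, which is the main advantage over dictionary-based encoders.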

avvinci answered Oct 30 '25


This Medium article I wrote might help as well: 4 ways to encode categorical features with high cardinality. It explores four encoding methods applied to a dataset with 26 categorical features with cardinalities up to 40k (includes code):

Target encoding

  • PROS: parameter free; no increase in feature space
  • CONS: risk of target leakage (using information from the target to predict the target itself); when a category has few samples, the encoder replaces it with a value very close to its target, which makes the model prone to overfitting the training set; does not accept new values in the test set
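A common mitigation for the leakage risk is out-of-fold target encoding: each row is encoded using category means computed only on the other folds. A sketch assuming scikit-learn is available (the toy data is my own illustration):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import KFold

df = pd.DataFrame({
    "cat": ["a", "a", "b", "b", "a", "b", "a", "b"],
    "y":   [1,   0,   1,   1,   0,   1,   1,   0],
})

global_mean = df["y"].mean()
df["cat_te"] = np.nan
for train_idx, val_idx in KFold(n_splits=4, shuffle=True, random_state=0).split(df):
    # Means computed on the training folds only, never on the row being encoded
    fold_means = df.iloc[train_idx].groupby("cat")["y"].mean()
    df.loc[df.index[val_idx], "cat_te"] = (
        df.iloc[val_idx]["cat"].map(fold_means).fillna(global_mean).values
    )
```

Unseen-category rows in a fold fall back to the global mean via `fillna`, which also suggests how to handle new values at test time.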

Count encoding

  • PROS: easy to understand and implement; parameter free; no increase in feature space
  • CONS: risk of information loss when collisions happen (two categories with the same frequency become indistinguishable); can be too simplistic (the only information we keep from the categorical feature is its frequency); does not accept new values in the test set
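Count encoding is essentially a two-liner in pandas; a minimal illustration (toy data my own):

```python
import pandas as pd

df = pd.DataFrame({"cat": ["a", "a", "b", "c", "c", "c"]})
counts = df["cat"].value_counts()        # frequency of each category
df["cat_count"] = df["cat"].map(counts)  # replace each value by its frequency
```

The "collision" con is visible here: if two categories happened to appear the same number of times, they would receive identical encodings.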

Feature hashing

  • PROS: limited increase of feature space (compared to one-hot encoding); does not grow in size and accepts new values during inference, as it does not maintain a dictionary of observed categories; captures interactions between features when feature hashing is applied to all categorical features combined to create a single hash
  • CONS: need to tune the parameter of hashing space dimension; risk of collision when the dimension of hashing space is not big enough
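The collision risk mentioned above can be checked empirically: hash a set of distinct categories into buckets of various sizes and count how many fail to get their own bucket. A small sketch (md5-based bucketing and the category names are my own illustration):

```python
import hashlib

def bucket(value: str, n_buckets: int) -> int:
    return int(hashlib.md5(value.encode("utf-8")).hexdigest(), 16) % n_buckets

categories = [f"item_{i}" for i in range(1000)]
results = {}
for n_buckets in (256, 4096, 65536):
    used = len({bucket(c, n_buckets) for c in categories})
    # Number of categories that had to share a bucket with another one
    results[n_buckets] = len(categories) - used
```

As the hashing dimension grows past the number of distinct categories, collisions drop off quickly, which is the trade-off you tune against memory and model size.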

Embedding

  • PROS: limited increase of feature space (as compared to one hot encoding); accepts new values during inference; captures interactions between features and learns the similarities between categories
  • CONS: need to tune the embedding size parameter; the embeddings cannot be trained jointly with a model that does not learn by backpropagation (such as a decision forest); in that case the embeddings have to be trained in an initial phase (typically inside a neural network) and then used as static inputs to the downstream model.
Aicha BOKBOT answered Oct 29 '25


