I'm stuck with a dataset that contains some categorical features with high cardinality, like 'item_description'. I read about a trick called hashing, but its main idea is still blurry to me. I also looked at the 'Feature-engine' library but didn't find anything there that solves my issue. Any suggestions, please?
Options:
i) Use target encoding (a minimal sketch follows below).
More on target encoding: https://maxhalford.github.io/blog/target-encoding/
Good tutorial on categorical variables here: https://www.coursera.org/learn/competitive-data-science#syllabus [Section: Feature Preprocessing and Generation with Respect to Models, 3rd video]
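A minimal target-encoding sketch in pandas with additive smoothing; the column name 'item_description', the toy data, and the smoothing weight are just assumptions for illustration. In practice, compute the encoding inside cross-validation folds to avoid target leakage:

```python
import pandas as pd

# Toy training data; 'item_description' stands in for your high-cardinality column.
train = pd.DataFrame({
    "item_description": ["red shirt", "blue jeans", "red shirt", "green hat", "blue jeans"],
    "target": [1, 0, 1, 0, 1],
})

global_mean = train["target"].mean()
stats = train.groupby("item_description")["target"].agg(["mean", "count"])

# Additive smoothing: rare categories get pulled toward the global mean.
m = 10  # smoothing weight (assumption; tune it)
smoothed = (stats["count"] * stats["mean"] + m * global_mean) / (stats["count"] + m)

# Map the learned encoding back onto the data; unseen categories fall back to the global mean.
train["item_description_te"] = train["item_description"].map(smoothed)
test = pd.DataFrame({"item_description": ["red shirt", "unseen item"]})
test["item_description_te"] = test["item_description"].map(smoothed).fillna(global_mean)
print(train, test, sep="\n")
```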
ii) Use entity embeddings: in short, this technique represents each category as a dense vector whose values are learned during training, so the vector ends up capturing the characteristics of that category (a sketch follows after the links).
Tutorial: https://towardsdatascience.com/deep-learning-structured-data-8d6a278f3088
Notebook implementations:
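A minimal sketch of the idea in PyTorch (the layer sizes, cardinality, and column setup are assumptions; Keras works the same way with an Embedding layer):

```python
import torch
import torch.nn as nn

# Suppose 'item_description' has been label-encoded to integers 0..n_categories-1.
n_categories = 40_000   # cardinality of the categorical column (assumption)
embedding_dim = 32      # size of the learned vector per category (assumption)

class TabularNet(nn.Module):
    def __init__(self, n_categories, embedding_dim, n_numeric):
        super().__init__()
        # One row per category; the rows are trained jointly with the rest of the network.
        self.emb = nn.Embedding(n_categories, embedding_dim)
        self.head = nn.Sequential(
            nn.Linear(embedding_dim + n_numeric, 64),
            nn.ReLU(),
            nn.Linear(64, 1),
        )

    def forward(self, cat_idx, numeric):
        # Look up the learned vector for each category and concatenate with numeric features.
        x = torch.cat([self.emb(cat_idx), numeric], dim=1)
        return self.head(x)

model = TabularNet(n_categories, embedding_dim, n_numeric=3)
cat_idx = torch.randint(0, n_categories, (8,))   # batch of encoded categories
numeric = torch.randn(8, 3)                      # batch of numeric features
out = model(cat_idx, numeric)                    # train with any loss/optimizer as usual
print(out.shape)  # torch.Size([8, 1])
```

After training, model.emb.weight holds one learned vector per category; those vectors can also be extracted and reused as features for other models.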
iii) Use CatBoost: it handles categorical features natively (via ordered target statistics), so you can pass high-cardinality columns to it directly without manual encoding. A minimal sketch:
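The toy data and parameters below are made up; the key point is the cat_features argument, which tells CatBoost which columns to treat as categorical so it encodes them internally:

```python
import pandas as pd
from catboost import CatBoostClassifier

# Toy data; 'item_description' stands in for the high-cardinality column.
X = pd.DataFrame({
    "item_description": ["red shirt", "blue jeans", "green hat", "red shirt"],
    "price": [10.0, 25.0, 7.5, 12.0],
})
y = [1, 0, 0, 1]

model = CatBoostClassifier(iterations=200, verbose=False)
# Pass the categorical columns by name (or index); no manual encoding needed.
model.fit(X, y, cat_features=["item_description"])
print(model.predict(X))
```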
Extra: there is also the hashing trick, which maps each category to one of a fixed number of columns with a hash function instead of building a full dictionary; it might also be helpful: https://booking.ai/dont-be-tricked-by-the-hashing-trick-192a6aae3087?gi=3045c6e13ee5
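Because the output size is fixed, memory stays bounded no matter how many categories show up, and unseen categories at prediction time are handled for free (at the cost of occasional collisions). A small sketch with scikit-learn's FeatureHasher; n_features is an assumption to tune:

```python
from sklearn.feature_extraction import FeatureHasher

items = ["red shirt", "blue jeans", "red shirt", "some category never seen in training"]

# Each sample is an iterable of strings; here, one categorical value per sample.
hasher = FeatureHasher(n_features=16, input_type="string")
X_hashed = hasher.transform([[v] for v in items])  # sparse matrix, shape (4, 16)

print(X_hashed.shape)
print(X_hashed.toarray())
```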
This Medium article I wrote might help as well: "4 ways to encode categorical features with high cardinality". It explores four encoding methods applied to a dataset with 26 categorical features whose cardinalities go up to 40k (includes code); since count encoding is the only one not sketched above, a quick example follows the list:
Target encoding
Count encoding
Feature hashing
Embedding
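Count (frequency) encoding simply replaces each category with how often it occurs in the training data. A minimal pandas sketch, with 'item_description' again just a placeholder name:

```python
import pandas as pd

train = pd.DataFrame({"item_description": ["red shirt", "blue jeans", "red shirt", "green hat"]})
test = pd.DataFrame({"item_description": ["red shirt", "unseen item"]})

# Learn the counts on the training data only, then map them onto both sets.
counts = train["item_description"].value_counts()
train["item_description_count"] = train["item_description"].map(counts)
test["item_description_count"] = test["item_description"].map(counts).fillna(0)
print(train, test, sep="\n")
```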