I'm using shap.utils.hclust to figure out which features are redundant, following the documentation.
Reproducible example:
import pandas as pd
import numpy as np
import shap
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
from lightgbm import LGBMClassifier
data = pd.read_csv("https://raw.githubusercontent.com/gdmarmerola/random-stuff/master/probability_calibration/UCI_Credit_Card.csv")
# getting design matrix and target
X = data.copy().drop(['ID','default.payment.next.month'], axis=1)
y = data.copy()['default.payment.next.month']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
model = LGBMClassifier(random_state=42).fit(X_train, y_train)
# compute SHAP values
explainer = shap.Explainer(model, X_test)
shap_values = explainer(X_test)
clustering = shap.utils.hclust(X_test, y_test) # by default this trains (X.shape[1] choose 2) 2-feature XGBoost models
shap.plots.bar(shap_values, clustering=clustering)
It produces the following plot:

My questions are:
In the implementation, why is an XGBRegressor used even for classification tasks?
How can I use clustering to remove redundant features, beyond visual inspection of the bar plot?
UPDATE:
My main question is:
In this toy example, how can I use the (22, 4) output matrix to check which features are in the same cluster, and thus reduce the dimensionality? I have a data frame with more than 10,000 features, which is why visual inspection is not feasible.
array([[15. , 16. , 0.2877219 , 2. ],
[12. , 13. , 0.36595157, 2. ],
[11. , 24. , 0.37372008, 3. ],
[14. , 25. , 0.420607 , 4. ],
[23. , 26. , 0.43781072, 6. ],
[ 9. , 10. , 0.45111704, 2. ],
[21. , 27. , 0.50203449, 7. ],
[20. , 29. , 0.51782125, 8. ],
[18. , 30. , 0.52462131, 9. ],
[17. , 31. , 0.52700263, 10. ],
[ 8. , 28. , 0.52802497, 3. ],
[19. , 32. , 0.54064447, 11. ],
[ 5. , 6. , 0.56145751, 2. ],
[ 7. , 33. , 0.57828146, 4. ],
[35. , 36. , 0.62561315, 6. ],
[34. , 37. , 0.66345358, 17. ],
[22. , 38. , 0.6892271 , 18. ],
[ 0. , 39. , 0.76330948, 19. ],
[ 4. , 40. , 0.91275334, 20. ],
[ 2. , 41. , 0.94387454, 21. ],
[ 1. , 42. , 0.98299891, 22. ],
[ 3. , 43. , 0.98913395, 23. ]])
Distances are measured by training univariate XGBoost models of y for all the features, and then predicting the output of these models using univariate XGBoost models of other features. If one feature can effectively predict the output of another feature's univariate XGBoost model of y, then the second feature is redundant with the first with respect to y. A distance of 1 corresponds to no redundancy while a distance of 0 corresponds to perfect redundancy (measured using the proportion of variance explained). Note these distances are not symmetric.
In plain English, they add features one by one and check whether the added feature contributes predictive power in raw space, as measured by R². If it does, the distance is large (different cluster); if it doesn't, the distance is small (same cluster).
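If it helps to see the idea concretely, here is a rough sketch of that pairwise redundancy distance. This is only an illustration of the description above, not the actual shap.utils.hclust implementation; the default XGBoost settings and the max(0, R²) clamping are my own assumptions:

# Illustrative sketch only -- NOT the shap.utils.hclust implementation
from xgboost import XGBRegressor
from sklearn.metrics import r2_score

def redundancy_distance(x_a, x_b, y):
    # univariate XGBoost model of y on feature a
    model_a = XGBRegressor().fit(x_a.reshape(-1, 1), y)
    preds_a = model_a.predict(x_a.reshape(-1, 1))
    # try to reproduce model_a's output using feature b alone
    model_b = XGBRegressor().fit(x_b.reshape(-1, 1), preds_a)
    preds_b = model_b.predict(x_b.reshape(-1, 1))
    # proportion of variance explained: 0 = perfectly redundant, 1 = not redundant
    return 1.0 - max(0.0, r2_score(preds_a, preds_b))

# e.g. how redundant BILL_AMT2 is with BILL_AMT1, with respect to y
d = redundancy_distance(X_test["BILL_AMT1"].values, X_test["BILL_AMT2"].values, y_test.values)

(This also hints at why a regressor shows up even for classification: the second model is predicting another model's continuous output, not class labels.)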
In general, I would use SHAP for model or data explanation ("true to model" or "true to data") and for experiment planning. I wouldn't expect it to help with feature selection while tuning models on already collected data. But it may depend on your particular objective, e.g. producing the most parsimonious model, or adjusting your data if you suspect overfitting.
UPDATE
from scipy.cluster import hierarchy
# the clustering returned by shap.utils.hclust is a SciPy linkage matrix, so SciPy's hierarchy tools work on it directly
hierarchy.dendrogram(clustering, labels=X_train.columns);

hierarchy.cut_tree(clustering, n_clusters=5)
or
hierarchy.cut_tree(clustering, height=.5)
The output is an array of cluster labels, one per feature. Use whichever of the two calls you prefer.
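To turn those labels into an actual dimensionality reduction, you can group the column names by cluster label and keep one feature per cluster. A minimal sketch (keeping the first feature of each cluster is an arbitrary choice for illustration; you might instead keep the feature with the highest mean |SHAP| value in each cluster):

from scipy.cluster import hierarchy

# one flat cluster label per feature, in the same order as X_train.columns
labels = hierarchy.cut_tree(clustering, height=.5).ravel()

# group the feature names by their cluster label
clusters = {}
for feature, label in zip(X_train.columns, labels):
    clusters.setdefault(label, []).append(feature)

# keep a single (arbitrary) representative per cluster and subset the data
keep = [features[0] for features in clusters.values()]
X_train_reduced = X_train[keep]
X_test_reduced = X_test[keep]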