How to filter redundant features using shap.utils.hclust not only by visual inspection barplot?

I'm using shap.utils.hclust to figure out which features are redundant, following the documentation.

Reproducible example:

import pandas as pd
import numpy as np
import shap

from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score

from lightgbm import LGBMClassifier

data = pd.read_csv("https://raw.githubusercontent.com/gdmarmerola/random-stuff/master/probability_calibration/UCI_Credit_Card.csv")

# getting design matrix and target
X = data.copy().drop(['ID','default.payment.next.month'], axis=1)
y = data.copy()['default.payment.next.month']

X_train, X_test, y_train, y_test = train_test_split(X,y,test_size = .2, random_state = 42)

model = LGBMClassifier(random_state = 42).fit(X_train, y_train)

# compute SHAP values
explainer = shap.Explainer(model, X_test)
shap_values = explainer(X_test)

clustering = shap.utils.hclust(X_test, y_test) # by default this trains (X.shape[1] choose 2) 2-feature XGBoost models
shap.plots.bar(shap_values, clustering=clustering)

This produces the following plot:

[SHAP bar plot with the feature clustering shown alongside the bars]

My questions are:

  1. In the implementation, why is an XGBRegressor used even for classification tasks?

  2. How can I use clustering to remove redundant features, beyond visual inspection of the bar plot?

UPDATE:

My main question is:

In this toy example, how can I use the output matrix of shape (22, 4) to check which features are in the same cluster, so that I can reduce the dimensionality? I have a data frame with more than 10,000 features, which is why visual inspection is not feasible.

array([[15.        , 16.        ,  0.2877219 ,  2.        ],
       [12.        , 13.        ,  0.36595157,  2.        ],
       [11.        , 24.        ,  0.37372008,  3.        ],
       [14.        , 25.        ,  0.420607  ,  4.        ],
       [23.        , 26.        ,  0.43781072,  6.        ],
       [ 9.        , 10.        ,  0.45111704,  2.        ],
       [21.        , 27.        ,  0.50203449,  7.        ],
       [20.        , 29.        ,  0.51782125,  8.        ],
       [18.        , 30.        ,  0.52462131,  9.        ],
       [17.        , 31.        ,  0.52700263, 10.        ],
       [ 8.        , 28.        ,  0.52802497,  3.        ],
       [19.        , 32.        ,  0.54064447, 11.        ],
       [ 5.        ,  6.        ,  0.56145751,  2.        ],
       [ 7.        , 33.        ,  0.57828146,  4.        ],
       [35.        , 36.        ,  0.62561315,  6.        ],
       [34.        , 37.        ,  0.66345358, 17.        ],
       [22.        , 38.        ,  0.6892271 , 18.        ],
       [ 0.        , 39.        ,  0.76330948, 19.        ],
       [ 4.        , 40.        ,  0.91275334, 20.        ],
       [ 2.        , 41.        ,  0.94387454, 21.        ],
       [ 1.        , 42.        ,  0.98299891, 22.        ],
       [ 3.        , 43.        ,  0.98913395, 23.        ]])
asked by Multivac

1 Answer

  1. Under the hood, even tree models for classification are fit as regression tasks: they produce scores in what SHAP calls the "raw" output space (what TensorFlow would call logits), and a sigmoid or softmax converts raw scores to probabilities. So, answering your first question, the documentation says:
Distances are measured by training univariate XGBoost models of y for all the features, and then predicting the output of these models using univariate XGBoost models of other features. If one feature can effectively predict the output of another feature's univariate XGBoost model of y, then the second feature is redundant with the first with respect to y. A distance of 1 corresponds to no redundancy while a distance of 0 corresponds to perfect redundancy (measured using the proportion of variance explained). Note these distances are not symmetric.

In plain English, for each pair of features they check whether one feature can reproduce the other feature's predictive power in raw space, measured by R². If it cannot, the distance is large (different cluster); if it can, the distance is small (same cluster).
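
To make the quoted description concrete, here is a minimal sketch of that pairwise distance for a single feature pair. This is not the actual shap.utils.hclust code (which handles all pairs at once and uses its own XGBoost settings); the helper and hyperparameters below are assumptions for illustration only.

import xgboost
from sklearn.metrics import r2_score

def redundancy_distance(x_i, x_j, y):
    # 1 - R^2 of predicting feature i's univariate model-of-y output from feature j
    # (illustrative only; n_estimators etc. are assumptions, not shap's settings)
    x_i, x_j = x_i.reshape(-1, 1), x_j.reshape(-1, 1)
    # univariate regression model of y on feature i (raw scores, hence a regressor)
    raw_i = xgboost.XGBRegressor(n_estimators=100).fit(x_i, y).predict(x_i)
    # try to reproduce that model's output using feature j alone
    pred_j = xgboost.XGBRegressor(n_estimators=100).fit(x_j, raw_i).predict(x_j)
    return 1.0 - max(r2_score(raw_i, pred_j), 0.0)  # 0 = redundant, 1 = not redundant

# e.g. two bill-amount columns from the credit-card data above
d = redundancy_distance(X_test["BILL_AMT1"].values,
                        X_test["BILL_AMT2"].values,
                        y_test.values)

shap.utils.hclust computes such distances for all feature pairs and feeds them into hierarchical clustering, which is what produces the scipy-style linkage matrix you printed (n_features − 1 = 22 rows, 4 columns).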

  2. The only information you can infer from the clustering is that some features are close from the model's predictive-ability perspective, given the sample. As a rule, omitting an additional feature will be lossy unless it is (i) white noise, (ii) structured but uncorrelated, in the broad sense, with the output, or (iii) something the model simply memorizes. A different model may have a different opinion on whether two features belong to the same "cluster".

In general, I would use SHAP for model or data explanation ("true to model" or "true to data") and for experiment planning. I wouldn't expect it to help me with feature selection while tuning models on already collected data. But it may depend on your particular objective, e.g. producing the most parsimonious model, or adjusting your data if you suspect overfitting.

UPDATE

from scipy.cluster import hierarchy

# plot the clustering as a dendrogram, labelling the leaves with the feature names
hierarchy.dendrogram(clustering, labels=X_train.columns);

[dendrogram of the feature clustering]

hierarchy.cut_tree(clustering, n_clusters=5)

or

hierarchy.cut_tree(clustering, height=.5)

The output is an array of cluster labels, one per feature; features that share a label are in the same cluster. Choose whichever cut criterion (a fixed number of clusters or a height threshold) suits your case.
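
Since clustering is a scipy-style linkage matrix (which is why dendrogram and cut_tree accept it directly), one possible way to turn those labels into an actual dimensionality reduction is sketched below: cut the tree at some height and keep only the feature with the largest mean |SHAP| value in each cluster. The 0.5 cut height and the "keep the strongest feature" rule are choices made here for illustration, not anything prescribed by shap.

import numpy as np
import pandas as pd
from scipy.cluster import hierarchy

# one cluster label per feature; the 0.5 threshold is an arbitrary choice
labels = hierarchy.cut_tree(clustering, height=0.5).flatten()

# global importance of each feature: mean absolute SHAP value over the test set
mean_abs_shap = np.abs(shap_values.values).mean(axis=0)

summary = pd.DataFrame({
    "feature": X_test.columns,
    "cluster": labels,
    "mean_abs_shap": mean_abs_shap,
})

# keep the most important feature of each cluster, drop the rest as redundant
selected = (summary.sort_values("mean_abs_shap", ascending=False)
                   .groupby("cluster")
                   .head(1)["feature"]
                   .tolist())

print(f"kept {len(selected)} of {X_test.shape[1]} features:", selected)

With 10,000+ features you would tune the cut height (or n_clusters) numerically rather than eyeballing the dendrogram, and ideally validate the reduced feature set with cross-validated model performance.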

answered by Sergey Bushmanov