I'm currently using xgb.train(...), which returns a Booster, but I'd like to use RFE to select the best 100 features. The returned Booster cannot be used in RFE because it is not a sklearn estimator. XGBClassifier is the sklearn API for the xgboost library; however, I am not able to get the same results with it as with xgb.train(...) (10% worse on ROC AUC). I've tried the sklearn boosters, but they don't give similar results either. I've also tried wrapping xgb.train(...) in a class to add the sklearn estimator methods, but there are just too many to change. Is there some way to use xgb.train(...) along with RFE from sklearn?
XGBoost does (1) for you. XGBoost does not do (2)/(3) for you. So you still have to do feature engineering yourself. Only a deep learning model could replace feature extraction for you.
XGBoost may assume that encoded integer values for each input variable have an ordinal relationship. For example that 'left-up' encoded as 0 and 'left-low' encoded as 1 for the breast-quad variable have a meaningful relationship as integers.
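One common way to avoid implying an order is one-hot encoding, where each category gets its own 0/1 column. The sketch below is a minimal pure-Python illustration, not part of XGBoost; the helper name one_hot and the extra category 'right-up' are assumptions made up for the example.

```python
def one_hot(values):
    """Return (categories, rows); each row is a 0/1 indicator list."""
    categories = sorted(set(values))
    index = {cat: i for i, cat in enumerate(categories)}
    rows = []
    for value in values:
        row = [0] * len(categories)
        row[index[value]] = 1  # mark only the column of this category
        rows.append(row)
    return categories, rows


# Hypothetical sample of the breast-quad variable.
breast_quad = ["left-up", "left-low", "right-up", "left-up"]
categories, encoded = one_hot(breast_quad)
```

With this encoding no category is numerically "larger" than another, so a tree split can only separate one category from the rest.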
Strictly speaking, tree-based methods do not require explicit data standardisation. XGBoost with a tree base learner would not therefore require this kind of preprocessing.
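The reason is that a tree split depends only on the ordering of a feature's values, and standardisation preserves that ordering. The following is a rough sketch with a single hand-written regression stump (not XGBoost's own split finder); the data and the helper best_split are made up for illustration.

```python
from statistics import mean, pstdev


def best_split(xs, ys):
    """Return the set of row indices sent left by the best squared-error split."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    best_sse, best_left = float("inf"), None
    for k in range(1, len(order)):
        if xs[order[k - 1]] == xs[order[k]]:
            continue  # no threshold fits between two equal values
        left, right = order[:k], order[k:]
        sse = 0.0
        for group in (left, right):
            group_mean = mean(ys[i] for i in group)
            sse += sum((ys[i] - group_mean) ** 2 for i in group)
        if sse < best_sse:
            best_sse, best_left = sse, frozenset(left)
    return best_left


# Toy feature on a large scale, and the same feature standardised.
x = [1200.0, 50.0, 3000.0, 700.0, 90.0, 2500.0]
y = [1.0, 0.0, 1.0, 1.0, 0.0, 1.0]
mu, sigma = mean(x), pstdev(x)
x_std = [(v - mu) / sigma for v in x]

same_partition = best_split(x, y) == best_split(x_std, y)
```

Both versions of the feature send the same rows to the left child, so the fitted tree is unchanged apart from the numeric value of the threshold.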
Boosting is a technique in machine learning that has been shown to produce models with high predictive accuracy. One of the most common ways to implement boosting in practice is to use XGBoost, short for “extreme gradient boosting.” This tutorial provides a step-by-step example of how to use XGBoost to fit a boosted model in R.
Gradient Boosted Trees is a technique whose base learner is CART (Classification and Regression Trees). XGBoost is an implementation of gradient boosted decision trees. XGBoost models dominate many Kaggle competitions.
XGBoost can be installed as a standalone library and an XGBoost model can be developed using the scikit-learn API. The first step is to install the XGBoost library if it is not already installed. This can be achieved using the pip python package manager on most platforms; for example: sudo pip install xgboost
Scikit-Learn API: It is a scikit-learn wrapper interface for XGBoost. It allows using XGBoost in a scikit-learn compatible way, the same way you would use any native scikit-learn model. Note that with the Learning API you pass evaluation sets and metrics directly to xgb.train(...), whereas with the scikit-learn API they go through eval_set and eval_metric on the estimator, or you compute the metric yourself from the predictions.
For this kind of problem, I created shap-hypetune, a Python package for simultaneous hyperparameter tuning and feature selection for gradient boosting models.
In your case, this enables you to perform RFE with XGBClassifier in a very simple and intuitive way:
from xgboost import XGBClassifier
from shaphypetune import BoostRFE

# Wrap the sklearn-style XGBoost classifier in recursive feature elimination
model = BoostRFE(XGBClassifier(), min_features_to_select=1, step=1)
model.fit(X_train, y_train, eval_set=[(X_valid, y_valid)], early_stopping_rounds=6, verbose=0)
pred = model.predict(X_test)
As you can see, you can use all the fitting options available in the standard XGB API, like early_stopping_rounds or custom metrics, to customize the training process.
You can also use shap-hypetune for parameter tuning (optionally at the same time as feature selection), or for feature selection with RFE or Boruta using SHAP feature importance. Full example available here
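If you would rather keep xgb.train(...) itself, RFE can also be written as a plain loop, since all it needs is a per-feature importance after each refit. The sketch below is one possible way to do that, not sklearn's RFE class. The names rfe and make_xgb_importance are hypothetical, ranking by "gain" is an assumption, and X is assumed to be a 2-D NumPy array whose features are referred to by column index. The demo at the bottom uses made-up fixed scores so the loop can be run without xgboost.

```python
def rfe(features, importance_fn, n_features_to_select, step=1):
    """Drop the `step` least important features until the target size is reached.

    importance_fn(kept) should refit a model on the kept features and return
    one importance score per kept feature, in the same order.
    """
    kept = list(features)
    while len(kept) > n_features_to_select:
        scores = importance_fn(kept)
        n_drop = min(step, len(kept) - n_features_to_select)
        # Positions of the n_drop lowest-scoring features.
        weakest = set(sorted(range(len(kept)), key=lambda i: scores[i])[:n_drop])
        kept = [f for i, f in enumerate(kept) if i not in weakest]
    return kept


def make_xgb_importance(X, y, params, num_boost_round=100):
    """Build an importance_fn backed by xgb.train (xgboost is imported lazily)."""
    import xgboost as xgb

    def importance_fn(kept):
        dtrain = xgb.DMatrix(X[:, kept], label=y)
        booster = xgb.train(params, dtrain, num_boost_round=num_boost_round)
        # Without explicit names the columns are called f0, f1, ...; features
        # never used in a split are absent from the dict, so default to 0.
        gain = booster.get_score(importance_type="gain")
        return [gain.get(f"f{j}", 0.0) for j in range(len(kept))]

    return importance_fn


# Stand-in importance function with fixed, made-up scores for four features.
toy_scores = {0: 0.9, 1: 0.1, 2: 0.5, 3: 0.05}
selected = rfe([0, 1, 2, 3], lambda kept: [toy_scores[f] for f in kept], 2)
```

With real data, the call would look like rfe(list(range(X.shape[1])), make_xgb_importance(X, y, params), 100), and the final Booster can then be trained on the returned columns with exactly the same params as before.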