I'm following a tutorial on DataCamp, and one thing it mentions is using your GPU for faster training. It even goes so far as to say it is "blazing fast".
I, however, am seeing the opposite. For the code block below, with 10k boosting rounds, I see ~30 seconds with "hist" passed in my params versus just over a minute with "gpu_hist".
GPU utilization caps out at 40% when using "gpu_hist", while all 24 logical cores sit at 100% when using "hist".
params = {"objective": "reg:squarederror", "tree_method": "gpu_hist", "subsample": 0.8,
"colsample_bytree": 0.8}
evals = [(dtrain_reg, "train"),(dtest_reg, "validation")]
n = 10000
model = xgb.train(
params=params,
dtrain=dtrain_reg,
num_boost_round=n,
evals=evals,
verbose_eval=50,
)
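For reference, here is a minimal sketch of how I compare the two configurations (time.perf_counter is just one way to measure; dtrain_reg, dtest_reg, and evals are the objects from the tutorial code above):

import time

for method in ("hist", "gpu_hist"):
    params["tree_method"] = method
    start = time.perf_counter()
    xgb.train(
        params=params,
        dtrain=dtrain_reg,
        num_boost_round=n,
        evals=evals,
        verbose_eval=False,
    )
    print(f"{method}: {time.perf_counter() - start:.1f} s")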
I'm running this in a Jupyter notebook in VS Code.
I've asked ChatGPT, searched online, and even asked a mentor in the course I'm taking, but I have not been able to diagnose why it takes so much longer with "gpu_hist" than with plain "hist".
There is a similar question here on Stack Overflow from 4 months ago that has zero responses.
It's been a while, but a reply is probably still warranted since I spent some time figuring this out. Try this code:
from sklearn.datasets import fetch_california_housing
import xgboost as xgb

# Fetch dataset using sklearn
data = fetch_california_housing()
X = data.data
y = data.target

num_round = 1000
param = {
    "eta": 0.05,
    "max_depth": 10,
    "tree_method": "hist",
    "device": "cpu",  # switch between "cpu" and "cuda"
    "nthread": 24,    # number of CPU threads to use
    "seed": 42,
}

# GPU-accelerated or CPU multicore training
dtrain = xgb.DMatrix(X, label=y, feature_names=data.feature_names)
model = xgb.train(param, dtrain, num_round)
On 24 threads I am getting:
CPU times: user 1min 9s, sys: 43.7 ms, total: 1min 9s; Wall time: 2.95 s
On 32 threads I am getting:
CPU times: user 1min 40s, sys: 33.8 ms, total: 1min 40s; Wall time: 3.19 s
With a GPU I am getting:
CPU times: user 6.47 s, sys: 9.98 ms, total: 6.48 s; Wall time: 5.96 s
XGBoost training does not scale particularly well with increasing parallelism, and sending it to a GPU can actually increase wall-clock training time, as the numbers above show.
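If you want to reproduce the comparison, a minimal sketch along these lines works, reusing param, dtrain, and num_round from above (timed_run is a hypothetical helper; "cuda" assumes XGBoost >= 2.0 with a CUDA-enabled build):

import time

# Hypothetical helper: time one training run for a given device / thread count
def timed_run(device, nthread):
    run_param = dict(param, device=device, nthread=nthread)
    start = time.perf_counter()
    xgb.train(run_param, dtrain, num_round)
    return time.perf_counter() - start

for device, nthread in [("cpu", 24), ("cpu", 32), ("cuda", 24)]:
    print(f"{device}, {nthread} threads: {timed_run(device, nthread):.2f} s")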
On the other hand, if you need to compute feature importance with SHAP:

import shap  # optional here: pred_contribs computes the SHAP values natively
model.set_param({"device": "gpu"})  # "cpu" or "gpu"
shap_values = model.predict(dtrain, pred_contribs=True)
On 32 threads I am getting:
CPU times: user 43min 43s, sys: 54.2 ms, total: 43min 43s; Wall time: 1min 23s
With a GPU I am getting:
CPU times: user 3.06 s, sys: 28 ms, total: 3.09 s; Wall time: 3.09 s
So SHAP computation is highly accelerated on a GPU.
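If it helps, the contribution matrix can then be turned into a simple global importance ranking; a minimal sketch (the last column returned by pred_contribs is the bias term, so it is dropped here):

import numpy as np

# shap_values has shape (n_samples, n_features + 1); the last column is the bias term
mean_abs_shap = np.abs(shap_values[:, :-1]).mean(axis=0)
for name, score in sorted(zip(data.feature_names, mean_abs_shap), key=lambda t: -t[1]):
    print(f"{name}: {score:.3f}")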
There are more examples of XGBoost on CPU vs. GPU here.
The GPUTreeShap paper is here.
All timings are on an NVIDIA GeForce RTX 3090 and an AMD Ryzen 9 5950X (16 cores / 32 threads).