I am trying to use sample_weight in XGBClassifier to improve the performance of one of our models. However, it seems like the sample_weight parameter is not working as expected, and sample_weight is very important for this problem. Please see my code below.
Basically the fitting of the model does not seem to take the sample_weight parameter into account: it starts at an AUC of 0.5 and drops from there, recommending 0 or 1 n_estimators. There is nothing wrong with the underlying data; we have constructed a very good model using sample weights with another tool, getting a good Gini. The sample data provided does not properly exhibit this behavior, but given a consistent random seed throughout we can see that the model objects are identical whether a weight/sample_weight is provided or not.
I have tried different components from the xgboost library that similarly have parameters where one can define weights, but no luck:
XGBClassifier.fit()
XGBClassifier.train()
Xgboost()
XGB.fit()
XGB.train()
Dmatrix()
XGBGridSearchCV()
I have also tried passing fit_params=fit_params as a parameter, as well as the weight=weight and sample_weight=sample_weight variations.
Code:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier
df = pd.DataFrame(columns=['GB_FLAG','sample_weight','f1','f2','f3','f4','f5'])
df.loc[0] = [0,1,2046,10,625,8000,2072]
df.loc[1] = [0,0.86836,8000,10,705,8800,28]
df.loc[2] = [1,1,2303.62,19,674,3000,848]
df.loc[3] = [0,0,2754.8,2,570,16300,46]
df.loc[4] = [1,0.103474,11119.81,6,0,9500,3885]
df.loc[5] = [1,0,1050.83,19,715,3000,-5]
df.loc[6] = [1,0.011098,7063.35,11,713,19700,486]
df.loc[7] = [0,0.972176,6447.16,18,681,11300,1104]
df.loc[8] = [1,0.054237,7461.27,18,0,0,4]
df.loc[9] = [0,0.917026,4600.83,8,0,10400,242]
df.loc[10] = [0,0.670026,2041.8,21,716,11000,3]
df.loc[11] = [1,0.112416,2413.77,22,750,4600,271]
df.loc[12] = [0,0,251.81,17,806,3800,0]
df.loc[13] = [1,0.026263,20919.2,17,684,8100,1335]
df.loc[14] = [0,1,1504.58,15,621,6800,461]
df.loc[15] = [0,0.654429,9227.69,4,0,22500,294]
df.loc[16] = [0,0.897051,6960.31,22,674,5400,188]
df.loc[17] = [1,0.209862,4481.42,18,745,11600,0]
df.loc[18] = [0,1,2692.96,22,651,12800,2035]
y = np.asarray(df['GB_FLAG'])
X = np.asarray(df.drop(['GB_FLAG'], axis=1))
X_traintest, X_valid, y_traintest, y_valid = train_test_split(X, y,
train_size=0.7, stratify=y, random_state=1337)
# the first column of X holds the sample weights; split it off from the features
traintest_sample_weight = X_traintest[:,0]
valid_sample_weight = X_valid[:,0]
X_traintest = X_traintest[:,1:]
X_valid = X_valid[:,1:]
model = XGBClassifier()
eval_set = [(X_valid, y_valid)]
model.fit(X_traintest, y_traintest, eval_set=eval_set, eval_metric="auc",
          early_stopping_rounds=50, verbose=True,
          sample_weight=traintest_sample_weight)
How do I use sample weights when using xgboost for modeling?
The problem is that weights for the evaluation datasets are not propagated by the sklearn API, so you seem to be doomed to use the native API. Just replace the lines starting with your model definition with the following code:
from xgboost import train, DMatrix

# weights are attached directly to each DMatrix, so both the training and the
# evaluation set are weighted; binary:logistic matches XGBClassifier's default objective
trainDmatrix = DMatrix(X_traintest, label=y_traintest, weight=traintest_sample_weight)
validDmatrix = DMatrix(X_valid, label=y_valid, weight=valid_sample_weight)

booster = train({'objective': 'binary:logistic', 'eval_metric': 'auc'}, trainDmatrix,
                num_boost_round=100,
                evals=[(trainDmatrix, 'train'), (validDmatrix, 'valid')],
                early_stopping_rounds=50, verbose_eval=10)
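Since train returns a Booster rather than a scikit-learn estimator, scoring also goes through a DMatrix. A minimal sketch of evaluating the validation set, assuming a version where early stopping sets best_ntree_limit (newer releases expose best_iteration / iteration_range instead):
from sklearn.metrics import roc_auc_score

# limit the prediction to the best iteration found by early stopping
valid_pred = booster.predict(validDmatrix, ntree_limit=booster.best_ntree_limit)
print(roc_auc_score(y_valid, valid_pred, sample_weight=valid_sample_weight))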
UPD: The xgboost community is aware of this, and there is a discussion and even a PR for it: https://github.com/dmlc/xgboost/issues/1804. However, for some reason it was never propagated to v0.71.
UPD2: After pinging that issue, the relevant code update was revived and the PR was merged into master in time for the upcoming xgboost 0.72 release on 1 June 2018.
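For completeness, once that change is released the sklearn wrapper should accept weights for the evaluation sets as well. A rough sketch of what the original fit call would look like on xgboost >= 0.72, assuming the new parameter is named sample_weight_eval_set (one weight array per eval_set entry):
model = XGBClassifier()
model.fit(X_traintest, y_traintest,
          sample_weight=traintest_sample_weight,
          eval_set=[(X_valid, y_valid)],
          sample_weight_eval_set=[valid_sample_weight],
          eval_metric="auc", early_stopping_rounds=50, verbose=True)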