Using sample weights for training xgboost (0.7) classifier

I am trying to use sample_weight in XGBClassifier to improve the performance of one of our models.

However, it seems like the sample_weight parameter is not working as expected. sample_weight is very important for this problem. Please see my code below.

Basically the fitting of the model does not seem to take the sample_weight parameter into account: validation AUC starts at 0.5 and drops from there, and early stopping recommends 0 or 1 n_estimators. There is nothing wrong with the underlying data; we have built a very good model on the same data with sample weights in another tool, with a good Gini.

The sample data below does not fully reproduce this behavior, but with a consistent random seed throughout, the fitted model objects come out identical whether or not a weight/sample_weight is provided.

I have tried different components of the xgboost library that similarly take a weight parameter, but with no luck:

XGBClassifier.fit()
XGBClassifier.train()
Xgboost()
XGB.fit()
XGB.train()
DMatrix()
XGBGridSearchCV()

I have also tried passing fit_params=fit_params as a parameter, as well as the weight=weight and sample_weight=sample_weight variations.

Code:

import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from xgboost import XGBClassifier

df = pd.DataFrame(columns = 
['GB_FLAG','sample_weight','f1','f2','f3','f4','f5'])
df.loc[0] = [0,1,2046,10,625,8000,2072]
df.loc[1] = [0,0.86836,8000,10,705,8800,28]
df.loc[2] = [1,1,2303.62,19,674,3000,848]
df.loc[3] = [0,0,2754.8,2,570,16300,46]
df.loc[4] = [1,0.103474,11119.81,6,0,9500,3885]
df.loc[5] = [1,0,1050.83,19,715,3000,-5]
df.loc[6] = [1,0.011098,7063.35,11,713,19700,486]
df.loc[7] = [0,0.972176,6447.16,18,681,11300,1104]
df.loc[8] = [1,0.054237,7461.27,18,0,0,4]
df.loc[9] = [0,0.917026,4600.83,8,0,10400,242]
df.loc[10] = [0,0.670026,2041.8,21,716,11000,3]
df.loc[11] = [1,0.112416,2413.77,22,750,4600,271]
df.loc[12] = [0,0,251.81,17,806,3800,0]
df.loc[13] = [1,0.026263,20919.2,17,684,8100,1335]
df.loc[14] = [0,1,1504.58,15,621,6800,461]
df.loc[15] = [0,0.654429,9227.69,4,0,22500,294]
df.loc[16] = [0,0.897051,6960.31,22,674,5400,188]
df.loc[17] = [1,0.209862,4481.42,18,745,11600,0]
df.loc[18] = [0,1,2692.96,22,651,12800,2035]

y = np.asarray(df['GB_FLAG'])
X = np.asarray(df.drop(['GB_FLAG'], axis=1))

X_traintest, X_valid, y_traintest, y_valid = train_test_split(X, y, 
train_size=0.7, stratify=y, random_state=1337)
traintest_sample_weight = X_traintest[:,0]
valid_sample_weight = X_valid[:,0]

X_traintest = X_traintest[:,1:]
X_valid = X_valid[:,1:]

model = XGBClassifier()
eval_set = [(X_valid, y_valid)]
model.fit(X_traintest, y_traintest, eval_set=eval_set, eval_metric="auc",
          early_stopping_rounds=50, verbose=True,
          sample_weight=traintest_sample_weight)
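
Incidentally, here is how I confirmed that the weights are being ignored: fit two models with identical settings, one with and one without sample_weight, and compare the dumped trees (a quick sketch; get_booster() is the accessor in recent xgboost versions, older ones expose booster() instead):

# Fit twice with identical settings, with and without weights.
m_unweighted = XGBClassifier()
m_unweighted.fit(X_traintest, y_traintest)
m_weighted = XGBClassifier()
m_weighted.fit(X_traintest, y_traintest, sample_weight=traintest_sample_weight)

# If the tree dumps match, the weights had no effect on the fit.
print(m_unweighted.get_booster().get_dump() == m_weighted.get_booster().get_dump())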

How do I use sample weights when using xgboost for modeling?

asked by Ernie Halberg

1 Answer

The problem is that the sklearn API does not propagate weights to the evaluation datasets.

So for now you seem to be doomed to use the native API. Just replace everything from your model definition onward with the following code:

from xgboost import train, DMatrix

# DMatrix carries the per-row weights for both the training and evaluation data
trainDmatrix = DMatrix(X_traintest, label=y_traintest, weight=traintest_sample_weight)
validDmatrix = DMatrix(X_valid, label=y_valid, weight=valid_sample_weight)

# binary:logistic is the appropriate objective for a binary classifier
booster = train({'objective': 'binary:logistic', 'eval_metric': 'auc'},
                trainDmatrix, num_boost_round=100,
                evals=[(trainDmatrix, 'train'), (validDmatrix, 'valid')],
                early_stopping_rounds=50, verbose_eval=10)
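
With early stopping in the native API, predicting at the best iteration looks roughly like this (a sketch assuming a 0.7x-era build, where early stopping sets best_ntree_limit on the booster; newer releases use iteration_range instead):

# Probabilities on the validation set, truncated at the best iteration
valid_pred = booster.predict(validDmatrix, ntree_limit=booster.best_ntree_limit)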

UPD: The xgboost community is aware of this, and there is a discussion and even a PR for it: https://github.com/dmlc/xgboost/issues/1804. However, for some reason it never made it into v0.71.

UPD2: After pinging that issue, the relevant code change was revived, and the PR was merged into master in time for the upcoming xgboost 0.72 release on 1 June 2018.
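
For completeness, once you are on 0.72+ the sklearn API should accept eval-set weights directly. If I read the merged change correctly, the parameter is sample_weight_eval_set and takes one weight array per eval set (a sketch, not verified against every release):

model = XGBClassifier()
model.fit(X_traintest, y_traintest,
          sample_weight=traintest_sample_weight,
          eval_set=[(X_valid, y_valid)],
          sample_weight_eval_set=[valid_sample_weight],
          eval_metric="auc", early_stopping_rounds=50, verbose=True)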

answered by Mischa Lisovyi