I am trying to run a Python notebook (link). At line below In [446]: where author train XGBoost, I am getting an error
ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields StateHoliday, Assortment
# XGB with xgboost library
dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest = xgb.DMatrix(X_test[predictors], y_test)
watchlist = [(dtrain, 'train'), (dtest, 'test')]
xgb_model = xgb.train(params, dtrain, 300, evals = watchlist,
                      early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)
Here is the minimal code for testing
import pickle
import numpy as np
import xgboost as xgb
from sklearn.model_selection import train_test_split
with open('train_store', 'rb') as f:
    train_store = pickle.load(f)
train_store.shape
predictors = ['Store', 'DayOfWeek', 'Open', 'Promo', 'StateHoliday', 'SchoolHoliday', 'Year', 'Month', 'Day', 
              'WeekOfYear', 'StoreType', 'Assortment', 'CompetitionDistance', 'CompetitionOpenSinceMonth', 
              'CompetitionOpenSinceYear', 'Promo2', 'Promo2SinceWeek', 'Promo2SinceYear', 'CompetitionOpen', 
              'PromoOpen']
y = np.log(train_store.Sales) # log transformation of Sales
X = train_store
# split the data into train/test set
X_train, X_test, y_train, y_test = train_test_split(X, y, 
                                                    test_size = 0.3, # 30% for the evaluation set
                                                    random_state = 42)
# base parameters
params = {
    'booster': 'gbtree', 
    'objective': 'reg:linear', # regression task
    'subsample': 0.8,          # 80% of data to grow trees and prevent overfitting
    'colsample_bytree': 0.85,  # 85% of features used
    'eta': 0.1, 
    'max_depth': 10, 
    'seed': 42} # for reproducible results
num_round = 60 # default 300
dtrain = xgb.DMatrix(X_train[predictors], y_train)
dtest  = xgb.DMatrix(X_test[predictors],  y_test)
watchlist = [(dtrain, 'train'), (dtest, 'test')]
xgb_model = xgb.train(params, dtrain, num_round, evals = watchlist,
                      early_stopping_rounds = 50, feval = rmspe_xg, verbose_eval = True)
Link to train_store data file: Link 1
Starting from version 1.5, XGBoost has experimental support for categorical data available for public testing.
Xgboost with label encoding for categorical variablesLabel encoding is used to transform categorical values into numerical values. Split data into training data set and test data set. Tune xgboost hyper-parameters. Train xgboost model with train data set.
"When using XGBoost we need to convert categorical variables into numeric." Not always, no. If booster=='gbtree' (the default), then XGBoost can handle categorical variables encoded as numeric directly, without needing dummifying/one-hotting.
As far as XGBoost is concerned, one-hot-encoding becomes necessary as XGBoost accepts only numeric features.
I met the exactly same issue when i am doing Rossmann Sales Prediction Project. It seems like new version of xgboost do not accept the datatype of StateHoliday, Assortment, and StoreType. you can check the datatype as Mykhailo Lisovyi suggested by using
print(test_train.dtypes)
you need to replace test_train here with your X_train
you might can get
DayOfWeek                      int64
Promo                          int64
StateHoliday                   int64
SchoolHoliday                  int64
StoreType                     object
Assortment                    object
CompetitionDistance          float64
CompetitionOpenSinceMonth    float64
CompetitionOpenSinceYear     float64
Promo2                         int64
Promo2SinceWeek              float64
Promo2SinceYear              float64
Year                           int64
Month                          int64
Day                            int64
the error raised up to object type. You can convert them with
from sklearn import preprocessing
lbl = preprocessing.LabelEncoder()
test_train['StoreType'] = lbl.fit_transform(test_train['StoreType'].astype(str))
test_train['Assortment'] = lbl.fit_transform(test_train['Assortment'].astype(str))
Everything will go well after those steps.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With