I am trying to find the best parameters for a LightGBM model using GridSearchCV from sklearn.model_selection. I have not been able to find a solution that actually works.
I have managed to put together some partly working code:
import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import KFold
np.random.seed(1)
train = pd.read_csv('train.csv')
test = pd.read_csv('test.csv')
y = pd.read_csv('y.csv')
y = y.values.ravel()
print(train.shape, test.shape, y.shape)
categoricals = ['COL_A','COL_B']
indexes_of_categories = [train.columns.get_loc(col) for col in categoricals]
gkf = KFold(n_splits=5, shuffle=True, random_state=42).split(X=train, y=y)
param_grid = {
    'num_leaves': [31, 127],
    'reg_alpha': [0.1, 0.5],
    'min_data_in_leaf': [30, 50, 100, 300, 400],
    'lambda_l1': [0, 1, 1.5],
    'lambda_l2': [0, 1]
    }
lgb_estimator = lgb.LGBMClassifier(boosting_type='gbdt',  objective='binary', num_boost_round=2000, learning_rate=0.01, metric='auc',categorical_feature=indexes_of_categories)
gsearch = GridSearchCV(estimator=lgb_estimator, param_grid=param_grid, cv=gkf)
lgb_model = gsearch.fit(X=train, y=y)
print(lgb_model.best_params_, lgb_model.best_score_)
This seems to be working, but with a UserWarning:
categorical_feature keyword has been found in params and will be ignored. Please use categorical_feature argument of the Dataset constructor to pass this parameter.
I am looking for a working solution, or perhaps a suggestion on how to ensure that LightGBM accepts the categorical arguments in the above code.
Brief Overview of Grid Search
1. Prepare the data.
2. Identify the model's hyperparameters to optimize, then select the hyperparameter values to test.
3. Assess the error score for each combination in the hyperparameter grid.
4. Select the hyperparameter combination with the best error metric.
Grid search is the simplest algorithm for hyperparameter tuning. Basically, we divide the domain of the hyperparameters into a discrete grid. Then, we try every combination of values of this grid, calculating some performance metrics using cross-validation.
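The procedure described above can be sketched by hand in a few lines. This is a minimal illustration using only the standard library, not what GridSearchCV does internally: the toy data, the fit_line and cv_score helpers, and the alpha / fit_intercept grid are all made up for the example.

```python
from itertools import product
from statistics import mean

# Toy data (made up): y is roughly 3*x + 1 with a small deterministic wobble.
X = [float(i) for i in range(20)]
y = [3.0 * x + 1.0 + (0.5 if i % 2 else -0.5) for i, x in enumerate(X)]

def fit_line(xs, ys, alpha, fit_intercept):
    """Ridge-style 1-D fit; alpha shrinks the slope towards zero."""
    x0 = mean(xs) if fit_intercept else 0.0
    y0 = mean(ys) if fit_intercept else 0.0
    sxy = sum((a - x0) * (b - y0) for a, b in zip(xs, ys))
    sxx = sum((a - x0) ** 2 for a in xs)
    slope = sxy / (sxx + alpha)
    return slope, y0 - slope * x0

def cv_score(alpha, fit_intercept, n_folds=5):
    """Mean negative MSE over n_folds held-out folds (higher is better)."""
    fold_scores = []
    for fold in range(n_folds):
        train_idx = [i for i in range(len(X)) if i % n_folds != fold]
        test_idx = [i for i in range(len(X)) if i % n_folds == fold]
        slope, intercept = fit_line(
            [X[i] for i in train_idx], [y[i] for i in train_idx],
            alpha, fit_intercept)
        mse = mean((y[i] - (slope * X[i] + intercept)) ** 2 for i in test_idx)
        fold_scores.append(-mse)
    return mean(fold_scores)

# The discrete grid: every combination of these values is evaluated.
param_grid = {'alpha': [0.0, 1.0, 10.0], 'fit_intercept': [False, True]}
names = list(param_grid)
results = []
for values in product(*(param_grid[n] for n in names)):
    params = dict(zip(names, values))
    results.append((cv_score(**params), params))

# Keep the combination with the best cross-validated score.
best_score, best_params = max(results, key=lambda r: r[0])
print(best_params, best_score)
```

The number of fits grows as the product of the list lengths in the grid times the number of folds, which is why large grids get expensive quickly.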
Cross-Validation and GridSearchCV
In GridSearchCV, cross-validation is performed along with the grid search. Cross-validation is used while training the model. Before training a model, we normally divide the data into two parts: train data and test data. Cross-validation goes one step further and repeatedly splits the train data into a part used for fitting and a part held out for validation.
grid.best_score_ is the mean score over all cv folds for the best single combination of the parameters you specify in tuned_params. In order to access other relevant details about the grid searching process, you can look at the grid.
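A small sketch of how this can be checked on a fitted search. The synthetic data from make_classification, the LogisticRegression estimator and the values in tuned_params are assumptions chosen only to keep the example self-contained; cv_results_ and best_index_ are standard attributes of a fitted GridSearchCV.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic data, only for illustration.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)

tuned_params = {'C': [0.01, 0.1, 1.0, 10.0]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), tuned_params, cv=5)
grid.fit(X, y)

# cv_results_ holds per-fold and aggregated scores for every combination.
results = grid.cv_results_
best = grid.best_index_
fold_scores = [results[f'split{k}_test_score'][best] for k in range(5)]

print(grid.best_params_, grid.best_score_)
print(np.mean(fold_scores))  # same value as best_score_, up to rounding
```

Wrapping cv_results_ in pandas.DataFrame(grid.cv_results_) gives one row per parameter combination, which is usually easier to read.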
As the warning states, categorical_feature is not one of the LGBMModel constructor arguments. It is relevant when the lgb.Dataset is instantiated, which in the case of the sklearn API is done directly in the fit() method (see the LightGBM documentation for fit()). Thus, in order to use it in the GridSearchCV optimisation, one has to pass it as a keyword argument to the GridSearchCV.fit() method in sklearn v0.19.1 and later, or as an additional fit_params argument in the GridSearchCV instantiation in older sklearn versions.
In case you are struggling with how to pass the fit_params, which happened to me as well, this is how you should do that:
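# categorical_feature belongs to fit(), not to the estimator constructor
fit_params = {'categorical_feature': indexes_of_categories}
clf = GridSearchCV(model, param_grid, cv=n_folds)
# extra keyword arguments given to fit() are forwarded to the estimator's fit()
clf.fit(x_train, y_train, **fit_params)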
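To see the forwarding itself without LightGBM installed, here is a sketch that swaps in a hypothetical stand-in estimator, EchoClassifier, which only records the categorical_feature value its fit() receives. The class, its dummy parameter and the random data are all made up for the illustration; with lgb.LGBMClassifier as the estimator the GridSearchCV calls would have the same shape.

```python
import numpy as np
from sklearn.base import BaseEstimator, ClassifierMixin
from sklearn.model_selection import GridSearchCV

class EchoClassifier(ClassifierMixin, BaseEstimator):
    """Hypothetical stand-in: predicts the majority class, records fit kwargs."""

    def __init__(self, dummy=0):
        self.dummy = dummy  # placeholder hyperparameter so there is a grid

    def fit(self, X, y, categorical_feature=None):
        # Remember what GridSearchCV forwarded to us.
        self.categorical_feature_ = categorical_feature
        self.classes_, counts = np.unique(y, return_counts=True)
        self.majority_ = self.classes_[np.argmax(counts)]
        return self

    def predict(self, X):
        return np.full(len(X), self.majority_)

# Random data, only for illustration.
rng = np.random.RandomState(0)
x_train = rng.rand(60, 3)
y_train = np.array([0, 1, 1] * 20)

fit_params = {'categorical_feature': [0, 2]}
clf = GridSearchCV(EchoClassifier(), {'dummy': [0, 1]}, cv=3)
clf.fit(x_train, y_train, **fit_params)

# The refitted best estimator received the same keyword argument.
print(clf.best_estimator_.categorical_feature_)
```

One caveat: if a fit parameter is array-like with the same length as the number of samples, GridSearchCV slices it per fold; a short list of column indexes like the one here is passed through unchanged.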