I want to use LightGBM to predict the tradeMoney of house, but I get troubles when I have specified categorical_feature in the lgb.Dataset of LightGBM.
I get data.dtypes as follows:
type(train)
pandas.core.frame.DataFrame
train.dtypes
area                  float64
rentType               object
houseFloor             object
totalFloor              int64
houseToward            object
houseDecoration        object
region                 object
plate                  object
buildYear               int64
saleSecHouseNum         int64
subwayStationNum        int64
busStationNum           int64
interSchoolNum          int64
schoolNum               int64
privateSchoolNum        int64
hospitalNum             int64
drugStoreNum            int64
And I use LightGBM to train it as follows:
categorical_feats = ['rentType', 'houseFloor', 'houseToward', 'houseDecoration', 'region', 'plate']
folds = KFold(n_splits=5, shuffle=True, random_state=2333)
oof_lgb = np.zeros(len(train))
predictions_lgb = np.zeros(len(test))
feature_importance_df = pd.DataFrame()
for fold_, (trn_idx, val_idx) in enumerate(folds.split(train.values, target.values)):
    print("fold {}".format(fold_))
    trn_data = lgb.Dataset(train.iloc[trn_idx], label=target.iloc[trn_idx], categorical_feature=categorical_feats)
    val_data = lgb.Dataset(train.iloc[val_idx], label=target.iloc[val_idx], categorical_feature=categorical_feats)
    num_round = 10000
    clf = lgb.train(params, trn_data, num_round, valid_sets = [trn_data, val_data], verbose_eval=500, early_stopping_rounds = 200)
    oof_lgb[val_idx] = clf.predict(train.iloc[val_idx], num_iteration=clf.best_iteration)
    predictions_lgb += clf.predict(test, num_iteration=clf.best_iteration) / folds.n_splits
print("CV Score: {:<8.5f}".format(r2_score(target, oof_lgb)))
BUT it still gives such error messages even if I have specified the categorical_features.
ValueError: DataFrame.dtypes for data must be int, float or bool. Did not expect the data types in fields rentType, houseFloor, houseToward, houseDecoration, region, plate
And here are the requirements:
LightGBM version: 2.2.3
Pandas version: 0.24.2
Python version: 3.6.8
|Anaconda, Inc.| (default, Feb 21 2019, 18:30:04) [MSC v.1916 64 bit (AMD64)]
Could anyone help me, please?
The problem is that lightgbm can handle only features, that are of category type, not object. Here the list of all possible categorical features is extracted. Such features are encoded into integers in the code. But nothing happens to objects and thus lightgbm complains, when it finds that not all features have been transformed into numbers.
So the solution is to do
for c in categorical_feats:
    train[c] = train[c].astype('category')
before your CV loop
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With