I need to do classification (in R, with xgboost or catboost, etc.) on some data with about 30 input variables in total. One of the variables is a factor with 100 possible levels ("n01", "n02", ..., "n100") in the training and test sets. So when I one-hot encode the existing data (with R's sparse.model.matrix), I get 100 new columns, one for each level of the factor.
So I build a model (train, test, etc.) on the dataset, including all 100 levels of the factor variable. But when I need to predict (use the model) on new data, there are only a few new samples, so fewer levels of this factor variable are present. Both xgboost and catboost then throw an error saying the feature names differ.
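For example (toy data, not my real columns), with sparse.model.matrix the number of dummy columns depends on the levels present in the data being encoded:
library(Matrix)
# training data: a factor with three levels -> three dummy columns
train <- data.frame(f = factor(c("n01", "n02", "n03")))
ncol(sparse.model.matrix(~ f - 1, train))
#> [1] 3
# new data at prediction time: only one level present -> one dummy column,
# so the feature names no longer match what the model was trained on
new <- data.frame(f = factor("n02"))
ncol(sparse.model.matrix(~ f - 1, new))
#> [1] 1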
Including the prediction data in the model-building data is not acceptable, because I need to build the model once and then use it for prediction on new data each time.
What are the possible ways to solve this problem?
Perhaps this shows you the way. You can encode your categorical variables as factors with the full set of levels and then use model.matrix to expand them into dummies.
Note how in my example the colnames(x_test) command shows the full set of dummies, even though only two levels of z are present in the test data.
Take care that the factor variables in your test data have the same levels as those in the training data. If the train and test data both originated from the same data frame, this requires no extra care. If the train and test data come from different sources, you need to set the levels of the factors in the test data to match those in the training data. I show this in this answer using |> factor(levels(train_data$z)).
library(tidyverse)
library(xgboost)

# simulated data ----
train_data <- expand_grid(
  z = as.factor(letters[1:10]),
  id = 1:10
) |>
  mutate(
    x1 = runif(100), x2 = runif(100), x3 = runif(100),
    y = x1 + x2 + 0.1 * x3 + rnorm(100)
  )

# the test data contains only two of the ten levels of z, but the factor
# still carries the full set of levels from the training data
test_data <- expand_grid(
  z = letters[4:5] |> factor(levels(train_data$z))
) |>
  mutate(
    x1 = runif(2), x2 = runif(2), x3 = runif(2),
    y = x1 + x2 + 0.1 * x3 + rnorm(2)
  )
print(train_data)
#> # A tibble: 100 × 6
#> z id x1 x2 x3 y
#> <fct> <int> <dbl> <dbl> <dbl> <dbl>
#> 1 a 1 0.885 0.586 0.437 0.195
#> 2 a 2 0.776 0.981 0.758 1.08
#> 3 a 3 0.578 0.835 0.419 1.15
#> 4 a 4 0.00173 0.619 0.864 1.97
#> 5 a 5 0.317 0.234 0.620 1.26
#> 6 a 6 0.100 0.556 0.651 1.92
#> 7 a 7 0.388 0.202 0.198 0.138
#> 8 a 8 0.374 0.972 0.0275 2.74
#> 9 a 9 0.944 0.776 0.140 0.911
#> 10 a 10 0.200 0.488 0.00789 -0.647
#> # ℹ 90 more rows
print(test_data)
#> # A tibble: 2 × 5
#> z x1 x2 x3 y
#> <fct> <dbl> <dbl> <dbl> <dbl>
#> 1 d 0.758 0.986 0.785 1.63
#> 2 e 0.0283 0.270 0.235 -1.59
# model ----
form <- y ~ z + x1 + x2 + x3
x <- model.matrix(form, train_data)[, -1] # drop the intercept column
y <- train_data$y
model <- xgboost(x, y, nrounds = 10)
#> [1] train-rmse:0.910737
#> [2] train-rmse:0.761354
#> [3] train-rmse:0.658591
#> [4] train-rmse:0.567893
#> [5] train-rmse:0.516402
#> [6] train-rmse:0.476570
#> [7] train-rmse:0.430633
#> [8] train-rmse:0.390250
#> [9] train-rmse:0.355300
#> [10] train-rmse:0.333189
# predict ----
# same formula, so the dummy expansion matches the training columns
x_test <- model.matrix(form, test_data)[, -1]
colnames(x_test)
#> [1] "zb" "zc" "zd" "ze" "zf" "zg" "zh" "zi" "zj" "x1" "x2" "x3"
predict(model, x_test)
#> [1] 1.3814828 0.3808365
Created on 2024-01-04 with reprex v2.0.2
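Since you want to fit the model once and score new data as it arrives, one option is to persist the factor levels alongside the model and re-apply them in the prediction script. Here is a minimal sketch of that workflow; the helper align_levels and the file names are my own, not part of xgboost:
# at training time: save the model and the levels seen during training
xgb.save(model, "model.xgb")
saveRDS(levels(train_data$z), "z_levels.rds")

# hypothetical helper: re-level incoming data so model.matrix expands
# to the full dummy set even when only a few levels are present
align_levels <- function(new_data, z_levels) {
  new_data$z <- factor(new_data$z, levels = z_levels)
  new_data
}

# later, in the prediction script (new_data: an incoming data frame
# with columns z, x1, x2, x3)
model    <- xgb.load("model.xgb")
z_levels <- readRDS("z_levels.rds")
new_data <- align_levels(new_data, z_levels)
x_new    <- model.matrix(~ z + x1 + x2 + x3, new_data)[, -1]
predict(model, x_new)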