The function regsubsets from the leaps package treats all levels of a categorical(factor) variable as independent dummy variables. I would like to change this behavior.
Example using the iris dataset, where Species is a factor variable:
library(leaps)
data(iris)
models <- regsubsets( Sepal.Length~., data = iris, nvmax = 4)
summary(models)
Subset selection object
Call: regsubsets.formula(Sepal.Length ~ ., data = iris, nvmax = 4)
5 Variables (and intercept)
Forced in Forced out
Sepal.Width FALSE FALSE
Petal.Length FALSE FALSE
Petal.Width FALSE FALSE
Speciesversicolor FALSE FALSE
Speciesvirginica FALSE FALSE
1 subsets of each size up to 4
Selection Algorithm: exhaustive
Sepal.Width Petal.Length Petal.Width Speciesversicolor Speciesvirginica
1 ( 1 ) " " "*" " " " " " "
2 ( 1 ) "*" "*" " " " " " "
3 ( 1 ) "*" "*" "*" " " " "
4 ( 1 ) "*" "*" " " "*" "*"
Please notice that regsubsets created the dummy variables Speciesversicolor and Speciesvirginica that now take up two of the four 'spaces' for variables in the fourth row. I would like Species to just take one space.
Is it possible to change this behavior of the regsubsets function?
A similar question has been asked before, but most of the comments (and I) agree that the question remains unanswered: https://stats.stackexchange.com/questions/152158/r-model-selection-with-categorical-variables-using-leaps-and-glmnet
Here is another similar and unanswered question: R: can I get regsubsets() to in-/exclude variables by groups?
As G.Grothendieck stated 'leaps only works with quantitative variables'.
There are however other packages that can perform feature selection using best subsets regression than can handle categorical variables with multiple levels.
It can be done in one line of code using the olsrr package and the example data.
library(olsrr)
data("iris")
model <- lm(Sepal.Length ~ ., data = iris)
ols_step_best_subset(model)
#> Best Subsets Regression
#> -----------------------------------------------------------
#> Model Index Predictors
#> -----------------------------------------------------------
#> 1 Petal.Length
#> 2 Sepal.Width Petal.Length
#> 3 Sepal.Width Petal.Length Species
#> 4 Sepal.Width Petal.Length Petal.Width Species
#> -----------------------------------------------------------
#>
#> Subsets Regression Summary
#> -----------------------------------------------------------------------------------------------------------------------------------
#> Adj. Pred
#> Model R-Square R-Square R-Square C(p) AIC SBIC SBC MSEP FPE HSP APC
#> -----------------------------------------------------------------------------------------------------------------------------------
#> 1 0.7600 0.7583 0.7532 114.5104 160.0404 -267.6979 169.0723 24.8565 0.1679 0.0011 0.2465
#> 2 0.8402 0.8380 0.834 29.4478 101.0255 -325.5037 113.0680 16.6628 0.1133 8e-04 0.1663
#> 3 0.8633 0.8595 0.8536 6.3448 81.5749 -346.0176 99.6387 14.3495 0.0989 7e-04 0.1442
#> 4 0.8673 0.8627 0.8554 4.0000 79.1160 -348.1523 100.1905 14.0259 0.0973 7e-04 0.1418
#> -----------------------------------------------------------------------------------------------------------------------------------
#> AIC: Akaike Information Criteria
#> SBIC: Sawa's Bayesian Information Criteria
#> SBC: Schwarz Bayesian Criteria
#> MSEP: Estimated error of prediction, assuming multivariate normality
#> FPE: Final Prediction Error
#> HSP: Hocking's Sp
#> APC: Amemiya Prediction Criteria
We can also use all possible subsets feature selection:
ols_step_all_possible(model)
#> Index N Predictors R-Square
#> 2 1 1 Petal.Length 0.75995465
#> 3 2 1 Petal.Width 0.66902769
#> 4 3 1 Species 0.61870573
#> 1 4 1 Sepal.Width 0.01382265
#> 5 5 2 Sepal.Width Petal.Length 0.84017784
#> 9 6 2 Petal.Length Species 0.83672378
#> 8 7 2 Petal.Length Petal.Width 0.76626130
#> 7 8 2 Sepal.Width Species 0.72590661
#> 6 9 2 Sepal.Width Petal.Width 0.70723708
#> 10 10 2 Petal.Width Species 0.66936637
#> 12 11 3 Sepal.Width Petal.Length Species 0.86330878
#> 11 12 3 Sepal.Width Petal.Length Petal.Width 0.85861172
#> 14 13 3 Petal.Length Petal.Width Species 0.83672544
#> 13 14 3 Sepal.Width Petal.Width Species 0.73238452
#> 15 15 4 Sepal.Width Petal.Length Petal.Width Species 0.86731226
#> Adj. R-Square Mallow's Cp
#> 2 0.758332718 114.510364
#> 3 0.666791387 213.189280
#> 4 0.613518054 267.801422
#> 1 0.007159294 924.253661
#> 5 0.838003384 29.447765
#> 9 0.833368794 33.196290
#> 8 0.763081179 109.666040
#> 7 0.720274553 153.461158
#> 6 0.703253908 173.722356
#> 10 0.662572526 214.821724
#> 12 0.859537986 6.344799
#> 11 0.855706481 11.442304
#> 14 0.832221316 35.194491
#> 13 0.725002020 148.430978
#> 15 0.862705049 4.000000
Created on 2020-10-07 by the reprex package (v0.3.0)
leaps only works with quantitative variables. This is specifically mentioned in Modern Applied Statistics with S by Venables and Ripley -- look up leaps in the index.
For stepwise selection based on drop and add which will include or exclude all levels of any factor use step or in MASS stepAIC.
fm <- lm(Sepal.Length ~., iris)
step(fm)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With