I am using the glmulti package in R to try to run an all-subsets regression on some data. I have 51 predictors, each with a maximum of 276 observations. I realize that the exhaustive and genetic-algorithm approaches cannot cope with this many variables, as I receive the following:
Warning message:
In glmulti(y = "Tons_N", data = MDatEB1_TonsN, level = 1, method = "h", :
!Too many predictors.
Given these requirements (many variables with many observations), how many predictors will I be able to use in a single run of the all-subsets regression? I am looking into variable-elimination techniques, but I would like to use as many variables as possible at this stage of the analysis; that is, I want to use the results of this analysis to make variable-elimination decisions. Is there another package that can process more variables at a time?
Here is the code I am using. Unfortunately, because of the confidentiality associated with the project, I cannot attach the datasets.
TonsN_AllSubset <- glmulti(Tons_N ~ ., data = MDatEB1_TonsN, level = 1,
                           method = "h", crit = "aic", confsetsize = 20,
                           plotty = T, report = T, fitfunction = "glm")
I am relatively new to this package and modeling in general. Any direction or advice will be greatly appreciated. Thank you!
glmulti is not restricted by the number of predictors, but by the number of candidate models.
By setting the argument method = "d", glmulti will only compute the number of candidate models. This takes considerably less time than an actual run with method = "h" or method = "g". If the number of predictors is too high, you will get the same error message, so diagnostic runs let you find the maximum number of predictors glmulti can handle within a reasonable computing time.
However, keep in mind that the maximum number of possible predictors depends strongly on whether you allow for interactions or not.
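For example, a quick diagnostic run could look like the following (a sketch reusing the formula and data names from the question):

# method = "d" only reports the size of the candidate set; nothing is fitted
glmulti(Tons_N ~ ., data = MDatEB1_TonsN, level = 1, method = "d",
        crit = "aic", fitfunction = "glm")
# with level = 2, all pairwise interactions enter the candidate set; the
# diagnostic will show how quickly the count explodes (or hit the same
# "Too many predictors" limit)
glmulti(Tons_N ~ ., data = MDatEB1_TonsN, level = 2, method = "d",
        crit = "aic", fitfunction = "glm")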
Furthermore, you can limit the number of candidate models by restricting the number of predictors per model (e.g. minsize = 0, maxsize = 1), by excluding specific predictors (exclude = c(...)), or by excluding terms in the model formula (y ~ a + b + c - a:b - 1 excludes the intercept and the interaction a:b). You will find even more options for limiting the number of candidate models in the package documentation, glmulti.pdf.
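Put together, a constrained exhaustive run might look like this (a sketch; the predictor names passed to exclude are hypothetical placeholders):

TonsN_AllSubset <- glmulti(Tons_N ~ ., data = MDatEB1_TonsN, level = 1,
                           method = "h", crit = "aic", confsetsize = 20,
                           minsize = 1, maxsize = 10,      # only models with 1 to 10 terms
                           exclude = c("PredA", "PredB"),  # hypothetical predictors to drop
                           fitfunction = "glm")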
The glmnet package provides the facilities to do penalized modeling without the statistically flawed strategy of stepwise selection. (There seems to be widespread acceptance of the fallacious argument that using AIC protects one from the problems of multiple comparisons.) It is incredibly easy to "find" statistically significant relations where there are none.
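A minimal sketch of the penalized approach, assuming the response and data names from the question (glmnet expects a numeric predictor matrix rather than a formula):

library(glmnet)

x <- model.matrix(Tons_N ~ ., data = MDatEB1_TonsN)[, -1]  # predictor matrix, intercept column dropped
y <- MDatEB1_TonsN$Tons_N

# lasso (alpha = 1) with the penalty chosen by 10-fold cross-validation
cvfit <- cv.glmnet(x, y, alpha = 1)
coef(cvfit, s = "lambda.1se")  # coefficients shrunk to zero are effectively eliminated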
This is the result of applying BabakP's suggestion to a random set of predictors. Every column here is pure noise, yet stepwise selection retains 14 predictors, several with p-values below 0.01:
# 276 observations, 51 columns: a binary "response" in column 1 and
# 50 predictors that are pure noise (no seed is set, so a rerun will
# select a different subset of predictors)
pseudodata = data.frame(matrix(NA, nrow = 276, ncol = 51))
pseudodata[, 1] = rbinom(nrow(pseudodata), 1, 0.3)
n1 = length(which(pseudodata[, 1] == 1))
n0 = length(which(pseudodata[, 1] == 0))
for (i in 2:ncol(pseudodata)) {
  # fill each predictor with standard-normal noise, unrelated to the response
  pseudodata[, i] = ifelse(pseudodata[, 1] == 1, rnorm(n1), rnorm(n0))
}
# gaussian glm of the binary response on all 50 noise predictors,
# followed by bidirectional stepwise selection
model = glm(pseudodata[, 1] ~ ., data = pseudodata[-1])
stepwise.model = step(model, direction = "both", trace = FALSE)
> summary(stepwise.model)

Call:
glm(formula = pseudodata[, 1] ~ X4 + X6 + X10 + X17 + X21 + X23 +
    X25 + X29 + X32 + X37 + X41 + X48 + X50 + X19, data = pseudodata[-1])

Deviance Residuals:
     Min       1Q   Median       3Q      Max
 -0.6992  -0.2943  -0.1154   0.3663   0.9833

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.25674    0.02561  10.025  < 2e-16 ***
X4          -0.03573    0.02394  -1.493  0.136727
X6          -0.05045    0.02608  -1.934  0.054141 .
X10          0.05873    0.02744   2.141  0.033235 *
X17         -0.06325    0.02520  -2.510  0.012668 *
X21          0.06420    0.02504   2.564  0.010906 *
X23         -0.04961    0.02845  -1.744  0.082353 .
X25          0.03863    0.02517   1.535  0.126035
X29          0.04889    0.02381   2.054  0.041020 *
X32         -0.03669    0.02509  -1.462  0.144841
X37          0.09682    0.02507   3.862  0.000142 ***
X41         -0.05253    0.02676  -1.963  0.050704 .
X48         -0.06660    0.02279  -2.922  0.003782 **
X50         -0.06955    0.02624  -2.651  0.008517 **
X19         -0.04090    0.02701  -1.514  0.131137
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for gaussian family taken to be 0.1674429)

    Null deviance: 55.072  on 275  degrees of freedom
Residual deviance: 43.703  on 261  degrees of freedom
AIC: 306.59

Number of Fisher Scoring iterations: 2