In real life, I try to predict the state of health of a vehicle based on numerous fault codes.
Some fault codes (levels of the factor) appear very often (1000+ times), while others occur only two or three times. Often, the very infrequent ones are "perfect predictors" of the health state (0 or 1). I am looking for a sound statistical way to decide which factor levels are good (significant) predictors, such that infrequent fault codes that are good predictors are not discarded based on their infrequency alone.
library(tidyverse)
# Simulated example: three frequent classes (A, D, E) with health = 1 about 40% of the time,
# and two rare classes (B, C) where health is always 1 (perfect predictors).
n_small = 4
n_big = 100
set.seed(567)
df_big_1 <- data.frame(class = rep("A", n_big),
                       health = rbinom(n = n_big, size = 1, prob = .4))
df_small_1 <- data.frame(class = rep("B", n_small),
                         health = rbinom(n = n_small, size = 1, prob = 1))
df_small_2 <- data.frame(class = rep("C", n_small),
                         health = rbinom(n = n_small, size = 1, prob = 1))
df_big_2 <- data.frame(class = rep("D", n_big),
                       health = rbinom(n = n_big, size = 1, prob = .4))
df_big_3 <- data.frame(class = rep("E", n_big),
                       health = rbinom(n = n_big, size = 1, prob = .4))
df_data <- rbind(df_big_1, df_small_1, df_big_2, df_small_2, df_big_3)
df_data <- df_data %>% mutate(class = factor(class))
df_data <- df_data %>% mutate(class = factor(class))
df_data %>%
group_by(class) %>%
summarise(N_health = sum(health), Mean = mean(health))
# A tibble: 5 × 3
class N_health Mean
<fct> <int> <dbl>
1 A 36 0.36
2 B 4 1
3 C 4 1
4 D 40 0.4
5 E 40 0.4
When I run a binary logistic regression on this simplified data set, I fail to retrieve the infrequent but "perfect" predictors:
regmod_01 <- glm(health ~ class, family = binomial, data = df_data)
summary(regmod_01)
Call:
glm(formula = health ~ class, family = binomial, data = df_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0108 -1.0108 -0.9448 1.3537 1.4294
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5754 0.2083 -2.762 0.00575 **
classB 17.1414 1199.7724 0.014 0.98860
classC 17.1414 1199.7724 0.014 0.98860
classD 0.1699 0.2917 0.583 0.56022
classE 0.1699 0.2917 0.583 0.56022
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 415.22 on 307 degrees of freedom
Residual deviance: 399.89 on 303 degrees of freedom
AIC: 409.89
Number of Fisher Scoring iterations: 15
Is there another way that I can try to separate the good from the bad predictors and include infrequent ones?
This is a classic instance of what is referred to in statistics as the Hauck-Donner effect.
From Ben Bolker of McMaster University (@BenBolker):
The Hauck-Donner effect occurs in cases of extreme parameter estimates (e.g. in the case of complete or near-complete separation), when the quadratic approximation is extremely poor: the hallmark is large parameter estimates (e.g. |β̂| > 10) and very large confidence intervals (leading to small Z statistics and large p values).
The infrequency of classB and classC is not the reason their p-values are far higher than you might expect. If you repeat your exact code with n_small set to 90 instead of 4, classB and classC still have p-values close to 1 (≈0.99).
n_small = 90
...
regmod_01 <- glm(health ~ class, family = binomial, data = df_data)
summary(regmod_01)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5754 0.2083 -2.762 0.00575 **
classB 20.1414 1133.5725 0.018 0.98582
classC 20.1414 1133.5725 0.018 0.98582
classD 0.1699 0.2917 0.583 0.56022
classE 0.1699 0.2917 0.583 0.56022
However, if you marginally decrease the probability for B and C from exactly 1 to 0.97 while keeping n_small at 90, the p-values change dramatically from almost 1.00 to highly significant (p < 0.001), because the Wald test starts functioning properly again.
n_small = 90
...
df_small_1 <- data.frame(class = rep("B", n_small),
health = rbinom(n = n_small, size = 1, prob = 0.97))
df_small_2 <- data.frame(class = rep("C", n_small),
health = rbinom(n = n_small, size = 1, prob = 0.97))
...
regmod_01 <- glm(health ~ class, family = binomial, data = df_data)
summary(regmod_01)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.754e-01 2.083e-01 -2.762 0.00575 **
classB 5.064e+00 1.027e+00 4.931 8.18e-07 ***
classC 3.943e+00 6.231e-01 6.328 2.49e-10 ***
classD 3.779e-15 2.946e-01 0.000 1.00000
classE 4.315e-02 2.938e-01 0.147 0.88323
The p-values in standard glm output come from the Wald test, which is known to be wildly unreliable in cases of extreme separation (perfect predictors). The likelihood ratio test (LRT) is a more suitable alternative in these cases, especially for binomial regression.
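For instance, here is a minimal sketch of the LRT route in base R (assuming the df_data and glm fit from the question; the object names are only illustrative):
regmod_glm  <- glm(health ~ class, family = binomial, data = df_data)
regmod_null <- glm(health ~ 1, family = binomial, data = df_data)
# Likelihood ratio test for the class factor as a whole
anova(regmod_null, regmod_glm, test = "Chisq")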
glm currently has no built-in Hauck-Donner effect detection. However, if we repeat this fit with VGAM::vglm, we see an appropriate warning message at the bottom:
library(VGAM)
regmod_01 <- VGAM::vglm(health ~ class, family = binomialff, data = df_data)
summary(regmod_01)
(output)
Warning: Hauck-Donner effect detected in the following estimate(s):
'classB', 'classC'
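Recent VGAM versions also let you query this check directly; a short sketch, assuming a version that exports hdeff():
# One logical per coefficient: TRUE where the Hauck-Donner effect is detected
hdeff(regmod_01)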
To answer your question: you could go with the factor levels that have a significant p-value, as well as any levels with an extreme coefficient estimate (e.g. |estimate| > 10), which may not register as significant on the Wald test but in reality are highly significant.
Keep in mind that the p-values for each factor level are in reference to the intercept (reference) level under dummy contrast coding, so depending on which level you choose as the reference, significance and odds ratios may vary widely. It is better to use a likelihood ratio test to determine the significance of the whole variable, not individual levels, in the regression model.
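As a sketch of both points (again reusing df_data from the question; object names are illustrative): refitting with a different reference level changes the per-level Wald results, while drop1() gives a likelihood ratio test for the class term as a whole, which does not depend on the reference level.
regmod_glm  <- glm(health ~ class, family = binomial, data = df_data)
# Same model with "D" as the reference level: per-level contrasts change
regmod_refD <- glm(health ~ relevel(class, ref = "D"),
                   family = binomial, data = df_data)
summary(regmod_refD)
# LRT for dropping the whole class factor
drop1(regmod_glm, test = "Chisq")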
I think you may have a bit of a misunderstanding regarding what the p-value is telling you. It measures the strength of evidence against a null hypothesis, not how useful a predictor is. A variable could explain essentially nothing about the outcome and still be statistically significant (say p = .01) given a large enough sample, and conversely a genuinely strong but rare predictor can fail to reach significance. Statistical significance and predictive importance are not the same thing.
I'm not sure how committed you are to that analysis method or to the commonly misunderstood p-value. I would look at survival ensembles (if you have a time component) or decision-tree-based classifiers.
For example, if I make your outcome variable a factor and use caret with randomForest, it tells me that classes B and C are the most influential (importance) in your dataset; varImp() provides this information.
For example (using the output below), the mean decrease in accuracy for class B is 16.306: if you exclude this class, accuracy will decrease by over 16%. Classes E and D have negative values, which indicates that if you didn't include these classes, your accuracy would improve.
For this to work, you need to install the randomForest package, but you don't need to load it with library().
Check it out:
library(caret)
df_data$health <- factor(df_data$health)  # classification needs a factor outcome
set.seed(3253)
tr <- createDataPartition(df_data$health, p = .7, list = F)  # 70% training split
cfit <- train(health ~ ., data = df_data[tr, ],              # train() defaults to method = "rf"
              importance = T, keep.forest = T)
varImp(cfit, scale = F) # mean decrease in accuracy
# rf variable importance
#
# Importance
# classB 16.306
# classC 11.287
# classE -0.967
# classD -2.036
varImp(cfit, type = 2) # mean increase in purity
# rf variable importance
#
# Overall
# classB 100.00
# classC 57.21
# classD 0.60
# classE 0.00
There are a lot of decision-tree-based methods of analysis. One benefit of a random forest is that it makes almost no formal statistical assumptions and is highly resistant to overfitting. The same can be said for many decision-tree-based methods, but not all.