In real life, I try to predict the state of health of a vehicle based on numerous fault codes.
Some fault codes (levels of the factor) appear very often (1000+ times), while others occur only two or three times. Often, the very infrequent ones are "perfect predictors" of the health state (0 or 1). I am looking for a sound statistical way to decide which factor levels are good (significant) predictors, such that infrequent fault codes that are good predictors are not discarded based on their infrequency alone.
library(tidyverse)
# Simulated example: three frequent classes (A, D, E) with health = 1 about 40% of the time,
# and two rare classes (B, C) where health is always 1 (perfect predictors).
n_small = 4
n_big = 100
set.seed(567)
df_big_1 <- data.frame(class = rep("A", n_big),
                       health = rbinom(n = n_big, size = 1, prob = .4))
df_small_1 <- data.frame(class = rep("B", n_small),
                         health = rbinom(n = n_small, size = 1, prob = 1))
df_small_2 <- data.frame(class = rep("C", n_small),
                         health = rbinom(n = n_small, size = 1, prob = 1))
df_big_2 <- data.frame(class = rep("D", n_big),
                       health = rbinom(n = n_big, size = 1, prob = .4))
df_big_3 <- data.frame(class = rep("E", n_big),
                       health = rbinom(n = n_big, size = 1, prob = .4))
df_data <- rbind(df_big_1, df_small_1, df_big_2, df_small_2, df_big_3)
df_data <- df_data %>% mutate(class = factor(class))
df_data <- df_data %>% mutate(class = factor(class))
df_data %>%
group_by(class) %>%
summarise(N_health = sum(health), Mean = mean(health))
# A tibble: 5 × 3
class N_health Mean
<fct> <int> <dbl>
1 A 36 0.36
2 B 4 1
3 C 4 1
4 D 40 0.4
5 E 40 0.4
When I run a binary logistic regression on this simplified data set, I fail to retrieve the infrequent but "perfect" predictors:
regmod_01 <- glm(health ~ class, family = binomial, data = df_data)
summary(regmod_01)
Call:
glm(formula = health ~ class, family = binomial, data = df_data)
Deviance Residuals:
Min 1Q Median 3Q Max
-1.0108 -1.0108 -0.9448 1.3537 1.4294
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5754 0.2083 -2.762 0.00575 **
classB 17.1414 1199.7724 0.014 0.98860
classC 17.1414 1199.7724 0.014 0.98860
classD 0.1699 0.2917 0.583 0.56022
classE 0.1699 0.2917 0.583 0.56022
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 415.22 on 307 degrees of freedom
Residual deviance: 399.89 on 303 degrees of freedom
AIC: 409.89
Number of Fisher Scoring iterations: 15
Is there another way that I can try to separate the good from the bad predictors and include infrequent ones?
This is a classic instance of what is referred to in statistics as the Hauck-Donner effect.
From Ben Bolker of McMaster University (@BenBolker):
The Hauck-Donner effect occurs in cases of extreme parameter estimates (e.g. in the case of complete or near-complete separation), when the quadratic approximation is extremely poor: the hallmark is large parameter estimates (e.g. |β̂| > 10) and very large confidence intervals (leading to small Z statistics and large p values).
The infrequency of classB and classC is not the reason their p-values are far higher than you might expect. If you repeat your exact code with n_small set to 90 instead of 4, classB and classC still have p-values close to 1 (≈0.99).
n_small = 90
...
regmod_01 <- glm(health ~ class, family = binomial, data = df_data)
summary(regmod_01)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) -0.5754 0.2083 -2.762 0.00575 **
classB 20.1414 1133.5725 0.018 0.98582
classC 20.1414 1133.5725 0.018 0.98582
classD 0.1699 0.2917 0.583 0.56022
classE 0.1699 0.2917 0.583 0.56022
However, if you marginally decrease the probability for B and C from exactly 1 to 0.97 while keeping n_small at 90, the p-values change dramatically from almost 1.00 to highly significant (p < 0.001), because the Wald test starts functioning properly again.
n_small = 90
...
df_small_1 <- data.frame(class = rep("B", n_small),
health = rbinom(n = n_small, size = 1, prob = 0.97))
df_small_2 <- data.frame(class = rep("C", n_small),
health = rbinom(n = n_small, size = 1, prob = 0.97))
...
regmod_01 <- glm(health ~ class, family = binomial, data = df_data)
summary(regmod_01)
Estimate Std. Error z value Pr(>|z|)
(Intercept) -5.754e-01 2.083e-01 -2.762 0.00575 **
classB 5.064e+00 1.027e+00 4.931 8.18e-07 ***
classC 3.943e+00 6.231e-01 6.328 2.49e-10 ***
classD 3.779e-15 2.946e-01 0.000 1.00000
classE 4.315e-02 2.938e-01 0.147 0.88323
The p-values in standard glm output come from the Wald test, which is known to be wildly unreliable in cases of extreme separation (perfect predictors). The likelihood ratio test (LRT) is a more suitable alternative in these cases, especially for binomial regression.
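For instance, here is a minimal sketch of the LRT route in base R (assuming the df_data and glm fit from the question; the object names are only illustrative):
regmod_glm  <- glm(health ~ class, family = binomial, data = df_data)
regmod_null <- glm(health ~ 1, family = binomial, data = df_data)
# Likelihood ratio test for the class factor as a whole
anova(regmod_null, regmod_glm, test = "Chisq")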
glm currently has no built-in Hauck-Donner effect detection. However, if we repeat this fit with VGAM::vglm, we see an appropriate warning message at the bottom:
library(VGAM)
regmod_01 <- VGAM::vglm(health ~ class, family = binomialff, data = df_data)
summary(regmod_01)
(output)
Warning: Hauck-Donner effect detected in the following estimate(s):
'classB', 'classC'
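Recent VGAM versions also let you query this check directly; a short sketch, assuming a version that exports hdeff():
# One logical per coefficient: TRUE where the Hauck-Donner effect is detected
hdeff(regmod_01)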
To answer your question: you could go with the factor levels that have a significant p-value, as well as any levels with an extreme coefficient estimate (e.g. |estimate| > 10), which may not register as significant on the Wald test but in reality are highly significant.
Keep in mind that the p-values for each factor level are in reference to the intercept (reference) level under dummy contrast coding, so depending on which level you choose as the reference, significance and odds ratios may vary widely. It is better to use a likelihood ratio test to determine the significance of the whole variable, not individual levels, in the regression model.
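As a sketch of both points (again reusing df_data from the question; object names are illustrative): refitting with a different reference level changes the per-level Wald results, while drop1() gives a likelihood ratio test for the class term as a whole, which does not depend on the reference level.
regmod_glm  <- glm(health ~ class, family = binomial, data = df_data)
# Same model with "D" as the reference level: per-level contrasts change
regmod_refD <- glm(health ~ relevel(class, ref = "D"),
                   family = binomial, data = df_data)
summary(regmod_refD)
# LRT for dropping the whole class factor
drop1(regmod_glm, test = "Chisq")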
I think you may have a bit of a misunderstanding regarding what the p-value is telling you. It measures the strength of evidence against a null hypothesis, not how useful a predictor is. A variable could explain essentially nothing about the outcome and still be statistically significant (say p = .01) given a large enough sample, and conversely a genuinely strong but rare predictor can fail to reach significance. Statistical significance and predictive importance are not the same thing.
I'm not sure how committed you are to that analysis method or to the commonly misunderstood p-value. I would look at survival ensembles (if you have a time component) or decision-tree-based classifiers.
For example, if I make your outcome variable a factor and use caret with randomForest, it tells me that classes B and C are the most influential (importance) in your dataset; varImp() provides this information.
For example (using the output below), the mean decrease in accuracy for class B is 16.306: if you exclude this class, accuracy will decrease by over 16%. Classes E and D have negative values, which indicates that if you didn't include these classes, your accuracy would improve.
For this to work, you need to install the randomForest package, but you don't need to load it with library().
Check it out:
library(caret)
df_data$health <- factor(df_data$health)  # classification needs a factor outcome
set.seed(3253)
tr <- createDataPartition(df_data$health, p = .7, list = F)  # 70% training split
cfit <- train(health ~ ., data = df_data[tr, ],              # train() defaults to method = "rf"
              importance = T, keep.forest = T)
varImp(cfit, scale = F) # mean decrease in accuracy
# rf variable importance
#
# Importance
# classB 16.306
# classC 11.287
# classE -0.967
# classD -2.036
varImp(cfit, type = 2) # mean increase in purity
# rf variable importance
#
# Overall
# classB 100.00
# classC 57.21
# classD 0.60
# classE 0.00
There are a lot of decision-tree-based methods of analysis. One benefit of a random forest is that it makes almost no formal statistical assumptions and is highly resistant to overfitting. The same can be said for many decision-tree-based methods, but not all.