GLM R vs. Python

I am attempting to fit a logistic regression in Python that produces the same results as in R. The results are close, but not the same. I made up the following example to illustrate that a difference exists; the data is not real.

R

# RStudio 1.1.453

d <- data.frame(c(0, 0, 1, 1, 1),
                c(1, 0, 0, 0, 0),
                c(0, 1, 0, 0, 0))

colnames(d) <- c("v1", "v2", "v3")

model <- glm(v1 ~ v2,
             data = d,
             family = "binomial")


summary(model)

R Output

Call:
glm(formula = v1 ~ v2, family = "binomial", data = d)

Deviance Residuals: 
       1         2         3         4         5  
-1.66511  -0.00013   0.75853   0.75853   0.75853  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)
(Intercept)    1.099      1.155   0.951    0.341
v2           -19.665   6522.639  -0.003    0.998

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 6.7301  on 4  degrees of freedom
Residual deviance: 4.4987  on 3  degrees of freedom
AIC: 8.4987

Number of Fisher Scoring iterations: 17

Python

# Python 3.7.1

import pandas as pd # 0.23.4
import statsmodels.api as sm # 0.9.0
import statsmodels.formula.api as smf # 0.9.0

d = pd.DataFrame({"v1" : [0, 0, 1, 1, 1],
                  "v2" : [1, 0, 0, 0, 0],
                  "v3" : [0, 1, 0, 0, 0]})

model = smf.glm(formula = "v1 ~ v2",
               family=sm.families.Binomial(link = sm.genmod.families.links.logit),
               data=d
               ).fit()

model.summary()

Python Output

                 Generalized Linear Model Regression Results                  
==============================================================================
Dep. Variable:                     v1   No. Observations:                    5
Model:                            GLM   Df Residuals:                        3
Model Family:                Binomial   Df Model:                            1
Link Function:                  logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -2.2493
Date:                Wed, 07 Nov 2018   Deviance:                       4.4987
Time:                        15:17:52   Pearson chi2:                     4.00
No. Iterations:                    19   Covariance Type:             nonrobust
==============================================================================
                 coef    std err          z      P>|z|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.0986      1.155      0.951      0.341      -1.165       3.362
v2           -21.6647   1.77e+04     -0.001      0.999   -3.48e+04    3.47e+04
==============================================================================

There is a difference in the number of iterations and in the v2 coefficient. From what I can tell, the two libraries may use different convergence criteria, but I don't understand the details. Is there some other setting I might be missing?

asked by ackshooairy


1 Answer

At a guess, the two implementations make different tradeoffs with regard to numerical stability.

The variance of the v2 estimate is enormous, which is probably what is making both of them struggle: the made-up data is quasi-separated (v1 is 0 whenever v2 is 1), so the maximum-likelihood estimate of the v2 coefficient is effectively minus infinity. I'd say they have basically given the same answer, at least to the limits available with double-precision arithmetic.
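
A quick way to see this, using only the data from the question (standard pandas, nothing exotic):

import pandas as pd

d = pd.DataFrame({"v1": [0, 0, 1, 1, 1],
                  "v2": [1, 0, 0, 0, 0]})

# The single row with v2 = 1 has v1 = 0 (quasi-complete separation), so the
# likelihood keeps improving as the v2 coefficient heads toward -infinity;
# each fitter only stops once its own convergence rule fires.
print(pd.crosstab(d["v1"], d["v2"]))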

The R implementation lets you pass a control parameter that traces the fit:

> options(digits=12)
> model <- glm(v1 ~ v2, data=d, family="binomial", control=list(trace=T))
Deviance = 4.67724333758 Iterations - 1
Deviance = 4.5570420311 Iterations - 2
Deviance = 4.51971688994 Iterations - 3
Deviance = 4.50636401333 Iterations - 4
Deviance = 4.50150009179 Iterations - 5
Deviance = 4.49971718523 Iterations - 6
Deviance = 4.49906215541 Iterations - 7
Deviance = 4.49882130019 Iterations - 8
Deviance = 4.4987327103 Iterations - 9
Deviance = 4.49870012203 Iterations - 10
Deviance = 4.49868813377 Iterations - 11
Deviance = 4.49868372357 Iterations - 12
Deviance = 4.49868210116 Iterations - 13
Deviance = 4.4986815043 Iterations - 14
Deviance = 4.49868128473 Iterations - 15
Deviance = 4.49868120396 Iterations - 16
Deviance = 4.49868117424 Iterations - 17

which shows how the deviance converges; I couldn't find anything similar in the Python code.

Seeing the above output suggests the two could also be using different cutoffs for declaring convergence: R's glm.control defaults to epsilon = 1e-8 on the relative change in deviance. The statsmodels equivalents, as far as I can tell, are the tol and maxiter arguments of fit, sketched below.
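
A minimal sketch of the statsmodels side, assuming the tol and maxiter arguments of GLM.fit behave as documented (worth checking against your 0.9.0 install):

# Continuing from the question's setup: d, sm and smf are already defined.
model = smf.glm(formula="v1 ~ v2",
                family=sm.families.Binomial(),
                data=d).fit(method="IRLS",   # the default fitting method
                            maxiter=100,     # cap on IRLS iterations
                            tol=1e-8)        # convergence tolerance
print(model.summary())

Tightening tol mostly just buys more iterations and an even more negative v2 estimate; with quasi-separated data there is no finite value for it to converge to.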

answered by Sam Mason


