
How to do logistic regression on summary data in R?

So I have some data that is structured similarly to the following:

MarriageStatus | Works | DoesNotWork |
--------------------------------------
Unmarried      | 130   | 235         |
Married        | 10    | 95          |

I'm trying to use logistic regression to predict work status from marriage status, but I don't understand how to do this in R. For example, if my data looked like the following:

MarriageStatus  | WorkStatus| 
-----------------------------
Married         | No        |
Married         | No        |
Married         | Yes       |
Unmarried       | No        |
Unmarried       | Yes       |
Unmarried       | Yes       |

I understand that I could do the following:

log_model <- glm(WorkStatus ~ MarriageStatus, data=MarriageDF, family=binomial(logit))

When the data is summarized, though, I don't understand how to do this. Do I need to expand the data into a non-summarized form, encoding Married/Unmarried as 0/1 and Working/Not Working as 0/1?

Given only the first summary DF, how would I write the logistic regression glm function? Something like this?

log_summary_model <- glm(Works ~ DoesNotWork, data=summaryDF, family=binomial(logit))

But that doesn't make sense, since it splits the response (dependent) variable across both sides of the formula.

I'm not sure if I'm overcomplicating this. Any help would be greatly appreciated, thanks!

ocean800 asked Oct 20 '25 10:10


2 Answers

Reshape the contingency table into a long data frame with one row per cell (see "Data" below). A logit model can then be fit using the cell counts as frequency weights:

mod <- glm(works ~ marriage, data = df, family = binomial, weights = freq)
summary(mod)

Call:
glm(formula = works ~ marriage, family = binomial, data = df, 
    weights = freq)

Deviance Residuals: 
      1        2        3        4  
 16.383    6.858  -14.386   -4.361  

Coefficients:
            Estimate Std. Error z value Pr(>|z|)    
(Intercept)  -0.5921     0.1093  -5.416 6.08e-08 ***
marriage     -1.6592     0.3500  -4.741 2.12e-06 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 572.51  on 3  degrees of freedom
Residual deviance: 541.40  on 2  degrees of freedom
AIC: 545.4

Number of Fisher Scoring iterations: 5
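
As a sanity check on the output above, the two coefficients can be reproduced by hand from the four cell counts. With a single binary predictor, the intercept is the log odds of working in the baseline group (marriage = 0, i.e. unmarried) and the slope is the log odds ratio between the groups. A minimal sketch; the variable names are made up for illustration:

```r
# Cell counts taken from the summary table in the question
unmarried_works <- 130; unmarried_not <- 235
married_works   <- 10;  married_not   <- 95

# Intercept: log odds of working for the baseline group (unmarried)
intercept <- log(unmarried_works / unmarried_not)

# Slope: log odds ratio of working, married vs. unmarried
slope <- log((married_works / married_not) / (unmarried_works / unmarried_not))

round(c(intercept = intercept, slope = slope), 4)

# Odds ratio: married respondents have roughly 0.19 times the odds of working
exp(slope)
```

The rounded values agree with the Estimate column (-0.5921 and -1.6592).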

Data:

df <- read.table(text = "works marriage freq
                 1 0 130
                 1 1 10
                 0 0 235
                 0 1 95", header = TRUE)
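
An alternative that avoids the weights argument, sketched here under the assumption that the counts are kept one row per marriage group, as in the question's summary table: `glm()` accepts a two-column matrix of successes and failures as a binomial response. The data frame name `summary_df` and its column names are illustrative, not from the original post.

```r
# One row per group; names are illustrative
summary_df <- data.frame(
  marriage    = c(0, 1),      # 0 = unmarried, 1 = married
  works       = c(130, 10),   # "successes"
  doesnotwork = c(235, 95)    # "failures"
)

# Two-column response: cbind(successes, failures)
mod_grouped <- glm(cbind(works, doesnotwork) ~ marriage,
                   data = summary_df, family = binomial)
coef(mod_grouped)
```

The coefficient estimates and standard errors should match the weighted fit. The reported deviance and AIC will differ, because the likelihood is computed over two grouped rows instead of four weighted rows.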
Ritchie Sacramento answered Oct 21 '25 23:10


This should do it for you.

library(dplyr)
library(tibble)

# Load data
MarriageDF <- tribble(
  ~'MarriageStatus',  ~'WorkStatus', 
   'Married',  'No',
   'Married',  'No',
   'Married',  'Yes',
   'Unmarried',  'No',
   'Unmarried',  'Yes',
   'Unmarried',  'Yes') %>% 
  mutate(WorkStatus = as.integer(WorkStatus == 'Yes')) # encode Yes as 1, No as 0

log_model <- glm(WorkStatus ~ MarriageStatus, data = MarriageDF, family = 'binomial')
summary(log_model)

Edit: I believe I originally read a previous version of the question.

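Yes, you can 'expand' the data, or format it so that it is tidy (one observation per row). That said, expanding is not strictly required: glm() also accepts grouped binomial data, either as a two-column response of the form cbind(successes, failures) or as a 0/1 response with the counts passed as weights.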
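
A minimal sketch of the 'expanding' step, assuming the counts are stored in long form with a frequency column (the same layout as the `df` in the other answer; the names `counts` and `expanded` are illustrative). Each row is repeated `freq` times using base R indexing:

```r
# Long-form counts: one row per cell of the 2x2 table
counts <- data.frame(
  works    = c(1, 1, 0, 0),
  marriage = c(0, 1, 0, 1),
  freq     = c(130, 10, 235, 95)
)

# Repeat each row freq times to get one row per observation
expanded <- counts[rep(seq_len(nrow(counts)), times = counts$freq),
                   c("works", "marriage")]
rownames(expanded) <- NULL

nrow(expanded)  # 470 observations in total

# Ordinary logistic regression on the one-row-per-observation data
mod_expanded <- glm(works ~ marriage, data = expanded, family = binomial)
coef(mod_expanded)
```

For a table this small the expansion is cheap; with very large counts, fitting on the summarized form uses far less memory.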

babylinguist answered Oct 21 '25 23:10
