Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Generate data frame from array for logistic regression

Tags:

r

Say you have an array like

dat <-  array(c(126, 100, 35, 61, 908, 688, 497, 807, 913, 747, 336, 598, 235, 172, 58, 121,402, 308, 121, 215, 182, 156, 72, 98, 60, 99, 11, 43, 104, 89, 21, 36), dim = c(2, 2, 8),dimnames = list(a = c(1, 0), b = c(1, 0), c = 1:8))


> > dat
, , c = 1

   b
a     1  0
  1 126 35
  0 100 61

, , c = 2

   b
a     1   0
  1 908 497
  0 688 807

, , c = 3

   b
a     1   0
  1 913 336
  0 747 598

, , c = 4

   b
a     1   0
  1 235  58
  0 172 121

, , c = 5

   b
a     1   0
  1 402 121
  0 308 215

, , c = 6

   b
a     1  0
  1 182 72
  0 156 98

, , c = 7

   b
a    1  0
  1 60 11
  0 99 43

, , c = 8

   b
a     1  0
  1 104 21
  0  89 36

and you want to fit logistic regression to predict a. Is there a simple way to generate a data frame from this array to use in glm? ie a data frame like

a b c
1 1 1 for 126 rows then
...
0 1 1 for 100 rows, etc.

Basically I need to get data to fit logistic regression when given the table with counts. It seems like there should be a simple way of doing it without manually generating the data.

thanks

like image 945
sayhey69 Avatar asked Dec 05 '25 04:12

sayhey69


2 Answers

One way would be to start with the melt function in the reshape2 package:

library(reshape2)

datM <- melt(dat)
head(datM, 2)
#   a b c value
# 1 1 1 1   126
# 2 0 1 1   100

Then dcast that data to get the numbers of outcomes on one row:

dat2 <- dcast(datM, b + c ~ a)
head(dat2, 2)
#   b c   0   1
# 1 0 1  61  35
# 2 0 2 807 497

You can then use this data to perform a glm where the response is a 2-column matrix giving the numbers of successes and failures:

response <- as.matrix(dat2[, c(4, 3)])
bb <- dat2[, "b"]
cc <- dat2[, "c"]
glm1 <- glm(response ~ bb + cc, family = binomial(link = "logit"))

However, the model degrees of freedom (and log-likelihood, etc.) will not reflect the data structure you asked for in your question. To get the specific data structure you were aiming for, you could go back to the datM object.

EDIT:

The following loops over all columns of datM except for the value column, repeating the values datM$value times:

datRep <- lapply(datM[-grep("value", names(datM))], rep, times = datM$value)

Then cbind that back into a matrix and convert to data.frame to get the data structure you wanted:

dat3 <- as.data.frame(do.call(cbind, datRep))

glm2 <- glm(a ~ b + c, data = dat3, family = binomial(link = "logit"))

The coefficients of the two models are the same:

> coef(glm1)
(Intercept)          bb          cc 
-0.43854838  0.77039283 -0.03328575 
> coef(glm2)
(Intercept)           b           c 
-0.43854838  0.77039283 -0.03328575 

But, as mentioned, the degrees of freedom, etc will not be:

> glm1$deviance
[1] 29.39535
> glm2$deviance
[1] 11381.87
like image 191
BenBarnes Avatar answered Dec 07 '25 19:12

BenBarnes


Ugly as sin, but does what you need for this example.

dat1 <- data.frame(value = as.vector(dat),
    a=dimnames(dat)$a,
    b=rep(dimnames(dat)$b, each=length(dimnames(dat)$a)),
    c=rep(dimnames(dat)$c, each=length(dimnames(dat)$a)*length(dimnames(dat)$b)))

Better would be to use melt, as in @BenBarnes answer. This is more flexible and avoids the creation of factors.

dat1 <- melt(dat)

Then to get the expanded rows, you can use rep

dat2 <- data.frame(a=rep(dat1$a, dat1$value),
                   b=rep(dat1$b, dat1$value),
                   c=rep(dat1$c, dat1$value))
like image 26
Matthew Lundberg Avatar answered Dec 07 '25 19:12

Matthew Lundberg



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!