Say you have an array like
dat <- array(c(126, 100, 35, 61, 908, 688, 497, 807, 913, 747, 336, 598, 235, 172, 58, 121,402, 308, 121, 215, 182, 156, 72, 98, 60, 99, 11, 43, 104, 89, 21, 36), dim = c(2, 2, 8),dimnames = list(a = c(1, 0), b = c(1, 0), c = 1:8))
> > dat
, , c = 1
b
a 1 0
1 126 35
0 100 61
, , c = 2
b
a 1 0
1 908 497
0 688 807
, , c = 3
b
a 1 0
1 913 336
0 747 598
, , c = 4
b
a 1 0
1 235 58
0 172 121
, , c = 5
b
a 1 0
1 402 121
0 308 215
, , c = 6
b
a 1 0
1 182 72
0 156 98
, , c = 7
b
a 1 0
1 60 11
0 99 43
, , c = 8
b
a 1 0
1 104 21
0 89 36
and you want to fit logistic regression to predict a. Is there a simple way to generate a data frame from this array to use in glm? ie a data frame like
a b c
1 1 1 for 126 rows then
...
0 1 1 for 100 rows, etc.
Basically I need to get data to fit logistic regression when given the table with counts. It seems like there should be a simple way of doing it without manually generating the data.
thanks
One way would be to start with the melt function in the reshape2 package:
library(reshape2)
datM <- melt(dat)
head(datM, 2)
# a b c value
# 1 1 1 1 126
# 2 0 1 1 100
Then dcast that data to get the numbers of outcomes on one row:
dat2 <- dcast(datM, b + c ~ a)
head(dat2, 2)
# b c 0 1
# 1 0 1 61 35
# 2 0 2 807 497
You can then use this data to perform a glm where the response is a 2-column matrix giving the numbers of successes and failures:
response <- as.matrix(dat2[, c(4, 3)])
bb <- dat2[, "b"]
cc <- dat2[, "c"]
glm1 <- glm(response ~ bb + cc, family = binomial(link = "logit"))
However, the model degrees of freedom (and log-likelihood, etc.) will not reflect the data structure you asked for in your question. To get the specific data structure you were aiming for, you could go back to the datM object.
EDIT:
The following loops over all columns of datM except for the value column, repeating the values datM$value times:
datRep <- lapply(datM[-grep("value", names(datM))], rep, times = datM$value)
Then cbind that back into a matrix and convert to data.frame to get the data structure you wanted:
dat3 <- as.data.frame(do.call(cbind, datRep))
glm2 <- glm(a ~ b + c, data = dat3, family = binomial(link = "logit"))
The coefficients of the two models are the same:
> coef(glm1)
(Intercept) bb cc
-0.43854838 0.77039283 -0.03328575
> coef(glm2)
(Intercept) b c
-0.43854838 0.77039283 -0.03328575
But, as mentioned, the degrees of freedom, etc will not be:
> glm1$deviance
[1] 29.39535
> glm2$deviance
[1] 11381.87
Ugly as sin, but does what you need for this example.
dat1 <- data.frame(value = as.vector(dat),
a=dimnames(dat)$a,
b=rep(dimnames(dat)$b, each=length(dimnames(dat)$a)),
c=rep(dimnames(dat)$c, each=length(dimnames(dat)$a)*length(dimnames(dat)$b)))
Better would be to use melt, as in @BenBarnes answer. This is more flexible and avoids the creation of factors.
dat1 <- melt(dat)
Then to get the expanded rows, you can use rep
dat2 <- data.frame(a=rep(dat1$a, dat1$value),
b=rep(dat1$b, dat1$value),
c=rep(dat1$c, dat1$value))
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With