I have a dataframe with the following dimensions:
dim(b)
[1] 974 433685
The columns represent variables that I want to run ANOVAs on (i.e., I want to run 433,685 ANOVAs). Sample size is 974. The last column is the 'group' variable.
I've come up with 3 different methods, but all are too slow due to the number of tests.
First, let's generate a small practice dataset to play with:
dat = as.data.frame(matrix(runif(10000*500), ncol = 10000, nrow = 500))
dat$group = rep(letters[1:10], 5000)
Method 1 (based on 'sapply'):
system.time(sapply(dat[,-length(dat)], function(x) aov(x~group, data=dat) ))
user system elapsed
143.76 0.33 151.79
Methods 2 (based on 'mclapply' from 'parallel' package):
library(parallel)
options(mc.cores=3)
system.time(mclapply(dat[,-length(dat)], function(x) aov(x~group, data=dat) ))
user system elapsed
141.76 0.21 142.58
Methods 3 (based on 'cbind'-ing the LHS):
formula = as.formula( paste0("cbind(", paste(names(dat)[-length(dat)],collapse=","), ")~group") )
system.time(aov(formula, data=dat))
user system elapsed
10.00 0.22 10.25
In the practice dataset, Method 3 is a clear winner. However, when I do this on my actual data, computing on just 10 (of 433,685) columns using Method 3 takes this long:
user system elapsed
119.028 5.430 124.414
Not sure why it takes substantially longer on my actual data. I have access to a Linux cluster with upwards of 16 cores and 72GB of RAM.
Is there any way to compute this faster?
For simultaneously fitting many general linear models (such as ANOVA) using the same design matrix, the Bioconductor/R limma package provides a very fast lmFit() function. This is how to fit an ANOVA model using limma:
library(limma)
# generate some data
# (same dimensions as in your question)
nrows <- 1e4
ncols <- 5e2
nlevels <- 10
dat <- matrix(
runif(nrows * ncols),
nrow = nrows,
ncol = ncols
)
group <- factor(rep(
letters[1:nlevels],
ncols / nlevels
))
# construct the design matrix
# (same as implicitly used in your question)
dmat <- model.matrix(~ group)
# fit the ANOVA model
fit <- lmFit(dat, dmat)
On my laptop it finished in 0.4 - 0.45 seconds, on data of the same dimensions as the data in your question.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With