I'm fitting some models in R using brms. The data are from an experiment with per-word reading times, and I want to fit the same kinds of models on data from different words, so I put the code to fit the models into a function that accepts the data to run the models on as an argument. I am saving the models to files so that I don't need to re-fit them for certain evaluations I'll be doing.
However, I've noticed that when I call the function that fits the models, the RDS files that brm saves grow larger and larger in size, even when the models should have the same number of parameters. I realize there will be a little variation due to the random nature of MCMC sampling, but what appears to be happening is that all of the data in the function environment at the point when the model is saved is somehow ending up in the RDS with the model object. For instance, the first model has 11 parameters (3 fixed effects, and 1 intercept + 3 fixed effects for each of 2 random effects). This model takes up ~141 MB on disk. The second model has a different specification, but exactly the same number of parameters, and it takes up ~282 MB (2 x 141 MB) on disk. The third model has, again, the same number of parameters, and it takes up ~423 MB on disk (3 x 141 MB), and so on.
Since these models take a long time to fit, I've made an MWE that shows the same behavior on a smaller dataset with fewer samples drawn (brms will complain about the ESS, but the point is that the models finish quickly so that the sizes of the saved files can be inspected).
library(brms)
fit.models <- function() {
  set.seed(0)
  m1 <- brm(
    formula = Sepal.Length ~ Sepal.Width,
    data = iris,
    cores = 4,
    chains = 4,
    iter = 100,
    file = 'm1-function.rds'
  )
  set.seed(0)
  m2 <- brm(
    formula = Sepal.Length ~ Petal.Length,
    data = iris,
    cores = 4,
    chains = 4,
    iter = 100,
    file = 'm2-function.rds'
  )
}
fit.models()
set.seed(0)
m1 <- brm(
  formula = Sepal.Length ~ Sepal.Width,
  data = iris,
  cores = 4,
  chains = 4,
  iter = 100,
  file = 'm1-global.rds'
)
set.seed(0)
m2 <- brm(
  formula = Sepal.Length ~ Petal.Length,
  data = iris,
  cores = 4,
  chains = 4,
  iter = 100,
  file = 'm2-global.rds'
)
Here's the result of running this on my computer:
Note that m2-function.rds is roughly twice as large as m1-function.rds, while m1-global.rds is about the same size as m2-global.rds.
I'm not sure if this is unique to brms. However, I ran a test using some simple vectors and lists with random numbers, and all the file sizes come out exactly the same (5202 KB), regardless of whether the objects were saved from within the function or from the global environment.
test <- function() {
  x <- list(runif(1e6))
  saveRDS(x, 'x-function.rds')
  y <- list(runif(1e6))
  saveRDS(y, 'y-function.rds')
}
test()
x <- list(runif(1e6))
saveRDS(x, 'x-global.rds')
y <- list(runif(1e6))
saveRDS(y, 'y-global.rds')
So this doesn't seem to be default behavior in R for any objects saved to RDS. Whatever it is, something brms is doing with regard to how it saves files seems to be responsible. My guess is that it has something to do with how it decides what to include from the calling environment, but I don't know how to control that.
In case it's not obvious, my question is the following: how can I stop this happening so the files don't take up gobs of unnecessary space? In my case, the fitted models can take up to 1 GB already in some cases, so including that in every subsequent saved model is quickly going to get out of hand.
I don't know if this will help or not, but I made some hacky functions for chopping out environment bits so that functions could be stored more compactly. I haven't experimented with these lately.
This is the kind of task that the butcher package is supposed to do, but at present it doesn't have any brms methods (but the functions below might be suitable for integration there ...)
hack_size <- function(x, ...) {
  UseMethod("hack_size")
}
hack_size.stanfit <- function(x) {
  # replace the compiled model with an empty placeholder of the same class
  x@stanmodel <- structure(numeric(0), class="stanmodel")
  # replace the .MISC slot (an environment) with a fresh, empty one
  x@.MISC <- new.env()
  return(x)
}
hack_size.brmsfit <- function(x) {
  x$fit <- hack_size(x$fit)
  return(x)
}
hack_size.stanreg <- function(x) {
  x$stanfit <- hack_size(x$stanfit)
  return(x)
}
After running
saveRDS(hack_size(m1), "m1-hack.rds")
saveRDS(hack_size(m2), "m2-hack.rds")
I get
32M Apr 3 18:43 m2-function.rds
22M Apr 3 18:43 m1-function.rds
11M Apr 3 18:43 m1-global.rds
11M Apr 3 18:43 m2-global.rds
79K Apr 3 18:46 m1-hack.rds
77K Apr 3 18:46 m2-hack.rds
I don't know exactly what functionality the hacked version is capable of, but I use this in the examples for broom.mixed, so they're not completely crippled ...
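The listing above looks like shell output. The same comparison can also be made from within R. Here is a minimal sketch using only base R. Note that `rds_sizes()` is a hypothetical helper name introduced for illustration (it is not part of brms or base R), and the demo files are written to a temporary directory rather than to the files named above.

```r
# rds_sizes() is a hypothetical helper: it lists the .rds files in a
# directory together with their sizes in KB, largest first.
rds_sizes <- function(dir = ".") {
  files <- list.files(dir, pattern = "\\.rds$", full.names = TRUE)
  sizes <- data.frame(
    file = basename(files),
    kb   = round(file.size(files) / 1024, 1)
  )
  sizes[order(sizes$kb, decreasing = TRUE), ]
}

# Small demonstration in a throwaway directory
demo_dir <- tempfile("rds-demo-")
dir.create(demo_dir)
saveRDS(runif(1e5), file.path(demo_dir, "large.rds"))
saveRDS(runif(10), file.path(demo_dir, "small.rds"))

sizes <- rds_sizes(demo_dir)
print(sizes)
```

Calling `rds_sizes()` with no arguments should list the .rds files in the current working directory, which is where the examples above save them.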
This is an extension of Ben Bolker's answer. It didn't quite work for me because the environment was also being stored as part of the formula and data. I also had to use new.env(parent = baseenv()) instead of just new.env(), since the latter didn't seem to work on its own for me. I also removed the line that replaced the stanmodel, since the Bayes factor analyses I'm also doing require it to be there. I'm adding this as an additional answer rather than a comment since there were enough changes to the code that it wouldn't fit in a comment, and the formatting would be hard to follow.
So, in addition to Ben's code, I added these functions:
hack_size.brmsformula <- function(x) {
  environment(x$formula) <- new.env(parent = baseenv())
  return(x)
}
hack_size.data.frame <- function(x) {
  environment(attr(x, "terms")) <- new.env(parent = baseenv())
  return(x)
}
And I modified the brmsfit and stanreg hack_size functions to call these:
hack_size.brmsfit <- function(x) {
  x$formula <- hack_size(x$formula)
  x$data <- hack_size(x$data)
  x$fit <- hack_size(x$fit)
  return(x)
}
hack_size.stanreg <- function(x) {
  x$formula <- hack_size(x$formula)
  x$data <- hack_size(x$data)
  x$stanfit <- hack_size(x$stanfit)
  return(x)
}
I also slightly modified the hack_size.stanfit function:
hack_size.stanfit <- function(x) {
  # replace the .MISC slot (an environment) with a fresh, empty one
  x@.MISC <- new.env(parent = baseenv())
  return(x)
}
This worked when fitting the models inside of the function. The reduction in file size isn't quite as dramatic as when removing the stanmodel, so that could be removed as well if you won't need it again. (Time will tell if there are side effects of this for my analysis pipeline, but everything seems to work for now.)
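The underlying mechanism can be reproduced in base R without brms, which may help explain why resetting environments shrinks the files. A formula created inside a function keeps a reference to that function's call frame, and serializing the formula (which is what saveRDS does) writes out everything in that frame. The global environment is stored only as a reference, which would be consistent with the global fits above not growing. The sketch below is a small illustration and not what brms does internally. `make_formula()` and `serialized_size()` are hypothetical helpers introduced here.

```r
# make_formula() is a hypothetical helper: it returns the same formula
# either with or without a large unrelated local object in its frame.
make_formula <- function(with_big_object) {
  if (with_big_object) {
    big <- runif(1e6)  # roughly 8 MB of doubles that the formula never uses
  }
  y ~ x  # the formula's environment is this function's call frame
}

# Size in bytes of an object's serialized form (what saveRDS would write
# before compression)
serialized_size <- function(obj) length(serialize(obj, NULL))

f_small <- make_formula(FALSE)
f_big   <- make_formula(TRUE)

size_small <- serialized_size(f_small)
size_big   <- serialized_size(f_big)

# Swapping in a fresh, empty environment drops the captured locals
f_trimmed <- f_big
environment(f_trimmed) <- new.env(parent = baseenv())
size_trimmed <- serialized_size(f_trimmed)

print(c(small = size_small, big = size_big, trimmed = size_trimmed))
```

In the function-based MWE, m1 is still a local variable when m2 is saved, so if m2 carries a reference to that frame, m1 would be written out along with it. That matches the roughly doubling file sizes reported in the question.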