 

Memory efficient scale() function

I am trying to scale a large matrix (the matrix I'm actually working with is much larger):

x <- matrix(rnorm(1e8), nrow=1e4)
x <- scale(x)

This matrix uses ~800 MB of memory. However, with lineprof, I see that the scale() function allocates 9.5 GB of memory and releases 8.75 GB after it has finished running. Because this function is so memory inefficient, it will sometimes crash my session when I run it.

I am trying to find a memory-efficient way to run this function. If I code it myself, it only allocates ~6.8 GB, but this still seems like a lot:

x <- matrix(rnorm(1e8), nrow=1e4)
u <- apply(x, 2, mean)
s <- apply(x, 2, sd)
x <- t((t(x) - u)/s)
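For what it's worth, the same centering and scaling can be written with sweep(), which avoids the two explicit transposes (each transpose copies the entire matrix). sweep() still allocates a full copy per call, so this is a sketch of a possible improvement rather than a guaranteed one:

```r
x <- matrix(rnorm(1e8), nrow = 1e4)
u <- colMeans(x)          # faster than apply(x, 2, mean)
s <- apply(x, 2, sd)
x <- sweep(x, 2, u, "-")  # subtract each column's mean
x <- sweep(x, 2, s, "/")  # divide by each column's sd
```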

I thought I could do even better by splitting the columns of x into groups, then scaling each column group separately:

x <- matrix(rnorm(1e8), nrow=1e4)
g <- split(1:ncol(x), ceiling(1:ncol(x)/100))
for (j in g) {
  x[, j] <- scale(x[, j])
}
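If the concern is peak memory rather than total allocation, one option (untested at the full matrix size) is to force a collection after each chunk, trading some speed for a lower high-water mark:

```r
x <- matrix(rnorm(1e8), nrow = 1e4)
g <- split(1:ncol(x), ceiling(1:ncol(x)/100))
for (j in g) {
  x[, j] <- scale(x[, j])
  gc()  # explicitly collect the temporaries from this chunk before the next
}
```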

With profvis, I see that overall this approach is LESS efficient: it allocates 10.8 GB of memory and releases 10.5 GB. However, I think R could garbage-collect within the for loop but is not doing so because it doesn't need to. Is that correct? If so, might this still be the best option?


Questions:

• What is the best way to code functions like these to avoid memory crashes? (If a package is available, even better)

• How do I account for garbage collection while profiling code? My understanding is that GC isn't always run unless it is needed.


Update: In terms of runtime, splitting the columns into 10 groups is not much slower than using the scale(x) function. Running both functions on a [1000 x 1000] matrix, the mean runtimes assessed with microbenchmark are:

• scale(x) = 154 ms

• splitting into 10 column groups = 167 ms

• splitting into 1000 column groups (i.e. scaling each column separately) = 373 ms
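The timings above were obtained with code along these lines (a sketch; exact numbers will vary by machine):

```r
library(microbenchmark)

x0 <- matrix(rnorm(1e6), nrow = 1e3)  # 1000 x 1000

scale_groups <- function(x, group_size) {
  g <- split(1:ncol(x), ceiling(1:ncol(x) / group_size))
  for (j in g) x[, j] <- scale(x[, j])
  x
}

microbenchmark(
  whole      = scale(x0),
  ten_groups = scale_groups(x0, 100),  # 10 groups of 100 columns
  per_column = scale_groups(x0, 1),    # each column separately
  times = 10
)
```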

asked Mar 01 '26 by adn bps

1 Answer

Given that you do not need to keep the original matrix, you can save some memory by modifying it in place (instead of making copies of it). You can also bypass base::scale with a simple for loop. For instance:

library(profvis) # to profile RAM usage and time
library(matrixStats) # CPU/RAM efficient column statistics (colMeans2, colSds)

profvis({
  set.seed(1)
  x = matrix(rnorm(1e7), nrow=1e3) # Note that I reduced nrow and ncol (I do not have enough RAM to test at your desired matrix dimension)
  x = scale(x)
})

profvis({
  set.seed(1)
  x = matrix(rnorm(1e7), nrow=1e3) # Note that I reduced nrow and ncol (I do not have enough RAM to test at your desired matrix dimension)
  mu = matrixStats::colMeans2(x)
  sigma = matrixStats::colSds(x)
  for(i in 1:ncol(x))
  {
    x[,i] = (x[,i]-mu[i])/sigma[i]
  }
})

On my machine, peak memory is substantially reduced with only these minor changes (please test at your desired matrix dimensions).
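A quick sanity check on a small matrix confirms the loop reproduces base::scale up to its attributes (scale() attaches scaled:center and scaled:scale, which the in-place version does not):

```r
set.seed(1)
a <- matrix(rnorm(20), nrow = 5)
b <- scale(a)
mu    <- matrixStats::colMeans2(a)
sigma <- matrixStats::colSds(a)
for (i in 1:ncol(a)) a[, i] <- (a[, i] - mu[i]) / sigma[i]
all.equal(a, b, check.attributes = FALSE)  # should be TRUE
```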

answered Mar 03 '26 by Ventrilocus