R - Calculate rolling mean of previous k non-NA values

Question

I'm trying to calculate the rolling mean of the previous k non-NA values within the dplyr/tidyverse framework. I've written a function that seems to work but was wondering if there's already a function from some package (which will probably be much more efficient than my attempt) doing exactly this. An example dataset:

tmp.df <- data.frame(
  x = c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA)
)

Let's say I want the rolling mean of the previous 3 non-NA values. Then the output y should be:

The first 5 elements of y are NAs because the first time x has 3 previous non-NA values is on row 6 and the average of those 3 elements is 2. The next y elements are self-explanatory. Row 9 gets a 4 because the 3 previous non-NA values of x are in rows 5, 6, and 7 and so on.

My attempt is this:

roll_mean_previous_k <- function(x, k){
  
  require(dplyr)
  
  res                      <- NA
  lagged_vector            <- dplyr::lag(x)
  lagged_vector_without_na <- lagged_vector[!is.na(lagged_vector)]
  previous_k_values        <- tail(lagged_vector_without_na, k)
  
  if (length(previous_k_values) >= k) res <- mean(previous_k_values)
  
  res
  
}

to be used as follows (using the slide_dbl function from the slider package):

library(dplyr)

tmp.df %>% 
  mutate(
    y = slider::slide_dbl(x, roll_mean_previous_k, k = 3, .before = Inf)
  )

which gives the desired output. However, I'm wondering if there's a ready-made, and (as mentioned before) more efficient way of doing this. I should mention that I know of rollmean and roll_mean from the zoo and RcppRoll packages respectively, but unless I'm mistaken, they seem to work on a fixed rolling window with the option of dealing with NA values (e.g ignoring them). In my case, I want to "extend" my window to include k non-NA values.

Any thoughts/suggestions are welcome.

EDIT - SIMULATION RESULTS

Thank you to all contributors. First of all, I had not mentioned that my datasets are indeed much larger and run often so any performance improvements are most welcome. I therefore ran the following simulation to check execution times, before deciding which answer to accept. Note, that some of the answers needed small tweaks to return the desired output, but if you feel that your solution is misrepresented (and therefore is less efficient than intended) feel free to let me know and I'll edit accordingly. I've used G. Grothendieck's trick from his answer below, to remove the need for if-else checks regarding the length of the lagged , non-NA vector.

So here's the simulation code:

library(tidyverse)
library(runner)
library(zoo)
library(slider)
library(purrr)
library(microbenchmark)

set.seed(20211004)
test_vector <- sample(x = 100, size = 1000, replace = TRUE)
test_vector[sample(1000, size = 250)] <- NA

# Based on GoGonzo's answer and the runner package
f_runner <- function(z, k){
  
  runner(
    x = z, 
    f = function(x) {
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k)) 
    }
  )
  
}

# Based on my inital answer (but simplified), also mentioned by GoGonzo 
f_slider <- function(z, k){
  
  slide_dbl(
    z,
    function(x) {
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k)) 
    },
    .before = Inf
  )
}

# Based on helios' answer. Return the correct results but with a warning.
f_helios <- function(z, k){
  
    reduced_vec <-  na.omit(z)
    unique_means <-  rollapply(reduced_vec, width = k, mean)
    
    start <-  which(!is.na(z))[k] + 1
    repeater <-  which(is.na(z)) + 1
    repeater_cut <-  repeater[(repeater > start-1) & (repeater <= length(z))]
    
    final <- as.numeric(rep(NA, length(z)))
    index <-  start:length(z)
    final[setdiff(index, repeater_cut)] <- unique_means
    final[(start):length(final)] <- na.locf(final)
    final
}

# Based on G. Grothendieck's answer (but I couldn't get it to run with the performance improvements)
f_zoo <- function(z, k){
  
  rollapplyr(
    z, 
    seq_along(z), 
    function(x, k){
      mean(`length<-`(tail(na.omit(head(x, -1)), k), k)) 
    },
    k)

}

# Based on AnilGoyal's answer
f_purrr <- function(z, k){
  
    map_dbl(
      seq_along(z), 
      ~ ifelse(
        length(tail(na.omit(z[1:(.x -1)]), k)) == k,
        mean(tail(na.omit(z[1:(.x -1)]), k)), 
        NA
        )
      )

}

# Check if all are identical #
all(
  sapply(
    list(
      # f_helios(test_vector, 10),
      f_purrr(test_vector, 10),
      f_runner(test_vector, 10),
      f_zoo(test_vector, 10)
    ), 
    FUN = identical, 
    f_slider(test_vector, 10),
  )
)

# Run benchmarking #
microbenchmark(
  # f_helios(test_vector, 10),
  f_purrr(test_vector, 10),
  f_runner(test_vector, 10),
  f_slider(test_vector, 10),
  f_zoo(test_vector, 10)
)

And the results:

Unit: milliseconds
                      expr     min       lq     mean   median       uq      max neval  cld
  f_purrr(test_vector, 10) 31.9377 37.79045 39.64343 38.53030 39.65085 104.9613   100   c 
 f_runner(test_vector, 10) 23.7419 24.25170 29.12785 29.23515 30.32485  98.7239   100  b  
 f_slider(test_vector, 10) 20.6797 21.71945 24.93189 26.52460 27.67250  32.1847   100 a   
    f_zoo(test_vector, 10) 43.4041 48.95725 52.64707 49.59475 50.75450 122.0793   100    d

Based on the above, and unless the code can be further improved, it seems as the slider and runner solutions are faster. Any final suggestions are more than welcome.

Many thanks for your time!!

GoGonzo · Accepted Answer

With runner it will be something like mean of 3-elements tail window of non-na values. You can achive the same result with slider

library(runner)
tmp.df <- data.frame(
  x = c(NA, 1, 2, NA, 3, 4, 5, NA, NA, NA, 6, 7, NA)
)

# using runner
tmp.df$y_runner <- runner(
  x = tmp.df$x, 
  f = function(x) {
    mean(
      tail(
        x[!is.na(x)],
        3
      )
    )
  }
)

# using slider
tmp.df$y_slider <- slider::slide_dbl(
  tmp.df$x, 
  function(x) {
    mean(
      tail(
        x[!is.na(x)],
        3
      )
    )
  }, 
  .before = Inf
)

tmp.df


#    x    y_runner y_slider
# 1  NA      NaN      NaN
# 2   1      1.0      1.0
# 3   2      1.5      1.5
# 4  NA      1.5      1.5
# 5   3      2.0      2.0
# 6   4      3.0      3.0
# 7   5      4.0      4.0
# 8  NA      4.0      4.0
# 9  NA      4.0      4.0
# 10 NA      4.0      4.0
# 11  6      5.0      5.0
# 12  7      6.0      6.0
# 13 NA      6.0      6.0

R - Calculate rolling mean of previous k non-NA values

Tags:

r

na

dplyr

rolling-computation

Constantinos

1 Answers

GoGonzo

Recent Activity

Donate For Us

R - Calculate rolling mean of previous k non-NA values

Tags:

r

na

dplyr

rolling-computation

Constantinos

1 Answers

GoGonzo

Related questions

Recent Activity

Donate For Us