Speeding up a function: checking NA count before computing mean

Question

The function below calculates the mean of a vector. However, it first checks the proportion of NA's present in the vector and if above a given threshold, returns NA instead of the mean.

My issue is that my current implementation is rather innefficient. It takes more than 7x longer than simply running mean(vec, na.rm=TRUE)

I tried an alternate method using na.omit, but that is even slower.

Given the size of my data, executing the single lapply is taking over 40 minutes.

Any suggestions on how to accomplish the same task more quickly?

UPDATE - RE: @thelatemail 's solution and @Arun's comment:

I am executing this function over several hundred groups, each group of varying size. The sample data (originally) provided in this question was provided as a neat data frame simply for ease of creating artificial data.

Alternate sample data to avoid the confusion

# Sample Data 
# ------------
  set.seed(1)
  # slightly different sizes for each group
  N1 <- 5e3
  N2 <- N1 + as.integer(rnorm(1, 0, 100))

  # One group has only a moderate amount of NA's
  SAMP1 <- rnorm(N1)
  SAMP1[sample(N1, .25 * N1, FALSE)] <- NA  # add in NA's

  # Another group has many NA's
  SAMP2 <- rnorm(N2)
  SAMP2[sample(N2, .95 * N2, FALSE)] <- NA  # add in large number of NA's

  # put them all in a list
  SAMP.NEW <- list(SAMP1, SAMP2)

  # keep it clean
  rm(SAMP1, SAMP2)

# Execute 
# -------    
  lapply(SAMP.NEW, meanIfThresh)

Original Sample Data, function etc

# Sample Data 
# ------------
  set.seed(1)
  rows <- 20000  # actual data has more than 7M rows
  cols <-  1000  

  SAMP <- replicate(cols, rnorm(rows))
  SAMP[sample(length(SAMP), .25 * length(SAMP), FALSE)] <- NA  # add in NA's

  # Select 5 random rows, and have them be 90% NA
  tooSparse <- sample(rows, 5)
  for (r in tooSparse)
    SAMP[r, sample(cols, cols * .9, FALSE)] <- NA

# Function 
# ------------
    meanIfThresh <- function(vec, thresh=12/15) { 
      # Calculates the mean of vec, however, 
      #   if the number of non-NA values of vec is less than thresh, returns NA 

      # thresh : represents how much data must be PRSENT. 
      #          ie, if thresh is 80%, then there must be at least 


      len <- length(vec)

      if( (sum(is.na(vec)) / len) > thresh)
        return(NA_real_)
      # if the proportion of NA's is greater than the threshold, return NA
      # example:  if I'm looking at 14 days, and I have 12 NA's,
      #            my proportion is 85.7 % = (12 / 14)
      #            default thesh is  80.0 % = (12 / 15)
      #            Thus, 12 NAs in a group of 14 would be rejected


    # else, calculate the mean, removing NA's
    return(mean(vec, na.rm=TRUE))       
  }


  # Execute
  # -----------------
  apply(SAMP, 1, meanIfThresh)

  # Compare with `mean`
  #----------------
  plain    <- apply(SAMP, 1, mean, na.rm=TRUE)
  modified <- apply(SAMP, 1, meanIfThresh)

  # obviously different
  identical(plain, modified)
  plain[tooSparse]
  modified[tooSparse]


  microbenchmark( "meanIfThresh"   = apply(SAMP, 1, meanIfThresh)
                , "mean (regular)" = apply(SAMP, 1, mean, na.rm=TRUE)
                , times = 15L)

 #  With the actual data, the penalty is sevenfold
 #  Unit: seconds
 #           expr      min       lq   median       uq      max neval
 #   meanIfThresh 1.658600 1.677472 1.690460 1.751913 2.110871    15
 # mean (regular) 1.422478 1.485320 1.503468 1.532175 1.547450    15

thelatemail · Accepted Answer

Couldn't you just replace the high NA rows' mean values afterwards like so?:

# changed `result <- apply(SAMP,1,mean,na.rm=TRUE)`
result <- rowMeans(SAMP, na.rm=TRUE)
NArows <- rowSums(is.na(SAMP))/ncol(SAMP) > 0.8
result[NArows] <- NA

Some benchmarking:

Ricardo <- function(vec, thresh=12/15) {
    len <- length(vec)
    if( (sum(is.na(vec)) / len) > thresh)
        return(NA_real_)
    return(mean(vec, na.rm=TRUE))       
}

DanielFischer <- function(vec, thresh=12/15) {

    len <- length(vec)
    nas <- is.na(vec)
    Nna <- sum(nas)
    if( (Nna / len) > thresh)
        return(NA_real_)
    return(sum(vec[!nas])/(len-Nna))
}

thelatemail <- function(mat) {
    result <- rowMeans(mat, na.rm=TRUE)
    NArows <- rowSums(is.na(mat))/ncol(mat) > 0.8
    result[NArows] <- NA
    result
}

require(microbenchmark)
microbenchmark(m1 <- apply(SAMP, 1, Ricardo), 
               m2 <- apply(SAMP, 1, DanielFischer), 
               m3 <- thelatemail(SAMP), times = 5L)

Unit: milliseconds
                                expr       min        lq    median        uq       max neval
       m1 <- apply(SAMP, 1, Ricardo) 2923.7260 2944.2599 3066.8204 3090.8127 3105.4283     5
 m2 <- apply(SAMP, 1, DanielFischer) 2643.4883 2683.1034 2755.7032 2799.5155 3089.6015     5
                m3 <- latemail(SAMP)  337.1862  340.6339  371.6148  376.5517  383.4436     5

all.equal(m1, m2) # TRUE
all.equal(m1, m3) # TRUE

Daniel Fischer · Answer

Is it so that you have to go twice through your vector vec in your function? If you can store your NA first, maybe it could speed up your calculations a bit:

meanIfThresh2 <- function(vec, thresh=12/15) { 

  len <- length(vec)
  nas <- is.na(vec)
  Nna <- sum(nas)
  if( (Nna / len) > thresh)
    return(NA_real_)

  return(sum(vec[!nas])/(len-Nna))
}

EDIT: I performed the similar benchmarking, to see the effect on this change:

> microbenchmark(  "meanIfThresh"   = apply(SAMP, 1, meanIfThresh)
+                 , "meanIfThresh2"   = apply(SAMP, 1, meanIfThresh2)
+                 , "mean (regular)" = apply(SAMP, 1, mean, na.rm=TRUE)
+                 , times = 15L)
Unit: seconds
           expr      min       lq   median       uq      max neval
   meanIfThresh 2.009858 2.156104 2.158372 2.166092 2.192493    15
  meanIfThresh2 1.825470 1.828273 1.829424 1.834407 1.872028    15
 mean (regular) 1.868568 1.882526 1.889852 1.893564 1.907495    15

Speeding up a function: checking NA count before computing mean

Tags:

algorithm

r

vectorization

na

apply

UPDATE - RE: @thelatemail 's solution and @Arun's comment:

Alternate sample data to avoid the confusion

Original Sample Data, function etc

Ricardo Saporta

2 Answers

thelatemail

Daniel Fischer

Recent Activity

Donate For Us

Speeding up a function: checking NA count before computing mean

Tags:

algorithm

r

vectorization

na

apply

UPDATE - RE: @thelatemail 's solution and @Arun's comment:

Alternate sample data to avoid the confusion

Original Sample Data, function etc

Ricardo Saporta

2 Answers

thelatemail

Daniel Fischer

Related questions

Recent Activity

Donate For Us