Can R get the colMeans for the non-zero values of a data frame?
data<-data.frame(col1=c(1,0,1,0,3,3),col2=c(5,0,5,0,7,7))
colMeans(data)   # 1.33,4
I would like something like:
mean(data$col1[data$col1>0]) # 2
mean(data$col2[data$col2>0]) # 6
Thanks in advance :D
n <- 2E4
m <- 1E3
data <- matrix(runif(n*m),nrow = n)
system.time (col_means <- colSums(data)/colSums(!!data) ) 
#   user  system elapsed 
# 0.087   0.007   0.094 
system.time (   colMeans(NA^(data==0)*data, na.rm=TRUE)) 
#   user  system elapsed 
#  0.167   0.084   0.251 
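# Note: data is a matrix here, so vapply() visits every element, not every column
system.time (vapply(data, function(x) mean(x[x!=0]), numeric(1))) 
#   user  system elapsed 
#126.519   0.737 127.715 
library(dplyr)
# summarise_each() expects a data frame, so it fails on a matrix
system.time (summarise_each(data, funs(mean(.[.!=0])))) # Gave error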
You can use colSums on both the data and its "logical representation" (!!data is TRUE wherever a value is non-zero), then divide the column sums by the number of non-zero elements in each column:
colSums(data)/colSums(!!data)
col1 col2 
   2    6 
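As a rough sketch of what the one-liner does, the intermediate steps can be written out on the sample data from the question. The names nonzero and counts are only illustrative, and the all-zero column col3 is an assumption added here to show the 0/0 edge case.

```r
data <- data.frame(col1 = c(1, 0, 1, 0, 3, 3), col2 = c(5, 0, 5, 0, 7, 7))

nonzero <- data != 0              # logical matrix; same values as !!data
counts  <- colSums(nonzero)       # number of non-zero entries per column
result  <- colSums(data) / counts # zeros add nothing to the column sums

# Hypothetical all-zero column: 0 / 0 gives NaN rather than an error
data$col3 <- 0
with_zero_col <- colSums(data) / colSums(data != 0)
```

If a column can consist entirely of zeros, the NaN it produces may need handling downstream.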
You could change the 0 values to NA and then use colMeans, which has an na.rm=TRUE option. This is a two-step process: first convert the elements that are 0 to NA, then take the colMeans excluding the NA elements.
  is.na(data) <- data==0
  colMeans(data, na.rm=TRUE) 
  #   col1 col2 
  #    2    6 
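Note that the assignment form of is.na overwrites the zeros in data itself. If the original values are still needed afterwards, a minimal sketch is to do the replacement on a copy (tmp is just an illustrative name, not part of the answer above):

```r
data <- data.frame(col1 = c(1, 0, 1, 0, 3, 3), col2 = c(5, 0, 5, 0, 7, 7))

tmp <- data                        # work on a copy so data keeps its zeros
is.na(tmp) <- tmp == 0             # set every zero in the copy to NA
res <- colMeans(tmp, na.rm = TRUE) # means over the remaining values
```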
If you need that in a single step, we can turn the logical matrix (data==0) into a mask of NA and 1 with NA^: positions where the data is 0 become NA (NA^TRUE is NA^1, which is NA) and all other positions become 1 (NA^FALSE is NA^0, which is 1). Multiplying the mask by the original data restores the original value wherever the mask is 1 and leaves NA elsewhere. We can then call colMeans on that output as above.
   colMeans(NA^(data==0)*data, na.rm=TRUE)
   #  col1 col2 
   #   2    6 
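To see why the NA^ trick works, here is a small sketch on a plain vector instead of the data frame (mask, x and masked are illustrative names only):

```r
mask <- NA^c(TRUE, FALSE)          # TRUE -> NA^1 = NA, FALSE -> NA^0 = 1
x <- c(0, 5)
masked <- mask * x                 # the zero becomes NA, 5 stays 5
res <- mean(masked, na.rm = TRUE)  # mean over the non-NA values only
```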
Another option is using sapply/vapply. If the dataset is really big, the approaches above may not be a good idea, because they convert the data frame to a matrix and that can cause memory issues. By looping through the columns with sapply, or with the more specific vapply (which is a bit faster), we get the mean of the non-zero elements of each column.
 vapply(data, function(x) mean(x[x!=0]), numeric(1))
 #  col1 col2 
 #  2    6 
Or we can use summarise_each and specify the function inside funs after subsetting the non-zero elements. (In current dplyr releases summarise_each and funs are deprecated in favour of summarise with across.)
 library(dplyr)
 summarise_each(data, funs(mean(.[.!=0])))
 #  col1 col2
 #1    2    6