During a simulation I created multiple data sets with more than 1,000,000 variables. Some of the values of these variables are NA, and in some cases all values of a variable are NA. I'd like to calculate the sum of the values of each variable, but get NA if all of its values are NA.
The problem with the common sum(x, na.rm=T) or sum(na.omit(x)) is that it returns 0 if all values are NA. So I've written my own function that handles NA the way I want:
sumna <- function(x) {
  sumna <- NULL
  return(ifelse(all(is.na(x)), NA, sum(na.omit(x))))
}
However, that implementation is rather slow.
Thus, I'm looking for an implementation or pre-implemented function that sums up values of a vector, omits NA and returns NA if all values are NA.
Many thanks in advance!
The sum_ function from hablar has the behavior the OP wants, so there is no need to reinvent the wheel:
library(hablar)
sum_(c(1:10, NA))
#[1] 55
sum_(c(NA, NA, NA))
#[1] NA
It can also be used with tidyverse or data.table:
library(dplyr)
df1 %>%
  summarise_all(sum_)
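If we do want to change the OP's custom function, if/else is a better option than ifelse. The condition all(is.na(x)) is a single TRUE/FALSE, while ifelse is meant for element-wise tests on vectors: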
sumna <- function(x) {
  if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
}
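A quick check of the if/else version on a few vectors, as a minimal sketch. The sample vectors are made up for illustration, and the last line shows the plain sum() behaviour that the question wants to avoid.

```r
# if/else version from the answer above
sumna <- function(x) {
  if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
}

r1 <- sumna(c(1:10, NA))                  # NA is dropped, the rest is summed
r2 <- sumna(c(NA, NA, NA))                # every value is NA, so the result is NA
r3 <- sum(c(NA, NA, NA), na.rm = TRUE)    # plain sum() gives 0 here

print(r1)
print(r2)
print(r3)
```

Note that a bare NA is a logical NA. If the result always has to be numeric, returning NA_real_ instead would be one option; that is a design choice and not part of the original answer.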
We can also use the vectorized colSums and then reset the columns that are entirely NA:
v1 <- colSums(df1, na.rm = TRUE)
v1[colSums(is.na(df1)) == nrow(df1)] <- NA
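To see what the two colSums lines do, here is a small sketch on a hypothetical data frame (the column names and values are invented for illustration). It assumes all columns are numeric, since colSums needs numeric input.

```r
# Hypothetical example data: column b is entirely NA
df1 <- data.frame(
  a = c(1, 2, NA),
  b = c(NA_real_, NA_real_, NA_real_),
  c = c(4, 5, 6)
)

# Step 1: sum each column while ignoring NA; the all-NA column gives 0
v1 <- colSums(df1, na.rm = TRUE)

# Step 2: a column whose NA count equals the row count is entirely NA,
# so its sum is set back to NA
v1[colSums(is.na(df1)) == nrow(df1)] <- NA

print(v1)
```

Both steps run once over the whole data frame instead of calling a function per column, which is why this variant tends to suit wide data.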
As the dataset is huge, we can also make use of the efficient data.table
library(data.table)
setDT(df1)[, lapply(.SD, sumna)]
Or using tidyverse
library(tidyverse)
df1 %>%
  summarise_all(sumna)
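In dplyr 1.0.0 and later, summarise_all is superseded by across(). A minimal sketch of the same call in that style, assuming the if/else sumna from the answer and a hypothetical example data frame:

```r
library(dplyr)

# if/else version from the answer above
sumna <- function(x) {
  if (all(is.na(x))) NA else sum(x, na.rm = TRUE)
}

# Hypothetical example data: column b is entirely NA
df1 <- data.frame(
  a = c(1, 2, NA),
  b = c(NA_real_, NA_real_, NA_real_)
)

# Apply sumna to every column and return a one-row data frame
res <- df1 %>%
  summarise(across(everything(), sumna))

print(res)
```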