I'm working with a data set resembling the data.frame generated below:
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5,3] <- NA
dta[2:10,4] <- NA
dta[7:20,5] <- NA
The columns contain NA values, with the last column having NAs in more than 60% of observations:
> sapply(dta, function(x) {table(is.na(x))})
$observation
FALSE 
   20 

$valueA
FALSE 
   20 

$valueB
FALSE  TRUE 
   16     4 

$valueC
FALSE  TRUE 
   11     9 

$valueD
FALSE  TRUE 
    6    14 
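For reference, the per-column NA share can be computed more compactly; a quick base R check (not in the original post, but equivalent to the table above):
> colMeans(is.na(dta))
observation      valueA      valueB      valueC      valueD 
       0.00        0.00        0.20        0.45        0.70 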
I would like to remove such columns inside a dplyr pipeline, somehow passing the condition to select.
This is easy in base R. For example, to keep only columns with less than 50% NAs I can do:
dta[, colSums(is.na(dta)) < nrow(dta) / 2]
which produces:
> head(dta[, colSums(is.na(dta)) < nrow(dta) / 2], 2)
  observation    valueA    valueB    valueC
1           1 0.2655087 0.9347052 0.8209463
2           2 0.3721239        NA        NA
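The same test also fits base Filter(), which subsets a data.frame by a per-column predicate (a small variant of the line above, shown for comparison):
Filter(function(x) mean(is.na(x)) < 0.5, dta)
This returns the same four-column data.frame as the colSums() approach.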
I'm interested in achieving the same flexibility in a dplyr pipeline:
Vectorize(require)(package = c("dplyr",     # Data manipulation
                               "magrittr"), # Compound assignment pipe (%<>%)
                   char = TRUE)
dta %<>%
  # Some transformations I'm doing on the data
  mutate_each(funs(as.numeric)) %>%
  # I want my select to take place here
Update for 2020: now that dplyr has reached 1.0.0, which incorporates where(), you can do:
dta %>% select(where(function(x) sum(is.na(x)) / length(x) < 0.5))
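where() also accepts a purrr-style lambda, so the same test can be written more compactly (mean(is.na(.x)) equals sum(is.na(.x)) / length(.x)):
dta %>% select(where(~ mean(is.na(.x)) < 0.5))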
I think this does the job:
dta %>% select_if(~mean(is.na(.)) < 0.5) %>% head()
  observation    valueA    valueB    valueC
1           1 0.2655087 0.9347052 0.8209463
2           2 0.3721239        NA        NA
3           3 0.5728534        NA        NA
4           4 0.9082078        NA        NA
5           5 0.2016819        NA        NA
6           6 0.8983897 0.3861141        NA
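Note that select_if() has since been superseded. Tying this back to the original pipeline, here is a sketch in dplyr >= 1.0.0 idioms, with across() standing in for the retired mutate_each()/funs():
library(dplyr)

dta %>%
  # Coerce all columns to numeric (replaces mutate_each(funs(as.numeric)))
  mutate(across(everything(), as.numeric)) %>%
  # Keep only columns with less than 50% NAs
  select(where(~ mean(is.na(.x)) < 0.5))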