I'm working with a data set resembling the data.frame generated below:
set.seed(1)
dta <- data.frame(observation = 1:20,
                  valueA = runif(n = 20),
                  valueB = runif(n = 20),
                  valueC = runif(n = 20),
                  valueD = runif(n = 20))
dta[2:5,3] <- NA
dta[2:10,4] <- NA
dta[7:20,5] <- NA
The columns contain NA values, with the last column having NAs in more than 60% of observations:
> sapply(dta, function(x) {table(is.na(x))})
$observation
FALSE 
   20 

$valueA
FALSE 
   20 

$valueB
FALSE  TRUE 
   16     4 

$valueC
FALSE  TRUE 
   11     9 

$valueD
FALSE  TRUE 
    6    14 
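For reference, the per-column NA share can be computed more compactly; a quick base R check (not in the original post, but equivalent to the table above):
> colMeans(is.na(dta))
observation      valueA      valueB      valueC      valueD 
       0.00        0.00        0.20        0.45        0.70 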
I would like to remove such columns inside a dplyr pipeline, somehow passing the condition to select.
This is easy in base R. For example, to keep only columns with less than 50% NAs I can do:
dta[, colSums(is.na(dta)) < nrow(dta) / 2]
which produces:
> head(dta[, colSums(is.na(dta)) < nrow(dta) / 2], 2)
  observation    valueA    valueB    valueC
1           1 0.2655087 0.9347052 0.8209463
2           2 0.3721239        NA        NA
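The same test also fits base Filter(), which subsets a data.frame by a per-column predicate (a small variant of the line above, shown for comparison):
Filter(function(x) mean(is.na(x)) < 0.5, dta)
This returns the same four-column data.frame as the colSums() approach.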
I'm interested in achieving the same flexibility in a dplyr pipeline:
Vectorize(require)(package = c("dplyr",     # Data manipulation
                               "magrittr"), # Compound assignment pipe (%<>%)
                   char = TRUE)
dta %<>%
  # Some transformations I'm doing on the data
  mutate_each(funs(as.numeric)) %>%
  # I want my select to take place here
Update for 2020: now that dplyr has reached 1.0.0, which incorporates where(), you can do:
dta %>% select(where(function(x) sum(is.na(x)) / length(x) < 0.5))
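where() also accepts a purrr-style lambda, so the same test can be written more compactly (mean(is.na(.x)) equals sum(is.na(.x)) / length(.x)):
dta %>% select(where(~ mean(is.na(.x)) < 0.5))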
I think this does the job:
dta %>% select_if(~mean(is.na(.)) < 0.5) %>% head()
  observation    valueA    valueB    valueC
1           1 0.2655087 0.9347052 0.8209463
2           2 0.3721239        NA        NA
3           3 0.5728534        NA        NA
4           4 0.9082078        NA        NA
5           5 0.2016819        NA        NA
6           6 0.8983897 0.3861141        NA
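Note that select_if() has since been superseded. Tying this back to the original pipeline, here is a sketch in dplyr >= 1.0.0 idioms, with across() standing in for the retired mutate_each()/funs():
library(dplyr)

dta %>%
  # Coerce all columns to numeric (replaces mutate_each(funs(as.numeric)))
  mutate(across(everything(), as.numeric)) %>%
  # Keep only columns with less than 50% NAs
  select(where(~ mean(is.na(.x)) < 0.5))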