Remove rows with >50% missing across certain columns in R

Question

Here is my data:

data <- data.frame(id=c(1,2,3,4,5),
                   ethnicity=c("asian",NA,NA,NA,"asian"),
                   age=c(34,NA,NA,NA,65),
                   a1=c(3,4,5,2,7),
                   a2=c("y","y","y",NA,NA),
                   a3=c("low", NA, "high", "med", NA),
                   a4=c("green", NA, "blue", "orange", NA))


  id ethnicity age a1   a2   a3     a4
   1     asian  34  3    y  low  green
   2      <NA>  NA  4    y <NA>   <NA>
   3      <NA>  NA  5    y high   blue
   4      <NA>  NA  2 <NA>  med orange
   5     asian  65  7 <NA> <NA>   <NA>

I would like to remove rows that have >50% missing in columns a1 to a4.

I have tried the code below, but I am having trouble specifying the columns it should apply to:

data[which(rowMeans(!is.na(data)) > 0.5), ] #This doesn't specify the column

miss2 <- c()
for(i in 1:nrow(data)) {
  if(length(which(is.na(data[4:7,]))) >= 0.5*ncol(data)) miss2 <- append(miss2,4:7) 
}
data1 <- data[-miss2,]

# I thought I specified the columns here, but I'm not getting the output I was hoping for (i.e., id 4 doesn't show up)

The code above looks at missing values in all columns. I would like to look only at the percentage of missing values in columns a1, a2, a3, and a4. What I'm hoping to get is below:

  id ethnicity age a1   a2   a3     a4
   1     asian  34  3    y  low  green
   2      <NA>  NA  4    y <NA>   <NA>
   3      <NA>  NA  5    y high   blue
   4      <NA>  NA  2 <NA>  med orange

Any help is appreciated, thank you!

Ottie · Accepted Answer

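You're really close. The main changes are to restrict the check to columns 4 to 7 (a1 to a4) and to index with a logical vector directly instead of with which (a vector of indices). Logical indexing also avoids a pitfall of negative indexing: if no row matches, data[-which(cond), ] returns zero rows rather than all of them.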

keep <- rowMeans(is.na(data[,4:7])) <= 0.5

keep
[1]  TRUE  TRUE  TRUE  TRUE FALSE

data[keep,]
  id ethnicity age a1   a2   a3     a4
1  1     asian  34  3    y  low  green
2  2      <NA>  NA  4    y <NA>   <NA>
3  3      <NA>  NA  5    y high   blue
4  4      <NA>  NA  2 <NA>  med orange
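
If the column positions might change, the same idea can be written against column names. The following is only a sketch: the helper name drop_sparse_rows and its arguments cols and max_missing are hypothetical, not part of the original answer or of any package. It rebuilds the question's data so it can be run on its own.

```r
# Hypothetical helper (an illustration, not part of the original answer):
# keep rows whose share of missing values across `cols` is at most `max_missing`.
drop_sparse_rows <- function(df, cols, max_missing = 0.5) {
  # drop = FALSE keeps a data frame even if `cols` names a single column
  frac_missing <- rowMeans(is.na(df[, cols, drop = FALSE]))
  df[frac_missing <= max_missing, , drop = FALSE]
}

# Same data as in the question
data <- data.frame(id = c(1, 2, 3, 4, 5),
                   ethnicity = c("asian", NA, NA, NA, "asian"),
                   age = c(34, NA, NA, NA, 65),
                   a1 = c(3, 4, 5, 2, 7),
                   a2 = c("y", "y", "y", NA, NA),
                   a3 = c("low", NA, "high", "med", NA),
                   a4 = c("green", NA, "blue", "orange", NA))

# Only columns a1 to a4 count toward the missing share
result <- drop_sparse_rows(data, c("a1", "a2", "a3", "a4"))
result$id  # ids 1 to 4 remain; id 5 is dropped (3 of 4 values missing)
```

Row 2 is kept because exactly half of its a1 to a4 values are missing, and the condition removes only rows with more than 50% missing.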

Tags: r, data-manipulation

T K
