Here is my data:
data <- data.frame(id=c(1,2,3,4,5),
ethnicity=c("asian",NA,NA,NA,"asian"),
age=c(34,NA,NA,NA,65),
a1=c(3,4,5,2,7),
a2=c("y","y","y",NA,NA),
a3=c("low", NA, "high", "med", NA),
a4=c("green", NA, "blue", "orange", NA))
id ethnicity age a1 a2 a3 a4
1 asian 34 3 y low green
2 <NA> NA 4 y <NA> <NA>
3 <NA> NA 5 y high blue
4 <NA> NA 2 <NA> med orange
5 asian 65 7 <NA> <NA> <NA>
I would like to remove rows that have >50% missing in columns a1 to a4.
I have tried the code below, but I am having trouble restricting it to the columns I care about:
data[which(rowMeans(!is.na(data)) > 0.5), ] #This doesn't specify the column
miss2 <- c()
for(i in 1:nrow(data)) {
if(length(which(is.na(data[4:7,]))) >= 0.5*ncol(data)) miss2 <- append(miss2,4:7)
}
data1 <- data[-miss2,]
# I thought I specified the columns here, but I'm not getting the output I was hoping for (i.e. id 4 doesn't show up)
The code above looks at missing values in all columns. I would like to compute the percentage of missing values in columns a1, a2, a3 and a4 only. What I'm hoping to get is below:
id ethnicity age a1 a2 a3 a4
1 asian 34 3 y low green
2 <NA> NA 4 y <NA> <NA>
3 <NA> NA 5 y high blue
4 <NA> NA 2 <NA> med orange
Any help is appreciated, thank you!
You're really close. The main issue is using which (a vector of indices) instead of simply a logical vector; on top of that, the NA check needs to be restricted to columns a1 to a4 (positions 4:7):
# TRUE for rows where at most half of columns 4:7 (a1 to a4) are NA
keep <- rowMeans(is.na(data[, 4:7])) <= 0.5
keep
[1] TRUE TRUE TRUE TRUE FALSE
data[keep,]
id ethnicity age a1 a2 a3 a4
1 1 asian 34 3 y low green
2 2 <NA> NA 4 y <NA> <NA>
3 3 <NA> NA 5 y high blue
4 4 <NA> NA 2 <NA> med orange
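If the column order might change, it may be safer to select the columns by name rather than by position. Here is a minimal sketch of the same idea, assuming the columns are still called a1 to a4 as in the question; check_cols, prop_missing and data1 are just illustrative names, not anything required by R:

```r
# Same data as in the question
data <- data.frame(id = c(1, 2, 3, 4, 5),
                   ethnicity = c("asian", NA, NA, NA, "asian"),
                   age = c(34, NA, NA, NA, 65),
                   a1 = c(3, 4, 5, 2, 7),
                   a2 = c("y", "y", "y", NA, NA),
                   a3 = c("low", NA, "high", "med", NA),
                   a4 = c("green", NA, "blue", "orange", NA))

# Columns to check, selected by name (hypothetical helper variable)
check_cols <- c("a1", "a2", "a3", "a4")

# Proportion of missing values per row, within those columns only
prop_missing <- rowMeans(is.na(data[, check_cols]))

# Keep rows where at most half of the checked columns are missing
data1 <- data[prop_missing <= 0.5, ]
```

With the example data this should keep ids 1 to 4, the same rows as the positional version above. Note that row 2 is exactly 50% missing, so it is kept because the condition removes only rows with more than 50% missing.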