In the data frame there is a variable called YOB. As you can see, there are 333 NA values.
> summary(train$YOB)
Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
1880 1970 1983 1980 1993 2039 333
I identified some outliers and want to get rid of them. Anything less than 1900 or greater than 2003 should be removed. I tried to do this by indexing.
train = train[which(train$YOB >= 1900 & train$YOB <= 2003),]
Unfortunately, observations whose YOB was NA are also removed.
> summary(train$YOB)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1900 1970 1983 1980 1993 2003
On a side note, I face the same problem when using the subset command.
> train = subset(train, YOB >= 1900 & YOB <= 2003)
> summary(train$YOB)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1900 1970 1983 1980 1993 2003
I have also tried adding an is.na condition to both attempts, but with no success, e.g.
> train = train[which(!is.na(train$YOB) & train$YOB >= 1900 & train$YOB <= 2003),]
> summary(train$YOB)
Min. 1st Qu. Median Mean 3rd Qu. Max.
1900 1970 1983 1980 1993 2003
I would like to keep the observations that have NA in the YOB variable and only remove those whose numeric value falls outside the range. The idea is to impute the missing values in a second step.
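which returns the numeric positions of the TRUE elements only, so every row where the condition evaluates to NA is dropped. Indexing with the bare logical vector (without wrapping it in which) does not give the desired result either: wherever the index is NA, R returns a row in which every column is NA, even if the other columns of the original row held non-NA values.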
res1 <- train[train$YOB >= 1900 & train$YOB <= 2003,]
res1[is.na(res1$YOB),]
# YOB col2
#NA NA NA
The correct way is to add another condition with is.na, so that rows with a missing YOB are kept explicitly:
res2 <- train[is.na(train$YOB)| (train$YOB >= 1900 & train$YOB <= 2003),]
res2[is.na(res2$YOB),]
# YOB col2
#42 NA 0.2258094
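Since the question also tried subset, here is a minimal sketch of the same idea with that function. subset drops rows where the condition is NA, so the is.na test has to be part of the condition itself. The toy data frame and its values are made up for illustration and are not the asker's data.

```r
# Hypothetical toy data standing in for `train`; the values are invented
toy <- data.frame(YOB  = c(1880, 1950, NA, 2039, 1999, NA),
                  col2 = c(0.1, 0.2, 0.3, 0.4, 0.5, 0.6))

# Keep rows whose YOB is missing OR lies inside the allowed range
kept <- subset(toy, is.na(YOB) | (YOB >= 1900 & YOB <= 2003))

nrow(kept)              # 4 rows remain: 1950, NA, 1999, NA
sum(is.na(kept$YOB))    # both NA rows are preserved
```

The out-of-range years (1880 and 2039) are removed, while the rows with a missing YOB stay available for later imputation.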
Using a simple example:
set.seed(25)
d1 <- data.frame(v1 = c(NA, 1, 5), v2 = rnorm(3))
d1$v1 >1
#[1] NA FALSE TRUE
Here, the comparison returns NA for the missing value. If we wrap it in which,
which(d1$v1 >1)
#[1] 3
we get only the positions of the TRUE values. The OP wants both the NA rows and the rows that satisfy the logical condition to be returned. In that case,
d1[is.na(d1$v1)|d1$v1 > 1,]
# v1 v2
#1 NA -0.2118336
#3 5 -1.1533076
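To make the difference between the three indexing styles concrete, the sketch below applies each of them to the same small data frame. It uses fixed values in v2 instead of rnorm so the result does not depend on the random seed; the object names are chosen only for this illustration.

```r
# Small example with fixed values (names are illustrative)
d <- data.frame(v1 = c(NA, 1, 5), v2 = c(10, 20, 30))
cond <- d$v1 > 1                       # NA FALSE TRUE

with_which   <- d[which(cond), ]       # which() drops the NA position
with_logical <- d[cond, ]              # NA index yields an all-NA row
with_isna    <- d[is.na(d$v1) | cond, ]  # NA row kept with its real v2

nrow(with_which)     # 1
with_logical$v2      # NA 30  (the original v2 = 10 is lost)
with_isna$v2         # 10 30  (the original v2 = 10 is kept)
```

Only the last form keeps the row with the missing v1 together with its other column values, which is what a later imputation step would need.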
Data used for the res1 and res2 examples:
set.seed(29)
train <- data.frame(YOB = sample(c(NA, 1850:2015), 100, replace=TRUE),
                    col2 = rnorm(100))