I am dealing with around 700 large datasets (~80 variables and up to 200k rows per dataset). Most variables are only allowed to fall within a certain range or to take certain values, but each dataset contains some erroneous values that do not adhere to these rules.
So I somehow need to find those errors and set them to NA. Currently I am doing this the following way:
dataset$variable[!dataset$variable %in% c(1,2,3)] <- NA
# or
dataset$variable[!(dataset$variable >= 0 &
                   dataset$variable <= 150)] <- NA
... # for 80 variables
My issue is that applying these rules takes some time (up to 10 seconds), and I would like to speed that process up. Do you have any ideas on how to make it more efficient? Or is this already the simplest way and I just have to live with it?
Thank you so much in advance!
Edit: I am wondering, for example: would it be faster to iterate over the rows instead of the columns?
If you are looking for pure performance, I suggest you check out the data.table package, which enhances data.frames in R. data.table can do replacements in place, by reference (without copying). When you replace values with the regular assignment operator (<-) in R, I believe a copy of the data is made, which causes slowdowns for large data sets.
A data.table solution for your problem:
library(data.table)
# Cast as a data.table
dataset <- as.data.table(dataset)
# Do the replacement by reference with the := operation in data.table
# See ?`:=`
dataset[!variable %in% c(1,2,3), variable := NA]
dataset[!(variable >= 0 & variable <= 150), variable := NA]
This should be much faster than replacing values in a regular data.frame with [<-.
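Since you have ~80 variables and ~700 datasets, it can also pay off to define the rules once and apply them in a loop over columns with data.table's set(), which likewise assigns by reference and saves you from writing 80 near-identical lines. The sketch below is only an illustration: the rules list, its structure, and the column names (variable1, variable2) are made up, so adapt them to your actual checks.
library(data.table)

# Hypothetical rule list: each entry is either a set of allowed values
# or a numeric c(lower, upper) range -- replace with your real rules.
rules <- list(
  variable1 = list(type = "values", allowed = c(1, 2, 3)),
  variable2 = list(type = "range",  bounds  = c(0, 150))
)

clean_dataset <- function(dt, rules) {
  for (col in names(rules)) {
    r <- rules[[col]]
    x <- dt[[col]]
    bad <- if (r$type == "values") {
      !(x %in% r$allowed)
    } else {
      !(x >= r$bounds[1] & x <= r$bounds[2])
    }
    # set() assigns by reference like :=, but with less overhead inside loops.
    # which() drops NAs, so values that are already NA are left untouched.
    set(dt, i = which(bad), j = col, value = NA)
  }
  invisible(dt)
}

# dataset <- as.data.table(dataset)
# clean_dataset(dataset, rules)
Because set() modifies the table in place, no copy of the 200k-row dataset is made for any of the columns, and the same rules object can be reused across all 700 files.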