 

How to make checking values for errors more efficient

I am dealing with around 700 large datasets (~80 variables and up to 200k rows each). Most variables are only allowed to take values within a certain range or from a certain set of values. Each dataset contains some errors, i.e. values that do not adhere to these rules.

So I need to find those errors and set them to NA. Currently I am doing it the following way:

library(data.table)

# Set values outside the allowed set to NA
dataset$variable[!dataset$variable %in% c(1, 2, 3)] <- NA
# Set values outside the allowed range to NA
dataset$variable[!(dataset$variable >= 0 &
                   dataset$variable <= 150)] <- NA
... # repeated for ~80 variables

My issue is that applying these rules takes some time (up to 10 seconds per dataset), and I would like to speed that process up. Do you have any ideas on how to make it more efficient? Or is this already the simplest way and I just have to live with it?

Thank you so much in advance!


Edit: For example, I am wondering: would it be faster to iterate over the rows instead of over the columns?

asked by tristndev

1 Answer

If you are looking for pure performance, I suggest you check out the data.table package, which enhances data.frames in R. data.table can do replacements in place, by reference (without copying). When you use the assignment operator (<-) on a subset of a data.frame, I believe a copy of the data is made, which causes slowdowns for large datasets.

A data.table solution for your problem:

library(data.table)

# Cast as a data.table
dataset <- as.data.table(dataset)

# Do the replacement by reference with the := operation in data.table
# See ?`:=`
dataset[!variable %in% c(1, 2, 3), variable := NA]
dataset[!(variable >= 0 & variable <= 150), variable := NA]

This should be much faster than replacing values in a regular data.frame with [<-.
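
If similar set/range rules apply to many of the ~80 columns, one way to avoid writing 80 separate statements is to keep the rules in a named list and loop over the columns with data.table's set(), which also updates by reference. This is only a rough sketch; the column names (variable, age, income) and the rules themselves are made up for illustration:

library(data.table)

# Hypothetical rules: for each column, a function that returns TRUE for valid values
rules <- list(
  variable = function(x) x %in% c(1, 2, 3),
  age      = function(x) x >= 0 & x <= 150,
  income   = function(x) x >= 0
)

# dataset is already a data.table from the step above
for (col in names(rules)) {
  bad <- which(!rules[[col]](dataset[[col]]))   # row indices that violate the rule
  if (length(bad)) {
    # Plain NA is coerced to the column's type; NA_real_/NA_integer_ can be
    # used instead to match the column type explicitly
    set(dataset, i = bad, j = col, value = NA)
  }
}

Each pass only touches the rows that actually violate a rule, so this should stay fast even with 200k rows per dataset.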

answered by ialm


