Detect impossible data entry errors from repeated measures in data frames

Tags: dataframe, r

I have to check huge databases with repeated measures of several variables for each individual. As I can have more than 3 million observations, I would like to remove at least the records that I am sure are data entry errors.

Continuous variables

For example, focusing on the variable weight (see the data frame below), I know that an individual cannot lose more than 40% of its weight between one observation and the next. How can I detect observations with a larger weight loss, such as the third observation of individual "2", whose weight dropped from 30 grams to 3 grams?

Categorical variables

For example, consider the status of the individuals. Each individual is classified into one of three statuses: "juvenile", "adult non breeder" or "adult breeder" (1, 2 and 3 respectively). An individual cannot become a juvenile ("1") once it is an adult ("2" or "3"), but a transition 3 --> 2 is possible. In this particular case I would like to detect observation 9, where individual "3" has been classified as "juvenile" although it was classified as an adult in the previous observation.

Individuals <- c(1,1,1,2,2,2,3,3,3)
Weight <- c(10, 14, 20, 15, 30, 3, 12, 34, 30)
Week <- rep(1:3, 3)
Status <- c(1, 2, 3, 2, 3, 3, 2, 3, 1)
df <- as.data.frame (cbind(Individuals, Weight, Week, Status))
df

  Individuals Weight Week Status
1           1     10    1      1
2           1     14    2      2
3           1     20    3      3
4           2     15    1      2
5           2     30    2      3
6           2      3    3      3
7           3     12    1      2
8           3     34    2      3
9           3     30    3      1

Do you know how I can detect these two kinds of errors?

asked Mar 23 '26 00:03 by Ruben


1 Answer

I hope this helps.

library(data.table)
library(zoo)

# Note: both helpers assume rows are ordered by Week within each individual
df <- data.table(df)

# Relative change in weight from the previous observation.
# Negative values are losses; the first observation of each individual gets 0.
calcreduction <- function(x) {
  res <- diff(x) / x[-length(x)]
  return(c(0, res))
}
# This makes it easy to get rid of values where WeightReduction < -0.4

# Label each status transition as "<previous><current>".
# You can have 11, 12, 13, 22, 23, 32, 33 or 21, 31. The latter two are "bad".
# The first observation of each individual gets "00".
getcomb <- function(x) {
  res <- rbind(c(0, 0), rollapply(x, 2, paste))
  return(paste(res[, 1], res[, 2], sep = ""))
}
# This makes it easy to get rid of values where the Status change is no good

# You can just pull the new vectors and then use logic
# to decide what you want to do with these values
res <- df[, list("WeightReduction" = calcreduction(Weight),
                 "StatusChange" = getcomb(Status), Weight, Week, Status),
          by = Individuals]

> res
   Individuals WeightReduction StatusChange Weight Week Status
1:           1       0.0000000           00     10    1      1
2:           1       0.4000000           12     14    2      2
3:           1       0.4285714           23     20    3      3
4:           2       0.0000000           00     15    1      2
5:           2       1.0000000           23     30    2      3
6:           2      -0.9000000           33      3    3      3
7:           3       0.0000000           00     12    1      2
8:           3       1.8333333           23     34    2      3
9:           3      -0.1176471           31     30    3      1
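
The table above only adds the two helper columns, so the flagging step still has to be written. Here is a minimal sketch of how it might look. It is not the code from this answer: it rebuilds the example data and swaps in base R's `ave()` for data.table and zoo so that it runs without extra packages. The helper `lag_by_group` and the columns `WeightChange`, `BadWeight` and `BadStatus` are hypothetical names chosen for illustration, and the sketch assumes "previous observation" means the previous `Week` of the same individual.

```r
# Rebuild the example data from the question
df <- data.frame(
  Individuals = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Weight      = c(10, 14, 20, 15, 30, 3, 12, 34, 30),
  Week        = rep(1:3, 3),
  Status      = c(1, 2, 3, 2, 3, 3, 2, 3, 1)
)

# Assumption: order by individual and week so that the row above
# is the previous observation of the same individual
df <- df[order(df$Individuals, df$Week), ]

# Hypothetical helper: previous value within each group
# (NA for the first observation of a group)
lag_by_group <- function(x, g) {
  ave(x, g, FUN = function(v) c(NA, v[-length(v)]))
}

prev_weight <- lag_by_group(df$Weight, df$Individuals)
prev_status <- lag_by_group(df$Status, df$Individuals)

# Relative change in weight; negative values are losses
df$WeightChange <- (df$Weight - prev_weight) / prev_weight

# Rule 1: a loss of more than 40% between consecutive observations
df$BadWeight <- !is.na(df$WeightChange) & df$WeightChange < -0.4

# Rule 2: an adult (2 or 3) reclassified as juvenile (1)
df$BadStatus <- !is.na(prev_status) & prev_status %in% c(2, 3) & df$Status == 1

# Rows to inspect or drop
suspicious <- df[df$BadWeight | df$BadStatus, ]
suspicious
```

On the example data this selects rows 6 (weight 30 to 3) and 9 (status 3 to 1). With the `res` table from above, the same logic would be `WeightReduction < -0.4` for the weight rule and `StatusChange %in% c("21", "31")` for the status rule. Whether you drop the flagged rows or only review them is up to you. Keep in mind that removing a row changes which observation counts as "previous" for the next one.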

answered Mar 24 '26 16:03 by road_to_quantdom


