I have to check huge databases with repeated measures of several variables on individuals. As I can have more than 3 million observations, I would like to remove at least the data that I am sure are data entry errors.
Continuous variables
For example, focusing on the variable weight (see the data frame below), I know that individuals cannot lose more than 40% of their weight between one observation and the next. How can I detect observations with a larger weight loss, such as the third observation of individual "2", whose weight dropped from 30 grams to 3 grams?
Categorical variables
For example, consider the status of the individuals. An individual may be classified into one of three statuses ("juvenile", "adult non-breeder" or "adult breeder"; coded 1, 2 and 3 respectively). I know that an individual cannot become a juvenile ("1") once it is an adult ("2" or "3"), but a transition from 3 to 2 is possible. In this particular case I would like to detect observation 9, where individual "3" has been classified as "juvenile" although it was classified as "adult" in the previous observation.
Individuals <- c(1,1,1,2,2,2,3,3,3)
Weight <- c(10, 14, 20, 15, 30, 3, 12, 34, 30)
Week <- rep(1:3, 3)
Status <- c(1, 2, 3, 2, 3, 3, 2, 3, 1)
df <- data.frame(Individuals, Weight, Week, Status)
df
  Individuals Weight Week Status
1           1     10    1      1
2           1     14    2      2
3           1     20    3      3
4           2     15    1      2
5           2     30    2      3
6           2      3    3      3
7           3     12    1      2
8           3     34    2      3
9           3     30    3      1
Do you know how I can detect these two kinds of errors?
I hope this helps.
library(data.table)
library(zoo)
df <- data.table(df)
# Both helpers assume rows are ordered by Week within each individual

# Relative change in weight versus the previous observation
# (0 for the first observation of each individual)
calcreduction <- function(x){
  res <- diff(x)/x[-length(x)]
  return(c(0,res))
}
# Rows with WeightReduction < -0.4 lost more than 40% of their weight

# Label each status transition as "<previous><current>" ("00" for the first observation).
# Possible labels are 11, 12, 13, 22, 23, 32, 33 or 21, 31. The latter two are "bad".
getcomb <- function(x){
  res <- rbind(c(0,0),rollapply(x,2,paste))
  return(paste(res[,1],res[,2],sep=""))
}
# Rows with StatusChange "21" or "31" are impossible transitions

# Compute both columns per individual; you can then use logic on the
# new columns to decide what you want to do with these values
res <- df[,list("WeightReduction"=calcreduction(Weight),
                "StatusChange"=getcomb(Status),Weight,Week,Status),by=Individuals]
> res
   Individuals WeightReduction StatusChange Weight Week Status
1:           1       0.0000000           00     10    1      1
2:           1       0.4000000           12     14    2      2
3:           1       0.4285714           23     20    3      3
4:           2       0.0000000           00     15    1      2
5:           2       1.0000000           23     30    2      3
6:           2      -0.9000000           33      3    3      3
7:           3       0.0000000           00     12    1      2
8:           3       1.8333333           23     34    2      3
9:           3      -0.1176471           31     30    3      1
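To act on these values, one option is to turn them into logical flags and filter on those. The snippet below is only a minimal sketch and not the code above: it rebuilds the example data from the question and uses data.table's `shift()` to get the previous observation, as an assumed substitute for the `zoo`-based helper. The names `dt`, `prev_weight`, `prev_status`, `bad_weight`, `bad_status` and `flagged` are made up for illustration.

```r
library(data.table)

# Example data from the question
dt <- data.table(
  Individuals = c(1, 1, 1, 2, 2, 2, 3, 3, 3),
  Weight      = c(10, 14, 20, 15, 30, 3, 12, 34, 30),
  Week        = rep(1:3, 3),
  Status      = c(1, 2, 3, 2, 3, 3, 2, 3, 1)
)

# Make sure observations are in time order within each individual
setorder(dt, Individuals, Week)

# Previous weight and status within each individual
# (NA for the first observation of each individual)
dt[, `:=`(prev_weight = shift(Weight), prev_status = shift(Status)),
   by = Individuals]

# Weight loss of more than 40% relative to the previous observation
dt[, bad_weight := !is.na(prev_weight) &
                   (Weight - prev_weight) / prev_weight < -0.4]

# Adult (2 or 3) followed by juvenile (1)
dt[, bad_status := !is.na(prev_status) &
                   prev_status %in% c(2, 3) & Status == 1]

# Rows that violate at least one of the two rules
flagged <- dt[bad_weight | bad_status]
print(flagged)
```

On the example data this flags the third observation of individuals 2 and 3. If you prefer to drop those rows instead of inspecting them, `dt[!(bad_weight | bad_status)]` keeps everything else.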