I am working with hospital discharge data. All hospitalizations (cases) with the same Pat_ID are supposed to belong to the same person. However, I found that some Pat_IDs have cases with different ages and both sexes.
Imagine I have a data set like this:
Case_ID <- 1:8
Pat_ID <- c(rep("1",4), rep("2",3),"3")
Sex <- c(rep(1,4), rep(2,2),1,1)
Age <- c(rep(33,3),76,rep(19,2),49,15)
Pat_File <- data.frame(Case_ID, Pat_ID, Sex,Age)
Case_ID Pat_ID Sex Age
1 1 1 33
2 1 1 33
3 1 1 33
4 1 1 76
5 2 2 19
6 2 2 19
7 2 1 49
8 3 1 15
It was relatively easy to identify the Pat_IDs whose cases differ from each other: I calculated a per-patient average of age and/or sex (coded as 1 and 2) with the function aggregate, and then took the difference between that average and each case's age or sex. What I would like to do now is automatically identify or remove the cases where age or sex deviates from the majority of the cases for that patient ID. In my example, that would be cases 4 and 7.
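For reference, the identification step described above might look roughly like the sketch below. It is only an illustration of the averaging idea: the names `means`, `chk`, and `inconsistent_ids` are made up for this example, and the data frame is rebuilt so the snippet runs on its own.

```r
# Rebuild the example data so the snippet is self-contained
Pat_File <- data.frame(
  Case_ID = 1:8,
  Pat_ID  = c(rep("1", 4), rep("2", 3), "3"),
  Sex     = c(rep(1, 4), rep(2, 2), 1, 1),
  Age     = c(rep(33, 3), 76, rep(19, 2), 49, 15)
)

# Per-patient mean of Sex and Age
means <- aggregate(cbind(Sex, Age) ~ Pat_ID, data = Pat_File, FUN = mean)

# Attach the means to every case; the mean columns get the suffix "_mean"
chk <- merge(Pat_File, means, by = "Pat_ID", suffixes = c("", "_mean"))

# A patient is inconsistent if any case differs from the patient's mean
inconsistent_ids <- unique(
  chk$Pat_ID[chk$Sex != chk$Sex_mean | chk$Age != chk$Age_mean]
)
inconsistent_ids
```

On the example data this flags patients 1 and 2, but it only says which patients are inconsistent, not which of their cases is the odd one out.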
You could try
library(data.table)
Using Mode from "Is there a built-in function for finding the mode?":
Mode <- function(x) {
ux <- unique(x)
ux[which.max(tabulate(match(x, ux)))]
}
setDT(Pat_File)[, .SD[Age==Mode(Age) & Sex==Mode(Sex)] , by=Pat_ID]
# Pat_ID Case_ID Sex Age
#1: 1 1 1 33
#2: 1 2 1 33
#3: 1 3 1 33
#4: 2 5 2 19
#5: 2 6 2 19
#6: 3 8 1 15
Testing another case, with some values changed:
Pat_File$Sex[6] <- 1
Pat_File$Age[4] <- 16
setDT(Pat_File)[, .SD[Age==Mode(Age) & Sex==Mode(Sex)] , by=Pat_ID]
# Pat_ID Case_ID Sex Age
#1: 1 1 1 33
#2: 1 2 1 33
#3: 1 3 1 33
#4: 2 6 1 19
#5: 3 8 1 15
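If the goal is to identify the deviating cases rather than drop them, a base R variant of the same idea is to add a flag column. This is a minimal sketch, not part of the approach above: `pf`, `mode_age`, `mode_sex`, and the `Deviates` column are names chosen for illustration, and it uses the original example data rather than the modified test values.

```r
# Same Mode helper as above: most frequent value, first one wins on ties
Mode <- function(x) {
  ux <- unique(x)
  ux[which.max(tabulate(match(x, ux)))]
}

# Original example data under a separate (hypothetical) name
pf <- data.frame(
  Case_ID = 1:8,
  Pat_ID  = c(rep("1", 4), rep("2", 3), "3"),
  Sex     = c(rep(1, 4), rep(2, 2), 1, 1),
  Age     = c(rep(33, 3), 76, rep(19, 2), 49, 15)
)

# ave() applies Mode within each Pat_ID and repeats the result for every row
mode_age <- ave(pf$Age, pf$Pat_ID, FUN = Mode)
mode_sex <- ave(pf$Sex, pf$Pat_ID, FUN = Mode)

# TRUE where a case disagrees with the patient's most common age or sex
pf$Deviates <- pf$Age != mode_age | pf$Sex != mode_sex

pf$Case_ID[pf$Deviates]
```

Here the flagged cases are 4 and 7, and `pf[!pf$Deviates, ]` gives the cleaned data while the flagged rows stay available for review.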
This method works, I believe, though I doubt it's the quickest or most efficient way.
Essentially, I split the data frame by your grouping variable, found the 'mode' of each variable you're concerned about within each group, filtered out the observations that don't match all of those modes, and then stuck everything back together:
library(dplyr) # used only for filter(); base subsetting would work too
temp <- split(Pat_File, Pat_File$Pat_ID) # one data frame per patient
# Most frequent value per patient; which.max() keeps a single value on ties
Mode.Sex <- lapply(temp, function(x) { temp1 <- table(x$Sex); names(temp1)[which.max(temp1)] })
Mode.Age <- lapply(temp, function(x) { temp1 <- table(x$Age); names(temp1)[which.max(temp1)] })
temp.f <- vector("list", length(temp))
for (i in seq_along(temp)) {
  # keep only the cases that match both modes for this patient
  temp.f[[i]] <- temp[[i]] %>% filter(Sex == Mode.Sex[[i]] & Age == Mode.Age[[i]])
}
do.call("rbind", temp.f)
# Case_ID Pat_ID Sex Age
#1 1 1 1 33
#2 2 1 1 33
#3 3 1 1 33
#4 5 2 2 19
#5 6 2 2 19
#6 8 3 1 15