I have a data frame with MRN, dates, and a test value.
I need to select all the first rows per MRN that have three consecutive values above 0.5.
This is an example version of the data:
MRN Collected_Date ANC
1 001 2015-01-02 0.345
2 001 2015-01-03 0.532
3 001 2015-01-04 0.843
4 001 2015-01-05 0.932
5 002 2015-03-03 0.012
6 002 2015-03-05 0.022
7 002 2015-03-06 0.543
8 002 2015-03-07 0.563
9 003 2015-08-02 0.343
10 003 2015-08-03 0.500
11 003 2015-08-04 0.734
12 003 2015-08-05 0.455
13 004 2014-01-02 0.001
14 004 2014-01-03 0.500
15 004 2014-01-04 0.562
16 004 2014-01-05 0.503
Example code:
df <- data.frame(MRN = c('001','001','001','001',
'002','002','002','002',
'003','003','003','003',
'004','004','004','004'),
Collected_Date = as.Date(c('01-02-2015','01-03-2015','01-04-2015','01-05-2015',
'03-03-2015','03-05-2015','03-06-2015','03-07-2015',
'08-02-2015','08-03-2015','08-04-2015','08-05-2015',
'01-02-2014','01-03-2014','01-04-2014','01-05-2014'),
format = '%m-%d-%Y'),
ANC = as.numeric(c('0.345','0.532','0.843','0.932',
'0.012','0.022','0.543','0.563',
'0.343','0.500','0.734','0.455',
'0.001','0.500','0.562','0.503')))
Currently, I am using a very awkward approach using the lag function to calculate the date difference, then filter for all values >= 0.5, and then sum up the values, which helps to select the date of the THIRD value. I then substract two days to get the date of the first value:
df %>% group_by(MRN) %>%
mutate(., days_diff = abs(Collected_Date[1] - Collected_Date)) %>%
filter(ANC >= 0.5) %>%
mutate(days = days_diff + lag((days_diff))) %>%
filter(days == 5) %>%
mutate(Collected_Date = Collected_Date - 2) %>%
select(MRN, Collected_Date)
Output:
Source: local data frame [2 x 2] Groups: MRN
MRN Collected_Date
1 001 2015-01-03
2 004 2014-01-03
There must be a way simpler / more elegant way. Also, it does not give accurate results if there are gaps between the test dates.
My desired output for this example is:
MRN Collected_Date ANC
1 001 2015-01-03 0.532
2 004 2014-01-03 0.500
So if at least three consecutive test values are >= 0.5, the date of the FIRST value should be returned.
If there are not at least three consecutive values >= 0.5, NA should be returned.
Any help is greatly appreciated!
Thank you very much!
The easiest way is to use the zoo library in conjunction with dplyr. Within the zoo package there is a function called rollapply, we can use this to calculate a function value for a window of time.
In this example, we could apply the window to calculate the minimum of the next three values, and then apply the logic specified.
df %>% group_by(MRN) %>%
mutate(ANC=rollapply(ANC, width=3, min, align="left", fill=NA, na.rm=TRUE)) %>%
filter(ANC >= 0.5) %>%
filter(row_number() == 1)
# MRN Collected_Date ANC
# 1 001 2015-01-03 0.532
# 2 004 2014-01-03 0.500
In the code above we have used rollapply to calculate the minimum of the next 3 items. To see how this works compare the following:
rollapply(1:6, width=3, min, align="left", fill=NA) # [1] 1 2 3 4 NA NA
rollapply(1:6, width=3, min, align="center", fill=NA) # [1] NA 1 2 3 4 NA
rollapply(1:6, width=3, min, align="right", fill=NA) # [1] NA NA 1 2 3 4
So in our example, we have aligned from the left, so it starts from the current location and looks forward to the next 2 values.
Lastly we filter by the appropriate values, and take the first observation of each group.
Base approach:
Use rle to find sequences of 3 or more and grab the first one
df <- data.frame(MRN = c('001','001','001','001','002','002','002','002','003','003','003','003','004','004','004','004'), Collected_Date = as.Date(c('01-02-2015','01-03-2015','01-04-2015','01-05-2015', '03-03-2015','03-05-2015','03-06-2015','03-07-2015', '08-02-2015','08-03-2015','08-04-2015','08-05-2015', '01-02-2014','01-03-2014','01-04-2014','01-05-2014'), format = '%m-%d-%Y'), ANC = as.numeric(c('0.345','0.532','0.843','0.932', '0.012','0.022','0.543','0.563', '0.343','0.500','0.734','0.455', '0.001','0.500','0.562','0.503')))
df[as.logical(with(df, ave(ANC, MRN, FUN = function(x)
cumsum(x >= .5 & with(rle(x >= .5), rep(lengths, lengths)) >= 3) == 1))), ]
# MRN Collected_Date ANC
# 2 001 2015-01-03 0.532
# 14 004 2014-01-03 0.500
Maybe this version is easier to understand
df[as.logical(with(df, ave(ANC, MRN, FUN = function(x) {
r <- rle(x >= .5)
r <- rep(r$lengths, r$lengths)
cumsum(r == 3 & x >= .5) == 1
}))), ]
edit
df <- df[c(1:4,4,4,4,5,5,5,5:16), ]
df[as.logical(with(df, ave(ANC, MRN, FUN = function(x)
cumsum(x >= .5 & with(rle(x >= .5), rep(lengths, lengths)) >= 3) == 1))), ]
# MRN Collected_Date ANC
# 2 001 2015-01-03 0.532
# 14 004 2014-01-03 0.500
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With