I'm average at R, and I would like some help with the following operation.
Let's say I have the following dataframe:
>df
ID Label
P1 M
P1 S
P2 M
P2 M
P2 S
P3 M
P3 S
P3 M
P4 S
P4 M
P5 M
P5 M
P5 S
I want to be able to select rows that occur in a particular sequence of the variable Label with respect to each ID.
For a pattern "MS", the expected output would be
ID Label
P1 M
P1 S
P2 M
P2 S
P3 M
P3 S
and for a pattern "MMS", the expected output would be
ID Label
P2 M
P2 M
P2 S
P5 M
P5 M
P5 S
and for pattern "SM", the expected output is:
ID Label
P3 S
P3 M
P4 S
P4 M
Please consider the fact that the data I am working on has a lot of rows and the solution I need to build needs to work for patterns of arbitrary lengths. (ex. "MSS", "SM", "MMSSMS", etc). I humbly request for your assistance.
Edit: I have updated this question (the example dataframe and the output example of pattern "MMS". I want to add that I want the pattern matching to happen after grouping the data using ID variable so that patterns can be found in the groups of data grouped by ID. Sorry for not being clear the first time.
Final Edit: Answers from @akrun, @boski, and @tmfmnk are working for me. @boski's and @akrun's solutions were faster in execution time (~2-10 seconds on 400k rows of data) compared to @tmfmnk's solution (~29 seconds on 400k rows of data). I advise the reader to refer all three of these solutions.
One option would be to compare the lead values and get the index grouped by 'ID'
library(data.table)
i1 <- unique(setDT(df)[, lapply(which(Reduce(`&`,
Map(`==`, shift(Label, n = 0:2, type = "lead"), c("M", "M", "S")))),
function(i) .I[i:(i+2)]) , by = ID]$V1)
df[i1]
# ID Label
#1: P2 M
#2: P2 M
#3: P2 S
#4: P5 M
#5: P5 M
#6: P5 S
df <- structure(list(ID = c("P1", "P1", "P2", "P2", "P2", "P3", "P3",
"P3", "P4", "P4", "P5", "P5", "P5"), Label = c("M", "S", "M",
"M", "S", "M", "S", "M", "S", "M", "M", "M", "S")),
class = "data.frame", row.names = c(NA, -13L))
You can try using gregexpr(). First paste all labels and find the start position of the pattern you are looking for.
> df
ID Label
1 P1 M
2 P1 S
3 P2 M
4 P2 M
5 P2 S
6 P3 M
7 P3 S
8 P3 S
9 P4 S
10 P4 M
11 P5 M
12 P5 M
13 P5 S
Edit
My previous solution did not retrieve the whole pattern (just the start).
pattern="SM"
starts=gregexpr(pattern=pattern,paste(df$Label,collapse=""))[[1]]
positions=as.vector(sapply(starts,function(x){
s=seq(x,x+nchar(pattern)-1)
if (all(df$ID[s]==df$ID[x])){
return(s)
} else {return(rep(NA,nchar(pattern)))}
}))
positions=positions[which(!is.na(positions))]
df[positions,]
df[positions,]
ID Label
1 P1 M
2 P1 S
4 P2 M
5 P2 S
6 P3 M
7 P3 S
12 P5 M
13 P5 S
pattern="MMS"
ID Label
3 P2 M
4 P2 M
5 P2 S
11 P5 M
12 P5 M
13 P5 S
pattern="SM"
ID Label
9 P4 S
10 P4 M
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With