Select rows from a dataframe based on a pattern formed by consecutive values in a column

Question

I'm average at R, and I would like some help with the following operation.

Let's say I have the following dataframe:

    >df
    ID   Label
    P1   M
    P1   S
    P2   M
    P2   M
    P2   S
    P3   M
    P3   S
    P3   M
    P4   S
    P4   M
    P5   M
    P5   M
    P5   S

I want to be able to select rows that occur in a particular sequence of the variable Label with respect to each ID.

For a pattern "MS", the expected output would be

    ID   Label
    P1   M
    P1   S
    P2   M
    P2   S
    P3   M
    P3   S

and for a pattern "MMS", the expected output would be

    ID   Label
    P2   M
    P2   M
    P2   S
    P5   M
    P5   M
    P5   S

and for pattern "SM", the expected output is:

    ID   Label
    P3   S
    P3   M
    P4   S
    P4   M

Please consider the fact that the data I am working on has a lot of rows and the solution I need to build needs to work for patterns of arbitrary lengths. (ex. "MSS", "SM", "MMSSMS", etc). I humbly request for your assistance.

Edit: I have updated this question (the example dataframe and the output example of pattern "MMS". I want to add that I want the pattern matching to happen after grouping the data using ID variable so that patterns can be found in the groups of data grouped by ID. Sorry for not being clear the first time.

Final Edit: Answers from @akrun, @boski, and @tmfmnk are working for me. @boski's and @akrun's solutions were faster in execution time (~2-10 seconds on 400k rows of data) compared to @tmfmnk's solution (~29 seconds on 400k rows of data). I advise the reader to refer all three of these solutions.

akrun · Accepted Answer

One option would be to compare the lead values and get the index grouped by 'ID'

library(data.table)
i1 <- unique(setDT(df)[, lapply(which(Reduce(`&`, 
  Map(`==`, shift(Label, n = 0:2, type = "lead"), c("M", "M", "S")))), 
       function(i) .I[i:(i+2)]) , by = ID]$V1)
df[i1]
#    ID Label
#1: P2     M
#2: P2     M
#3: P2     S
#4: P5     M
#5: P5     M
#6: P5     S

data

df <- structure(list(ID = c("P1", "P1", "P2", "P2", "P2", "P3", "P3", 
"P3", "P4", "P4", "P5", "P5", "P5"), Label = c("M", "S", "M", 
"M", "S", "M", "S", "M", "S", "M", "M", "M", "S")), 
class = "data.frame", row.names = c(NA, -13L))

boski · Answer

You can try using gregexpr(). First paste all labels and find the start position of the pattern you are looking for.

> df
   ID Label
1  P1     M
2  P1     S
3  P2     M
4  P2     M
5  P2     S
6  P3     M
7  P3     S
8  P3     S
9  P4     S
10 P4     M
11 P5     M
12 P5     M
13 P5     S

Edit

My previous solution did not retrieve the whole pattern (just the start).

pattern="SM"
starts=gregexpr(pattern=pattern,paste(df$Label,collapse=""))[[1]]
positions=as.vector(sapply(starts,function(x){ 
  s=seq(x,x+nchar(pattern)-1)
  if (all(df$ID[s]==df$ID[x])){
    return(s)
  } else {return(rep(NA,nchar(pattern)))}
  }))
positions=positions[which(!is.na(positions))]

df[positions,]
df[positions,]
   ID Label
1  P1     M
2  P1     S
4  P2     M
5  P2     S
6  P3     M
7  P3     S
12 P5     M
13 P5     S

pattern="MMS"
   ID Label
3  P2     M
4  P2     M
5  P2     S
11 P5     M
12 P5     M
13 P5     S

pattern="SM"
   ID Label
9  P4     S
10 P4     M

Select rows from a dataframe based on a pattern formed by consecutive values in a column

Tags:

jerome

2 Answers

data

akrun

boski

Recent Activity

Donate For Us

Select rows from a dataframe based on a pattern formed by consecutive values in a column

Tags:

jerome

2 Answers

data

akrun

boski

Related questions

Recent Activity

Donate For Us