Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

extract consecutive rows with similar values in a column more with a specific patch size

I was looking out to extract consecutive rows with specified text repeated continuously for more than 5 times.

ex:

 A  B   C 
10  john    1
12  paul    1
23  kishan  1
12  teja    1
12  zebo    1
324 vauh    -1
3434    krish   -1
232 poo -1
4535    zoo 1
4343    doo 1
342 foo -1
123 soo 1
121 koo -1
34  loo -1
343454  moo -1
565343  noo -1
2323234 voo -1
3434    coo 1
545 xoo 1
6565    zoo 1
232321  qoo 1
34454   woo 1
546556  eoo 1
65665   roo -1
5343    too -1
3232    yoo 1
1212    uoo 1
23355667    ioo 1
787878  joo -1

I am looking out for the below result where the column value 'c' has consecutive 1's repeated more than 4 times as different groups .

Output:

A   B   C   group
10  john    1   1
12  paul    1   1
23  kishan  1   1
12  teja    1   1
12  zebo    1   1
3434 coo    1   2
545  xoo    1   2
6565    zoo 1   2
232321  qoo 1   2
34454   woo 1   2
546556  eoo 1   2
like image 373
pylearner Avatar asked Dec 12 '25 16:12

pylearner


2 Answers

Using masks and factorize:

# identify 1s
m = df['C'].eq(1)
# group consecutive values
g = m.ne(m.shift()).cumsum()
# identify stretches of 5+ 1s
m2 = m & df.groupby(g)['C'].transform('size').ge(5)

out = (df.loc[m2]
         .assign(group=pd.factorize(g[m2])[0]+1)
       )

Output:

         A       B  C  group
0       10    john  1      1
1       12    paul  1      1
2       23  kishan  1      1
3       12    teja  1      1
4       12    zebo  1      1
17    3434     coo  1      2
18     545     xoo  1      2
19    6565     zoo  1      2
20  232321     qoo  1      2
21   34454     woo  1      2
22  546556     eoo  1      2
like image 73
mozway Avatar answered Dec 14 '25 05:12

mozway


You can groupby the difference of col C, get the cumsum where the difference does not equal zero and transform the size of each group so you can only keep the groups that are greater than or equal to five

df[df['C'].eq(1) & df.groupby(df['C'].diff().ne(0).cumsum()).transform('size').gt(4)]

         A       B  C
0       10    john  1
1       12    paul  1
2       23  kishan  1
3       12    teja  1
4       12    zebo  1
17    3434     coo  1
18     545     xoo  1
19    6565     zoo  1
20  232321     qoo  1
21   34454     woo  1
22  546556     eoo  1

If you want a group column

# create the groups by calculating the diff and getting the cumsum
df['group'] = df['C'].diff().ne(0).cumsum()
# boolean indexing to keep values where C == 1 AND the size of each group is greater than 4
df[df['C'].eq(1) & df.groupby('group').transform('size').gt(4)]

         A       B  C  group
0       10    john  1      1
1       12    paul  1      1
2       23  kishan  1      1
3       12    teja  1      1
4       12    zebo  1      1
17    3434     coo  1      7
18     545     xoo  1      7
19    6565     zoo  1      7
20  232321     qoo  1      7
21   34454     woo  1      7
22  546556     eoo  1      7
like image 43
It_is_Chris Avatar answered Dec 14 '25 05:12

It_is_Chris



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!