extract consecutive rows with similar values in a column more with a specific patch size

Question

I was looking out to extract consecutive rows with specified text repeated continuously for more than 5 times.

ex:

 A  B   C 
10  john    1
12  paul    1
23  kishan  1
12  teja    1
12  zebo    1
324 vauh    -1
3434    krish   -1
232 poo -1
4535    zoo 1
4343    doo 1
342 foo -1
123 soo 1
121 koo -1
34  loo -1
343454  moo -1
565343  noo -1
2323234 voo -1
3434    coo 1
545 xoo 1
6565    zoo 1
232321  qoo 1
34454   woo 1
546556  eoo 1
65665   roo -1
5343    too -1
3232    yoo 1
1212    uoo 1
23355667    ioo 1
787878  joo -1

I am looking out for the below result where the column value 'c' has consecutive 1's repeated more than 4 times as different groups .

Output:

A   B   C   group
10  john    1   1
12  paul    1   1
23  kishan  1   1
12  teja    1   1
12  zebo    1   1
3434 coo    1   2
545  xoo    1   2
6565    zoo 1   2
232321  qoo 1   2
34454   woo 1   2
546556  eoo 1   2

mozway · Accepted Answer

Using masks and factorize:

# identify 1s
m = df['C'].eq(1)
# group consecutive values
g = m.ne(m.shift()).cumsum()
# identify stretches of 5+ 1s
m2 = m & df.groupby(g)['C'].transform('size').ge(5)

out = (df.loc[m2]
         .assign(group=pd.factorize(g[m2])[0]+1)
       )

Output:

         A       B  C  group
0       10    john  1      1
1       12    paul  1      1
2       23  kishan  1      1
3       12    teja  1      1
4       12    zebo  1      1
17    3434     coo  1      2
18     545     xoo  1      2
19    6565     zoo  1      2
20  232321     qoo  1      2
21   34454     woo  1      2
22  546556     eoo  1      2

It_is_Chris · Answer

You can groupby the difference of col C, get the cumsum where the difference does not equal zero and transform the size of each group so you can only keep the groups that are greater than or equal to five

df[df['C'].eq(1) & df.groupby(df['C'].diff().ne(0).cumsum()).transform('size').gt(4)]

         A       B  C
0       10    john  1
1       12    paul  1
2       23  kishan  1
3       12    teja  1
4       12    zebo  1
17    3434     coo  1
18     545     xoo  1
19    6565     zoo  1
20  232321     qoo  1
21   34454     woo  1
22  546556     eoo  1

If you want a group column

# create the groups by calculating the diff and getting the cumsum
df['group'] = df['C'].diff().ne(0).cumsum()
# boolean indexing to keep values where C == 1 AND the size of each group is greater than 4
df[df['C'].eq(1) & df.groupby('group').transform('size').gt(4)]

         A       B  C  group
0       10    john  1      1
1       12    paul  1      1
2       23  kishan  1      1
3       12    teja  1      1
4       12    zebo  1      1
17    3434     coo  1      7
18     545     xoo  1      7
19    6565     zoo  1      7
20  232321     qoo  1      7
21   34454     woo  1      7
22  546556     eoo  1      7

extract consecutive rows with similar values in a column more with a specific patch size

Tags:

python

pandas

group-by

pylearner

2 Answers

mozway

It_is_Chris

Recent Activity

Donate For Us

extract consecutive rows with similar values in a column more with a specific patch size

Tags:

python

pandas

group-by

pylearner

2 Answers

mozway

It_is_Chris

Related questions

Recent Activity

Donate For Us