
Finding the first row that meets conditions of a mask and selecting one row after it

Tags: python, pandas

This is my dataframe:

import pandas as pd 
df = pd.DataFrame(
    {
        'a': [100, 1123, 123, 100, 1, 0, 1],
        'b': [1000, 11123, 1123, 0, 55, 0, 1],
        'c': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    }
)

And this is my expected output, with a new column x:

      a      b  c    x
0   100   1000  a  NaN
1  1123  11123  b  NaN
2   123   1123  c  NaN
3   100      0  d  NaN
4     1     55  e    e
5     0      0  f  NaN
6     1      1  g  NaN

By using a mask:

mask = (
    (df.a > df.b)
)

First, I need to find the first row where the mask is True, which is row 3 in my example. Then I want to move one row down and use that row's value in column c for column x.

So in my example, the first row matching the mask is row 3 and the row after it is row 4, which is why e is selected for column x.

Note that the row after the first match does not need to satisfy any condition. For example, df.a > df.b does NOT have to hold for row 4.

This is what I have tried:

df.loc[mask.cumsum().eq(1) & mask, 'x'] = df.c.shift(-1)

Here are some additional dfs for testing whether the code works in other cases. For instance, if no row meets the conditions of the mask, I just want column x to be all NaN.

df = pd.DataFrame({'a': [1000, 11230, 12300, 10000, 1000, 10000, 100000], 'b': [1000, 11123, 1123, 0, 55, 0, 1], 'c': ['a', 'b', 'c', 'd', 'e', 'f', 'g']})
df = pd.DataFrame({'a': [1, 1, 1, -1, -1, -1, -1], 'b': [1000, 11123, 1123, 0, 55, 0, 1], 'c': ['a', 'b', 'c', 'd', 'e', 'f', 'g']})
df = pd.DataFrame({'a': [-1, -1, -1, -1, -1, -1, 100000], 'b': [1000, 11123, 1123, 0, 55, 0, 1], 'c': ['a', 'b', 'c', 'd', 'e', 'f', 'g']})
Amir asked Sep 03 '25 17:09


2 Answers

You can build a mask that is True only on the row immediately after the first row where a is greater than b:

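# Shift the condition down one row: True now marks the row after each match.
mask = (df.a > df.b).shift(fill_value=False)
# Keep only the first True: drop rows that already have a True above them.
mask = mask & ~mask.cumsum().shift().astype(bool)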

You can then use that mask to set the value of x equal to c:

df.loc[mask, 'x'] = df['c']

Output for each of your dfs:

      a      b  c    x
0   100   1000  a  NaN
1  1123  11123  b  NaN
2   123   1123  c  NaN
3   100      0  d  NaN
4     1     55  e    e
5     0      0  f  NaN
6     1      1  g  NaN

        a      b  c    x
0    1000   1000  a  NaN
1   11230  11123  b  NaN
2   12300   1123  c    c
3   10000      0  d  NaN
4    1000     55  e  NaN
5   10000      0  f  NaN
6  100000      1  g  NaN

   a      b  c    x
0  1   1000  a  NaN
1  1  11123  b  NaN
2  1   1123  c  NaN
3 -1      0  d  NaN
4 -1     55  e  NaN
5 -1      0  f  NaN
6 -1      1  g  NaN

        a      b  c    x
0      -1   1000  a  NaN
1      -1  11123  b  NaN
2      -1   1123  c  NaN
3      -1      0  d  NaN
4      -1     55  e  NaN
5      -1      0  f  NaN
6  100000      1  g  NaN
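
To see why the two-step mask works, it can help to look at the intermediate values. The following is a minimal sketch on the first example df. The names shifted, hits_before and steps are illustrative and not part of the answer, and the sketch uses shift(fill_value=0) with eq(0) instead of shift().astype(bool), an assumption made here only to keep the helper column free of NaN.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'a': [100, 1123, 123, 100, 1, 0, 1],
        'b': [1000, 11123, 1123, 0, 55, 0, 1],
        'c': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    }
)

# Step 1: move every match one row down, so True marks "the row after a match".
shifted = (df.a > df.b).shift(fill_value=False)

# Step 2: count how many shifted matches occur strictly above each row.
# fill_value=0 is an assumption of this sketch (avoids NaN in the first row).
hits_before = shifted.cumsum().shift(fill_value=0)

# Step 3: keep only the shifted match that has no shifted match above it.
mask = shifted & hits_before.eq(0)

# Side-by-side view of the intermediate columns (illustrative only).
steps = pd.DataFrame({'shifted': shifted, 'hits_before': hits_before, 'mask': mask})
print(steps)

df.loc[mask, 'x'] = df['c']
print(df)
```

On this df only row 4 ends up True in mask, so x receives e there and stays NaN everywhere else.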

More generally, you can use cummax and shift by N to select the N rows that follow the first match:

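N = 3
# True from the row after the first match through the end of the frame.
mask = (df.a > df.b).shift(fill_value=False).cummax()
# Switch the mask off again once N rows have been selected.
mask = mask & ~mask.cumsum().shift(N, fill_value=0).astype(bool)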
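
The same idea could be wrapped in a small helper so that N is a parameter. This is only a sketch: the function name rows_after_first_hit and its signature are hypothetical, not something pandas provides or the answer defines.

```python
import pandas as pd


def rows_after_first_hit(cond: pd.Series, n: int = 1) -> pd.Series:
    """Boolean mask for the n rows that follow the first True in cond."""
    # True from the row after the first match through the end of the Series.
    after = cond.shift(fill_value=False).cummax()
    # Switch the mask off once n rows have already been selected.
    return after & ~after.cumsum().shift(n, fill_value=0).astype(bool)


df = pd.DataFrame(
    {
        'a': [100, 1123, 123, 100, 1, 0, 1],
        'b': [1000, 11123, 1123, 0, 55, 0, 1],
        'c': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    }
)

# n=1 reproduces the single-row case from the question.
mask = rows_after_first_hit(df.a > df.b, n=1)
df.loc[mask, 'x'] = df['c']
print(df)
```

If the condition is never True, or is first True on the last row, the helper returns an all-False mask.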
Nick answered Sep 05 '25 05:09


A generic approach is to chain cummax and cumsum, which numbers the rows starting from the first match, and then select the desired position(s) with between:

N = 3
mask = ((df.a > df.b)
        .cummax().cumsum()
        .between(2, N+1)
        )

df.loc[mask, 'x'] = df.loc[mask, 'c']

Output for N = 3 (use N = 1 to select only the row after the first match):

      a      b  c    x
0   100   1000  a  NaN
1  1123  11123  b  NaN
2   123   1123  c  NaN
3   100      0  d  NaN
4     1     55  e    e
5     0      0  f    f
6     1      1  g    g

Intermediates:

      a      b  c    x  cumsum   mask
0   100   1000  a  NaN       0  False
1  1123  11123  b  NaN       0  False
2   123   1123  c  NaN       0  False
3   100      0  d  NaN       1  False
4     1     55  e    e       2   True
5     0      0  f    f       3   True
6     1      1  g    g       4   True
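
For the single-row case asked about in the question, the same chain can be run with N = 1. This is a minimal sketch on the first example df; the intermediate name counter is illustrative and does not appear in the answer.

```python
import pandas as pd

df = pd.DataFrame(
    {
        'a': [100, 1123, 123, 100, 1, 0, 1],
        'b': [1000, 11123, 1123, 0, 55, 0, 1],
        'c': ['a', 'b', 'c', 'd', 'e', 'f', 'g'],
    }
)

N = 1  # select only the one row after the first match

# cummax keeps the condition True from the first match onwards;
# cumsum then numbers those rows 1, 2, 3, ... (rows before the match stay 0).
counter = (df.a > df.b).cummax().cumsum()

# The first match has counter 1, so the N rows after it have 2 .. N+1.
mask = counter.between(2, N + 1)

df.loc[mask, 'x'] = df.loc[mask, 'c']
print(df)
```

Because between is inclusive on both ends by default, between(2, 2) picks exactly the row whose counter is 2, which is row 4 here.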
mozway answered Sep 05 '25 05:09