Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

python pandas groupby to identify rows

Tags:

python

pandas

I used to clean data using SAS but I would like to switch to Python.

I had a large dataset which was scrapped from some filings (html) but included some noisy information and I would like to get rid of these irrelevant data.

Basically, I need to remove certain rows of data after a row with condition being True (however, this might be a list, multiple True/or no True at all; and if there are Trues, I want to identify the last one).

Raw data:

<table>
  <tr>
    <td>Report_ID</td>
    <td>Table_ID</td>
    <td>Group_ID</td>
    <td>Item_ID</td>
    <td>Flag_old</td>
  </tr>
  <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item1</td>
    <td>0</td>
  </tr>
  <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item2</td>
    <td>0</td>
  </tr>
  <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item3</td>
    <td>1</td>
  </tr>
  <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item4</td>
    <td>0</td>
  </tr>
  <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item5</td>
    <td>0</td>
  </tr>
  <tr>
    <td>A</td>
    <td>1</td>
    <td>2</td>
    <td>item1</td>
    <td>1</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>2</td>
    <td>item2</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>2</td>
    <td>item3</td>
    <td>1</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>2</td>
    <td>item4</td>
    <td>0</td>
  </tr>
        <tr>
    <td>A</td>
    <td>1</td>
    <td>3</td>
    <td>item1</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>3</td>
    <td>item2</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>3</td>
    <td>item3</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>3</td>
    <td>item4</td>
    <td>0</td>
  </tr>
</table>

Expected data:

<table>
  <tr>
    <td>Report_ID</td>
    <td>Table_ID</td>
    <td>Group_ID</td>
    <td>Item_ID</td>
    <td>Flag_old</td>
    <td>Flag_new</td>
  </tr>
  <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item1</td>
    <td>0</td>
    <td>0</td>    
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item2</td>
    <td>0</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item3</td>
    <td>1</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item4</td>
    <td>0</td>
    <td>1</td>
    </tr>
        <tr>
    <td>A</td>
    <td>1</td>
    <td>1</td>
    <td>item5</td>
    <td>0</td>
    <td>1</td>
    </tr>
  <tr>
    <td>A</td>
    <td>1</td>
    <td>2</td>
    <td>item1</td>
    <td>1</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>2</td>
    <td>item2</td>
    <td>0</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>2</td>
    <td>item3</td>
    <td>1</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>2</td>
    <td>item4</td>
    <td>0</td>
    <td>1</td>
  </tr>
        <tr>
    <td>A</td>
    <td>1</td>
    <td>3</td>
    <td>item1</td>
    <td>0</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>3</td>
    <td>item2</td>
    <td>0</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>3</td>
    <td>item3</td>
    <td>0</td>
    <td>0</td>
  </tr>
    <tr>
    <td>A</td>
    <td>1</td>
    <td>3</td>
    <td>item4</td>
    <td>0</td>
    <td>0</td>
  </tr>
</table>

As you can see from the above, I wanted to identify rows below the rows with condition Flag_old == 1.

Given the structure of the data, I have firstly use groupby to segment my whole dataframe and I was thinking to define a function to select rows and apply the function to the dataframe groupby object and then of course create a new column for the whole dataframe indicating these rows of noisy data.

def lastline(series):
    return max(series[series.values == 1].index)

df['lastline'] = df.groupby('id').apply(lastline(df['flag']))

but I got 'int' object is not callable error.

Could you please advice me how to do this properly? I have been struggling with this for few days now...Many thanks.

like image 890
Jin Avatar asked Nov 22 '25 17:11

Jin


1 Answers

I think you need custom function with transform for return new column:

def f(x):
    #get cumulative sum, shift
    a = x.cumsum().shift()
    #check max value of cumsumed a and chain condition for remove 0 only groups
    #convert Trues to 1 by astype
    return ((a == a.max()) & (a != 0)).astype(int)

df['Flag_new'] = df.groupby('Group_ID')['Flag_old'].transform(f)
print (df)
   Report_ID  Table_ID  Group_ID Item_ID  Flag_old  Flag_new
0          A         1         1   item1         0         0
1          A         1         1   item2         0         0
2          A         1         1   item3         1         0
3          A         1         1   item4         0         1
4          A         1         1   item5         0         1
5          A         1         2   item1         1         0
6          A         1         2   item2         0         0
7          A         1         2   item3         1         0
8          A         1         2   item4         0         1
9          A         1         3   item1         0         0
10         A         1         3   item2         0         0
11         A         1         3   item3         0         0
12         A         1         3   item4         0         0
like image 119
jezrael Avatar answered Nov 24 '25 06:11

jezrael



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!