
Backfill a pandas DataFrame column using a condition

I have a pandas DataFrame with 50 million records, and I am trying to backfill a column based on a condition. The timestamps for the names 800A and Barber line up, so I assume the data belongs to the same name and the mismatch is just an error made while recording the data. The same goes for the name Mia.

This is just sample data; my DataFrame looks like this:

datetime name dischargeDate HR Sp x_inc vs_inc rec_num
01-05 18:04:50 Zawisza 14-01-05 18:05:00 119 98 FALSE TRUE 6458445
01-05 18:04:55 Zawisza 14-01-05 18:05:00 120 97 FALSE TRUE 6458445
01-05 18:05:00 Zawisza 14-01-05 18:05:00 FALSE FALSE
01-29 17:58:45 800A 14-01-29 17:59:10 FALSE FALSE
01-29 17:58:50 800A 14-01-29 17:59:10 139 FALSE TRUE
01-29 17:58:55 800A 14-01-29 17:59:10 138 FALSE TRUE
01-29 17:59:00 800A 14-01-29 17:59:10 138 96 FALSE TRUE
01-29 17:59:15 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:20 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:25 Barber 14-01-29 18:17:15 138 95 FALSE TRUE 7192783
03-04 21:19:45 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:53:10 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:55:50 800A 15-03-05 01:00:15 94 FALSE TRUE
03-05 00:55:55 800A 15-03-05 01:00:15 81 93 FALSE TRUE
03-05 00:56:00 800A 15-03-05 01:00:15 89 93 FALSE TRUE
03-05 01:00:20 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:25 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:30 Mia 15-03-05 04:13:15 70 94 FALSE TRUE 6728923

I want to backfill the record number column (rec_num), but only up to the rows where both x_inc and vs_inc are False. Those False/False rows should keep an empty rec_num.

Actual output:

datetime name dischargeDate HR Sp x_inc vs_inc rec_num
01-05 18:04:50 Zawisza 14-01-05 18:05:00 119 98 FALSE TRUE 6458445
01-05 18:04:55 Zawisza 14-01-05 18:05:00 120 97 FALSE TRUE 6458445
01-05 18:05:00 Zawisza 14-01-05 18:05:00 FALSE FALSE 7192783
01-29 17:58:45 800A 14-01-29 17:59:10 FALSE FALSE 7192783
01-29 17:58:50 800A 14-01-29 17:59:10 139 FALSE TRUE 7192783
01-29 17:58:55 800A 14-01-29 17:59:10 138 FALSE TRUE 7192783
01-29 17:59:00 800A 14-01-29 17:59:10 138 96 FALSE TRUE 7192783
01-29 17:59:15 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:20 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:25 Barber 14-01-29 18:17:15 138 95 FALSE TRUE 7192783
03-04 21:19:45 800A 15-03-05 01:00:15 FALSE FALSE 6728923
03-05 00:53:10 800A 15-03-05 01:00:15 FALSE FALSE 6728923
03-05 00:55:50 800A 15-03-05 01:00:15 94 FALSE TRUE 6728923
03-05 00:55:55 800A 15-03-05 01:00:15 81 93 FALSE TRUE 6728923
03-05 00:56:00 800A 15-03-05 01:00:15 89 93 FALSE TRUE 6728923
03-05 01:00:20 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:25 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:30 Mia 15-03-05 04:13:15 70 94 FALSE TRUE 6728923

Expected output:

datetime name dischargeDate HR Sp x_inc vs_inc rec_num
01-05 18:04:50 Zawisza 14-01-05 18:05:00 119 98 FALSE TRUE 6458445
01-05 18:04:55 Zawisza 14-01-05 18:05:00 120 97 FALSE TRUE 6458445
01-05 18:05:00 Zawisza 14-01-05 18:05:00 FALSE FALSE
01-29 17:58:45 800A 14-01-29 17:59:10 FALSE FALSE
01-29 17:58:50 800A 14-01-29 17:59:10 139 FALSE TRUE 7192783
01-29 17:58:55 800A 14-01-29 17:59:10 138 FALSE TRUE 7192783
01-29 17:59:00 800A 14-01-29 17:59:10 138 96 FALSE TRUE 7192783
01-29 17:59:15 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:20 Barber 14-01-29 18:17:15 138 96 FALSE TRUE 7192783
01-29 17:59:25 Barber 14-01-29 18:17:15 138 95 FALSE TRUE 7192783
03-04 21:19:45 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:53:10 800A 15-03-05 01:00:15 FALSE FALSE
03-05 00:55:50 800A 15-03-05 01:00:15 94 FALSE TRUE 6728923
03-05 00:55:55 800A 15-03-05 01:00:15 81 93 FALSE TRUE 6728923
03-05 00:56:00 800A 15-03-05 01:00:15 89 93 FALSE TRUE 6728923
03-05 01:00:20 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:25 Mia 15-03-05 04:13:15 70 93 FALSE TRUE 6728923
03-05 01:00:30 Mia 15-03-05 04:13:15 70 94 FALSE TRUE 6728923

I am using df['rec_num'].fillna(method='bfill'), but it fills every missing value, including the False/False rows, which is not what I want. I would appreciate any suggestions for this problem, or a better approach if there is one. Thanks in advance.
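To make the problem easy to reproduce, here is a minimal sketch on a reduced, hypothetical version of the sample (only name, the two flags, and rec_num; the row selection is an assumption made for brevity). It uses Series.bfill(), which should behave the same as fillna(method='bfill'); recent pandas releases deprecate the method= argument in favour of bfill().

```python
import numpy as np
import pandas as pd

# Reduced, hypothetical version of the sample above: only the columns that matter here.
df = pd.DataFrame({
    "name":    ["Zawisza", "Zawisza", "800A", "800A", "800A", "Barber", "Barber"],
    "x_inc":   [False, False, False, False, False, False, False],
    "vs_inc":  [True,  False, False, True,  True,  True,  True],
    "rec_num": [6458445, np.nan, np.nan, np.nan, np.nan, 7192783, 7192783],
})

# Plain backfill: every gap is filled, including the rows where both flags are False.
filled = df["rec_num"].bfill()
print(df.assign(filled=filled))
```

Rows 1 and 2 (both flags False) end up with 7192783, which is exactly the unwanted behaviour described above.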

Abalan Musk asked Oct 27 '25 06:10


1 Answer

You can do this with a boolean mask and np.where():

import numpy as np
cond = (df.x_inc == False) & (df.vs_inc == False)  # boolean mask: True where both columns are False
df['new_rec'] = np.where(~cond, df.rec_num.bfill(), df.rec_num)  # backfilled value where the mask is False, original value otherwise
print(df)

Note: you can assign the result back to the existing rec_num column instead of creating a new one; I added new_rec so you can compare the two. This should also be fast, since the whole operation is vectorized.
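As a self-contained illustration of the note above, the following sketch runs the same idea on a small hypothetical frame (the six rows are an assumption, not the full sample), shows a pandas-only spelling with Series.where that should give the same result, and then overwrites rec_num in place.

```python
import numpy as np
import pandas as pd

# Hypothetical reduced sample: flags and rec_num only.
df = pd.DataFrame({
    "x_inc":   [False, False, False, False, False, False],
    "vs_inc":  [True,  False, False, True,  True,  True],
    "rec_num": [6458445, np.nan, np.nan, np.nan, np.nan, 7192783],
})

# True on the rows that must keep their original (empty) value.
cond = ~df["x_inc"] & ~df["vs_inc"]

# np.where version, as in the answer, wrapped back into a Series.
via_numpy = pd.Series(
    np.where(~cond, df["rec_num"].bfill(), df["rec_num"]), index=df.index
)

# pandas-only spelling: keep the backfilled value where ~cond holds,
# otherwise fall back to the original rec_num.
via_pandas = df["rec_num"].bfill().where(~cond, df["rec_num"])

# Overwrite the original column once the result looks right.
df["rec_num"] = via_pandas
print(df)
```

Both spellings are single vectorized passes over the column, so the choice between them is mostly a matter of taste.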

    datetime            name    dischargeDate       HR      Sp      x_inc   vs_inc  rec_num     new_rec
0   2019-05-01 18:04:50 Zawisza 2005-01-14 18:05:00 119.0   98.0    False   True    6458445.0   6458445.0
1   2019-05-01 18:04:55 Zawisza 2005-01-14 18:05:00 120.0   97.0    False   True    6458445.0   6458445.0
2   2019-05-01 18:05:00 Zawisza 2005-01-14 18:05:00 NaN     NaN     False   False   NaN         NaN
3   2029-01-01 17:58:45 800A    2029-01-14 17:59:10 NaN     NaN     False   False   NaN         NaN
4   2029-01-01 17:58:50 800A    2029-01-14 17:59:10 139.0   NaN     False   True    NaN         7192783.0
5   2029-01-01 17:58:55 800A    2029-01-14 17:59:10 138.0   NaN     False   True    NaN         7192783.0
...
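One caveat worth checking on real data: the backfill above is computed over the whole column, so a missing rec_num that sits before a False/False row can still receive a value from beyond it. The sample does not contain such a row, so this may not matter. If the False/False rows are meant to act as hard barriers, a possible variant (an assumption about the intent, not part of the answer) is to backfill within blocks delimited by those rows:

```python
import numpy as np
import pandas as pd

# Hypothetical frame where row 0 is missing a rec_num *before* a False/False row.
df = pd.DataFrame({
    "x_inc":   [False, False, False, False],
    "vs_inc":  [True,  False, True,  True],
    "rec_num": [np.nan, np.nan, np.nan, 7192783],
})

cond = ~df["x_inc"] & ~df["vs_inc"]

# Global backfill plus mask: row 0 is filled from beyond the False/False row.
global_fill = df["rec_num"].bfill().where(~cond, df["rec_num"])

# Barrier-aware: each False/False row starts a new block (cumulative count of
# barrier rows), and the backfill is confined to its own block.
blocked = (
    df["rec_num"]
    .groupby(cond.cumsum())
    .bfill()
    .where(~cond, df["rec_num"])
)

print(pd.DataFrame({"global": global_fill, "blocked": blocked}))
```

Here the global version fills row 0 with 7192783, while the blocked version leaves it empty.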
anky answered Oct 29 '25 09:10


