I have a dataframe as below:
import pandas as pd
import numpy as np
df=pd.DataFrame({'id':[0,1,2,4,5],
'A':[0,1,0,1,0],
'B':[None,None,1,None,None]})
id A B
0 0 0 NaN
1 1 1 NaN
2 2 0 1.0
3 4 1 NaN
4 5 0 NaN
Notice that the vast majority of value in B column is NaN
id column increment by 1,so one row between id 2 and 4 is missing.
The missing row which need insert is the same as the previous row, except for id column.
So for example the result is
id A B
0 0 0.0 NaN
1 1 1.0 NaN
2 2 0.0 1.0
3 3 0.0 1.0 <-add row here
4 4 1.0 NaN
5 5 0.0 NaN
I can do this on A column,but I don't know how to deal with B column as ffill will fill 1.0 at row 4 and 5,which is incorrect
step=1
idx=np.arange(df['id'].min(), df['id'].max() + step, step)
df=df.set_index('id').reindex(idx).reset_index()
df['A']=df["A"].ffill()
EDIT:
sorry,I forget one sutiation.
B column will have different values.
When DataFrame is as below:
id A B
0 0 0 NaN
1 1 1 NaN
2 2 0 1.0
3 4 1 NaN
4 5 0 NaN
5 6 1 2.0
6 9 0 NaN
7 10 1 NaN
the result would be:
id A B
0 0 0 NaN
1 1 1 NaN
2 2 0 1.0
3 3 0 1.0
4 4 1 NaN
5 5 0 NaN
6 6 1 2.0
7 7 1 2.0
8 8 1 2.0
9 9 0 NaN
10 10 1 NaN
Do the changes keep the original id , and with update isin
s=df.id.copy() #change 1
step=1
idx=np.arange(df['id'].min(), df['id'].max() + step, step)
df=df.set_index('id').reindex(idx).reset_index()
df['A']=df["A"].ffill()
df.B.update(df.B.ffill().mask(df.id.isin(s))) # change two
df
id A B
0 0 0.0 NaN
1 1 1.0 NaN
2 2 0.0 1.0
3 3 0.0 1.0
4 4 1.0 NaN
5 5 0.0 NaN
If I understand in the right way, here are some sample code.
new_df = pd.DataFrame({
'new_id': [i for i in range(df['id'].max() + 1)],
})
df = df.merge(new_df, how='outer', left_on='id', right_on='new_id')
df = df.sort_values('new_id')
df = df.ffill()
df = df.drop(columns='id')
df
A B new_id
0 0.0 NaN 0
1 1.0 NaN 1
2 0.0 1.0 2
5 0.0 1.0 3
3 1.0 1.0 4
4 0.0 1.0 5
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With