I have two DataFrames: a large one with a lot of missing values, and a second one holding the data that should fill the gaps in the first.
Dataframe examples:
In[34]:
import pandas as pd
import numpy as np
df2 = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2], 'B': [1, 0, 1, 1, 0, 0]})
df = pd.DataFrame({'A': [0, 1, 2, 3, 4, 5], 'B1': [1, np.nan, np.nan, 8, 9, 1],'B2':[1, np.nan, np.nan, 7, 6, 1], 'B3':[1, np.nan, np.nan, 8, 7, 1] })
df=df.set_index(['A'])
df2=df2.set_index(['A'])
In[35]:
df
Out[35]:
    B1  B2  B3
A
0    1   1   1
1  NaN NaN NaN
2  NaN NaN NaN
3    8   7   8
4    9   6   7
5    1   1   1
In[36]:
df2
Out[36]:
   B
A
1  1
1  0
1  1
2  1
2  0
2  0
What I want to do is fill df using the data from df2, taking into account that repeated index values in df2 map to successive columns: the first instance of an index fills B1, the second fills B2, and the third fills B3. The desired output is:
In[38]:
df
Out[38]:
   B1  B2  B3
A
0   1   1   1
1   1   0   1
2   1   0   0
3   8   7   8
4   9   6   7
5   1   1   1
The NaNs in B1, B2 and B3 for index 1 and index 2 have been filled with the data from df2: 1 0 1 for index 1 and 1 0 0 for index 2. Below is my inefficient for-loop implementation:
In[37]:
count = 1
seen = []
for t in range(0, len(df2)):
    if df2.index[t] not in seen:
        count = 1
        seen.append(df2.index[t])
    else:
        count = count + 1
    tofill = pd.DataFrame(df2.iloc[t]).transpose()
    tofill_dict = {"B" + str(count): tofill.B}
    df = df.fillna(value=tofill_dict)
This works, but on a larger dataset it takes a significant amount of time. Is there a faster way to do this? I have heard that vectorization could help; how would you implement it? Are there any other ways to speed this up?
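First, there is no need to reset the index of df2; it can stay indexed by A.
You can try groupby on that index: each group is transposed with T into a single row, and the NaNs in df are then filled with fillna using the reshaped df2: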
import pandas as pd
import numpy as np
df2 = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2], 'B': [1, 0, 1, 1, 0, 0]})
df = pd.DataFrame({'A': [0, 1, 2, 3, 4, 5], 'B1': [1, np.nan, np.nan, 8, 9, 1],'B2':[1, np.nan, np.nan, 7, 6, 1], 'B3':[1, np.nan, np.nan, 8, 7, 1] })
df=df.set_index(['A'])
df2=df2.set_index(['A'])
print(df)
    B1  B2  B3
A
0    1   1   1
1  NaN NaN NaN
2  NaN NaN NaN
3    8   7   8
4    9   6   7
5    1   1   1
print(df2)
   B
A
1  1
1  0
1  1
2  1
2  0
2  0
df2 = df2.groupby(df2.index).apply(lambda x: x.B.reset_index(drop=True).T)
df2.columns = df.columns
print(df2)
   B1  B2  B3
A
1   1   0   1
2   1   0   0
df = df.fillna(df2)
print(df)
   B1  B2  B3
A
0   1   1   1
1   1   0   1
2   1   0   0
3   8   7   8
4   9   6   7
5   1   1   1
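As a variation on the reshaping step, the position of each row within its A group can be numbered with cumcount and then pivoted into columns, which avoids calling a Python function once per group. This is only a sketch on the example data from the question; the names tmp, col, wide and filled are made up for illustration, and it assumes the target columns are named B1, B2, ... in order of appearance.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'A': [0, 1, 2, 3, 4, 5],
                   'B1': [1, np.nan, np.nan, 8, 9, 1],
                   'B2': [1, np.nan, np.nan, 7, 6, 1],
                   'B3': [1, np.nan, np.nan, 8, 7, 1]}).set_index('A')
df2 = pd.DataFrame({'A': [1, 1, 1, 2, 2, 2],
                    'B': [1, 0, 1, 1, 0, 0]}).set_index('A')

# Bring A back as a column so it can be used for grouping and pivoting.
tmp = df2.reset_index()

# cumcount numbers the rows inside each A group as 0, 1, 2, ...;
# adding 1 and prefixing 'B' gives the target column labels B1, B2, B3.
tmp['col'] = 'B' + (tmp.groupby('A').cumcount() + 1).astype(str)

# One row per A value, one column per target label.
wide = tmp.pivot(index='A', columns='col', values='B')

# fillna with a DataFrame aligns on both index and column labels,
# so only the NaN cells of df that have a match in wide are filled.
filled = df.fillna(wide)
print(filled)
```

Because fillna matches on labels, the order of the columns in wide does not matter, only their names.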
If df = df.fillna(df2) does not work, you can use df = df.combine_first(df2) instead. Which one is appropriate depends on the index.
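To illustrate how the result depends on the index, here is a small sketch with made-up frames (filler and the label 9 are hypothetical and not part of the question's data). fillna keeps only the rows that already exist in df, while combine_first returns the union of both indexes.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'B1': [1, np.nan], 'B2': [1, np.nan]},
                  index=pd.Index([0, 1], name='A'))

# Hypothetical filler frame: label 1 exists in df, label 9 does not.
filler = pd.DataFrame({'B1': [1, 5], 'B2': [0, 6]},
                      index=pd.Index([1, 9], name='A'))

# fillna: same shape as df, NaNs at label 1 are filled, label 9 is ignored.
via_fillna = df.fillna(filler)

# combine_first: NaNs at label 1 are filled and label 9 is added as a new row.
via_combine = df.combine_first(filler)

print(via_fillna)
print(via_combine)
```

If df2 can contain A values that are missing from df and those rows should not appear in the result, fillna is the safer choice of the two.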