In an exercise, I was asked to merge 3 DataFrames with an inner join (df1 + df2 + df3 = mergedDf). A follow-up question then asked how many entries were lost in this 3-way merge.
import pandas as pd
#DataFrame1
df1 = pd.DataFrame(columns=["Goals","Medals"],data=[[5,2],[1,0],[3,1]])
df1.index = ['Argentina','Angola','Bolivia']
print(df1)
           Goals  Medals
Argentina      5       2
Angola         1       0
Bolivia        3       1
#DataFrame2
df2 = pd.DataFrame(columns=["Dates","Medals"],data=[[1,0],[2,1],[2,2]])
df2.index = ['Venezuela','Africa','Argentina']
print(df2)
           Dates  Medals
Venezuela      1       0
Africa         2       1
Argentina      2       2
#DataFrame3
df3 = pd.DataFrame(columns=["Players","Goals"],data=[[11,5],[11,1],[10,0]])
df3.index = ['Argentina','Australia','Belgica']
print(df3)
           Players  Goals
Argentina       11      5
Australia       11      1
Belgica         10      0
#mergedDf
mergedDf = pd.merge(df1,df2,how='inner',left_index=True, right_index=True)
mergedDf = pd.merge(mergedDf,df3,how='inner',left_index=True, right_index=True)
print(mergedDf)
           Goals_x  Medals_x  Dates  Medals_y  Players  Goals_y
Argentina        5         2      2         2       11        5
#Calculate number of lost entries by code
I tried merging everything with an outer join and then subtracting mergedDf, but I don't know how to do this. Can anyone help?

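I've found a simple but effective solution: merge once with an inner join and once with an outer join, then subtract the row counts (Df1(), Df2() and Df3() stand for whatever builds the three DataFrames, and lost_entries is just an illustrative name):
def lost_entries():
    df1 = Df1()
    df2 = Df2()
    df3 = Df3()
    # rows whose key is present in all three DataFrames
    inner = pd.merge(pd.merge(df1,df2,on='<Common column>',how='inner'),df3,on='<Common column>',how='inner')
    # rows whose key is present in at least one DataFrame
    outer = pd.merge(pd.merge(df1,df2,on='<Common column>',how='outer'),df3,on='<Common column>',how='outer')
    return len(outer) - len(inner)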
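For the index-keyed DataFrames from the question, the same outer-minus-inner idea can be sketched with left_index/right_index instead of on. This is only a minimal sketch under that assumption, not the answerer's exact code; the helper name merge_all is made up for illustration.

```python
from functools import reduce

import pandas as pd

# The three DataFrames from the question, keyed by country name in the index.
df1 = pd.DataFrame(columns=["Goals", "Medals"], data=[[5, 2], [1, 0], [3, 1]],
                   index=["Argentina", "Angola", "Bolivia"])
df2 = pd.DataFrame(columns=["Dates", "Medals"], data=[[1, 0], [2, 1], [2, 2]],
                   index=["Venezuela", "Africa", "Argentina"])
df3 = pd.DataFrame(columns=["Players", "Goals"], data=[[11, 5], [11, 1], [10, 0]],
                   index=["Argentina", "Australia", "Belgica"])


def merge_all(dfs, how):
    # Hypothetical helper: chain index-on-index merges over a list of DataFrames.
    return reduce(
        lambda left, right: pd.merge(left, right, how=how,
                                     left_index=True, right_index=True),
        dfs,
    )


inner = merge_all([df1, df2, df3], "inner")  # keys present in every DataFrame
outer = merge_all([df1, df2, df3], "outer")  # keys present in any DataFrame
lost = len(outer) - len(inner)
print(lost)
```

With this data the outer merge has 7 rows and the inner merge has 1, so the script prints 6.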
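Solution using an outer join with the indicator parameter: after both merges, count the rows that are not matched in all three DataFrames, i.e. rows not flagged as both in the indicator columns a and b. Summing a boolean mask counts the True values, because each True is treated as 1: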
mergedDf = pd.merge(df1,df2,how='outer',left_index=True, right_index=True, indicator='a')
mergedDf = pd.merge(mergedDf,df3,how='outer',left_index=True, right_index=True, indicator='b')
print(mergedDf)
           Goals_x  Medals_x  Dates  Medals_y           a  Players  Goals_y           b
Africa         NaN       NaN    2.0       1.0  right_only      NaN      NaN   left_only
Angola         1.0       0.0    NaN       NaN   left_only      NaN      NaN   left_only
Argentina      5.0       2.0    2.0       2.0        both     11.0      5.0        both
Australia      NaN       NaN    NaN       NaN         NaN     11.0      1.0  right_only
Belgica        NaN       NaN    NaN       NaN         NaN     10.0      0.0  right_only
Bolivia        3.0       1.0    NaN       NaN   left_only      NaN      NaN   left_only
Venezuela      NaN       NaN    1.0       0.0  right_only      NaN      NaN   left_only
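# a row survives the inner join only if both indicators are 'both',
# so it is lost when either one is not (| rather than &)
missing = ((mergedDf['a'] != 'both') | (mergedDf['b'] != 'both')).sum()
print(missing)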
6
Another solution is to use the inner join and, for each DataFrame, count the index labels that are not in mergedDf.index, then sum those counts:
mergedDf = pd.merge(df1,df2,how='inner',left_index=True, right_index=True)
mergedDf = pd.merge(mergedDf,df3,how='inner',left_index=True, right_index=True)
vals = mergedDf.index
print (vals)
Index(['Argentina'], dtype='object')
dfs = [df1, df2, df3]
missing = sum((~x.index.isin(vals)).sum() for x in dfs)
print (missing)
6
Another solution, if the values in each index are unique, uses set operations:
dfs = [df1, df2, df3]
L = [set(x.index) for x in dfs]
#https://stackoverflow.com/a/25324329/2901002
missing = len(set.union(*L) - set.intersection(*L))
print (missing)
6
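The uniqueness condition matters because a set collapses repeated labels, while the isin-based count works row by row. A small sketch with made-up data (the frames a, b, c and the repeated 'Angola' label are hypothetical, not from the question) shows the two counts diverging:

```python
import pandas as pd

# Hypothetical frames: 'Angola' appears twice in the first index.
a = pd.DataFrame({"Goals": [5, 1, 4]}, index=["Argentina", "Angola", "Angola"])
b = pd.DataFrame({"Dates": [2, 2]}, index=["Argentina", "Africa"])
c = pd.DataFrame({"Players": [11]}, index=["Argentina"])
dfs = [a, b, c]

label_sets = [set(x.index) for x in dfs]
common = set.intersection(*label_sets)  # labels present in every frame

# Set-based count: distinct labels missing from at least one frame.
lost_labels = len(set.union(*label_sets) - common)

# Row-based count: rows in each frame whose label is not shared by all frames.
lost_rows = sum((~x.index.isin(list(common))).sum() for x in dfs)

print(lost_labels, lost_rows)
```

Here the set-based count is 2 (Angola, Africa) while the row-based count is 3, since both 'Angola' rows are counted. Which number is the "right" one depends on whether the exercise counts labels or rows.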