The objective is to combine two df row wise, if a predetermine condition is met.
Specifically, if the difference between the column is less or equal than a threshold, then, join the row of the df.
Given two df: df1 and df2, the following code partially achieve the objective.
import pandas as pd
df1 = pd.DataFrame ( {'time': [2, 3, 4, 24, 31]} )
df2 = pd.DataFrame ( {'time': [4.1, 24.7, 31.4, 5]} )
th = 0.9
all_comb=[]
for index, row in df1.iterrows ():
for index2, row2 in df2.iterrows ():
diff = abs ( row ['time'] - row2 ['time'] )
if diff <= th:
all_comb.append({'idx_1':index,'time_1':row ['time'], 'idx_2':index2,'time_2':row2 ['time']})
df_all = pd.DataFrame(all_comb)
outputted
idx_1 time_1 idx_2 time_2
0 2 4 0 4.1
1 3 24 1 24.7
2 4 31 2 31.4
However, the above approach ignore certain information i.e., the value of 2 and 3 from the df1, and the value of 5 from df2.
The expected output should be something like
idx_1 time_1 idx_2 time_2
0 2 NA NA
1 3 NA NA
2 4 0 4.1
3 24 1 24.7
4 31 2 31.4
NA NA 3 5
Appreciate for any hint or any way that more compact and efficient than the proposed above.
You can perform a cross merge and then subset all the rows at once based on your condition. Then we concat, adding back any rows that had no conditions met from both DataFrames.
import pandas as pd
df1 = df1.reset_index().add_suffix('_1')
df2 = df2.reset_index().add_suffix('_2')
m = df1.merge(df2, how='cross')
# Subset to all matches: |time_diff| <= thresh
th = 0.9
m = m[(m['time_1'] - m['time_2']).abs().le(th)]
# Add back rows with no matches
res = pd.concat([df1[~df1.index_1.isin(m.index_1)],
m,
df2[~df2.index_2.isin(m.index_2)]], ignore_index=True)
print(res)
index_1 time_1 index_2 time_2
0 0.0 2.0 NaN NaN
1 1.0 3.0 NaN NaN
2 2.0 4.0 0.0 4.1
3 3.0 24.0 1.0 24.7
4 4.0 31.0 2.0 31.4
5 NaN NaN 3.0 5.0
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With