I have a dataframe with start_dtm
and end_dtm
.
I need to flag all rows that have a [start_dtm,end_dtm] overlapping with any other row, and for each of the rows that have an overlap, also the index of the overlapping rows.
I do that with a for loop, but that is extremely inefficient.
def do_overlap(b,a):
return a['start_dtm']<=b['end_dtm'] and a['end_dtm']>=b['start_dtm']
for _,r1 in df.iterrows():
for _,r2 in df.iterrows():
do_overlap(r1,r2)
Sample data:
start_dtm end_dtm
1 2018-02-01 2018-02-01
8 2018-02-14 2018-02-15
9 2018-03-01 2018-03-01
10 2018-04-13 2018-04-13
11 2018-04-20 2018-04-20
... ... ...
944676 2018-08-09 2018-08-10
944677 2018-09-06 2018-09-11
944678 2018-10-16 2018-10-17
944679 2018-10-31 2018-11-02
944680 2018-11-28 2018-11-29
test data
df = pd.DataFrame({
"start":[0,3,4,8,9],
"end":[4,6,7,9,10],
})
solution
melt the dataframe and sort
melted=df.reset_index().melt(id_vars="index").sort_values(["value", "variable"])
melted
looks like this
index variable value
0 0 start 0
1 1 start 3
5 0 end 4
2 2 start 4
6 1 end 6
7 2 end 7
3 3 start 8
8 3 end 9
4 4 start 9
9 4 end 10
The index
column corresponds to original row number, variable
column indicates the type of endpoint and value
column is the value of the endpoint
Now take a look at this calculation
(pd.get_dummies(melted["index"]).transpose()*melted["variable"].map({"start":1, "end":-1})).transpose()
the result is
0 1 2 3 4
0 1 0 0 0 0
1 0 1 0 0 0
5 -1 0 0 0 0
2 0 0 1 0 0
6 0 -1 0 0 0
7 0 0 -1 0 0
3 0 0 0 1 0
8 0 0 0 -1 0
4 0 0 0 0 1
9 0 0 0 0 -1
Each column corresponds to an interval (row in the original dataframe). Values of 1 correspond to a start point, -1 corresponds to an endpoint. Combine with a cumulative sum and we get a 0-1 matrix where the 1s correspond to the span of the interval.
interval_matrix=(pd.get_dummies(melted["index"]).transpose()*melted["variable"].map({"start":1, "end":-1})).transpose().cumsum()
interval_matrix
looks like this
0 1 2 3 4
0 1 0 0 0 0
1 1 1 0 0 0
5 0 1 0 0 0
2 0 1 1 0 0
6 0 0 1 0 0
7 0 0 0 0 0
3 0 0 0 1 0
8 0 0 0 0 0
4 0 0 0 0 1
9 0 0 0 0 0
Now two intervals overlap if, and only if, one of the start points is contained within the other interval. We can use this to get subset this dataframe by start points only. If we imagine each interval as a node in a graph with edges between intervals that overlap then the next piece of code is calculating directed edges.
start_overlaps = interval_matrix.loc[melted["variable"] == "start"]
start_overlaps
looks like this
0 1 2 3 4
0 1 0 0 0 0
1 1 1 0 0 0
2 0 1 1 0 0
3 0 0 0 1 0
4 0 0 0 0 1
Lastly add this to its transpose to 'make the arrows bidirectional' and convert to boolean
overlaps = (start_overlaps + start_overlaps.transpose()) > 0
overlaps
looks like this
0 1 2 3 4
0 True True False False False
1 True True True False False
2 False True True False False
3 False False False True False
4 False False False False True
summary
melted=df.reset_index().melt(id_vars="index").sort_values(["value", "variable"])
interval_matrix=(pd.get_dummies(melted["index"]).transpose()*melted["variable"].map({"start":1, "end":-1})).transpose().cumsum()
start_overlaps = interval_matrix.loc[melted["variable"] == "start"]
overlaps = (start_overlaps + start_overlaps.transpose()) > 0
A slightly faster solution with numpy
(np)
melted=df.melt(ignore_index=False).sort_values(["value", "variable"])
interval_matrix=np.multiply(pd.get_dummies(melted.index).values, np.expand_dims(melted["variable"].map({"start":1, "end":-1}).values,1)).cumsum(axis=0)
start_overlaps = interval_matrix[melted["variable"] == "start", :]
pd.DataFrame(start_overlaps + start_overlaps.transpose() > 0)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With