Efficiently find overlap between dates in pandas

Question

I have a dataframe with start_dtm and end_dtm.

I need to flag all rows that have a [start_dtm,end_dtm] overlapping with any other row, and for each of the rows that have an overlap, also the index of the overlapping rows.

I do that with a for loop, but that is extremely inefficient.

def do_overlap(b,a):
    return a['start_dtm']<=b['end_dtm'] and a['end_dtm']>=b['start_dtm']

for _,r1 in df.iterrows():
    for _,r2 in df.iterrows():
          do_overlap(r1,r2)

Sample data:

      start_dtm     end_dtm
1   2018-02-01  2018-02-01
8   2018-02-14  2018-02-15
9   2018-03-01  2018-03-01
10  2018-04-13  2018-04-13
11  2018-04-20  2018-04-20
... ... ...
944676  2018-08-09  2018-08-10
944677  2018-09-06  2018-09-11
944678  2018-10-16  2018-10-17
944679  2018-10-31  2018-11-02
944680  2018-11-28  2018-11-29

Riley · Accepted Answer

test data

df = pd.DataFrame({
    "start":[0,3,4,8,9],
    "end":[4,6,7,9,10],
})

solution

melt the dataframe and sort

 melted=df.reset_index().melt(id_vars="index").sort_values(["value", "variable"])

melted looks like this

   index variable  value
0      0    start      0
1      1    start      3
5      0      end      4
2      2    start      4
6      1      end      6
7      2      end      7
3      3    start      8
8      3      end      9
4      4    start      9
9      4      end     10

The index column corresponds to original row number, variable column indicates the type of endpoint and value column is the value of the endpoint

Now take a look at this calculation

    (pd.get_dummies(melted["index"]).transpose()*melted["variable"].map({"start":1, "end":-1})).transpose()

the result is

   0  1  2  3  4
0  1  0  0  0  0
1  0  1  0  0  0
5 -1  0  0  0  0
2  0  0  1  0  0
6  0 -1  0  0  0
7  0  0 -1  0  0
3  0  0  0  1  0
8  0  0  0 -1  0
4  0  0  0  0  1
9  0  0  0  0 -1

Each column corresponds to an interval (row in the original dataframe). Values of 1 correspond to a start point, -1 corresponds to an endpoint. Combine with a cumulative sum and we get a 0-1 matrix where the 1s correspond to the span of the interval.

interval_matrix=(pd.get_dummies(melted["index"]).transpose()*melted["variable"].map({"start":1, "end":-1})).transpose().cumsum()

interval_matrix looks like this

     0  1  2  3  4
  0  1  0  0  0  0
  1  1  1  0  0  0
  5  0  1  0  0  0
  2  0  1  1  0  0
  6  0  0  1  0  0
  7  0  0  0  0  0
  3  0  0  0  1  0
  8  0  0  0  0  0
  4  0  0  0  0  1
  9  0  0  0  0  0

Now two intervals overlap if, and only if, one of the start points is contained within the other interval. We can use this to get subset this dataframe by start points only. If we imagine each interval as a node in a graph with edges between intervals that overlap then the next piece of code is calculating directed edges.

start_overlaps = interval_matrix.loc[melted["variable"] == "start"]

start_overlaps looks like this

   0  1  2  3  4
0  1  0  0  0  0
1  1  1  0  0  0
2  0  1  1  0  0
3  0  0  0  1  0
4  0  0  0  0  1

Lastly add this to its transpose to 'make the arrows bidirectional' and convert to boolean

overlaps = (start_overlaps + start_overlaps.transpose()) > 0

overlaps looks like this

       0      1      2      3      4
0   True   True  False  False  False
1   True   True   True  False  False
2  False   True   True  False  False
3  False  False  False   True  False
4  False  False  False  False   True

summary

melted=df.reset_index().melt(id_vars="index").sort_values(["value", "variable"])
interval_matrix=(pd.get_dummies(melted["index"]).transpose()*melted["variable"].map({"start":1, "end":-1})).transpose().cumsum()
start_overlaps = interval_matrix.loc[melted["variable"] == "start"]
overlaps = (start_overlaps + start_overlaps.transpose()) > 0

A slightly faster solution with numpy (np)

melted=df.melt(ignore_index=False).sort_values(["value", "variable"])
interval_matrix=np.multiply(pd.get_dummies(melted.index).values, np.expand_dims(melted["variable"].map({"start":1, "end":-1}).values,1)).cumsum(axis=0)
start_overlaps = interval_matrix[melted["variable"] == "start", :]
pd.DataFrame(start_overlaps + start_overlaps.transpose() > 0)

Efficiently find overlap between dates in pandas

Tags:

python

date

pandas

overlap

000000

1 Answers

Riley

Recent Activity

Donate For Us

Efficiently find overlap between dates in pandas

Tags:

python

date

pandas

overlap

00__00__00

1 Answers

Riley

Related questions

Recent Activity

Donate For Us

000000