Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Similarity of two sets of intervals

What sort of algorithm/solution could be used to indicate the similarity (overlap/precision/recall/...) of two sets of ranges.

I can think of (or find online) hundreds of similar problems but never exact, but surely this "wheel" must have been invented already...

Lets say that the input data is something like:

Real      [ ## ###  #     ] or [(1,2),(4,6),(9,10)]  
Predicted [ ## #          ] or [(1,2),(4,4)]  

Output should be ~50 %

Should I for example AND bitmaps, use interval trees or what? Is there a nice functional or simple to write algorithm? Any meaningful measure of similarity will do, and so will any reasonable input format.

Thank you.

(realistic length ~4000 with < 50 intervals in each set)


1 Answers

You could break the segments to individual points, and tag each point as real/predicted and start/end.

Then sort the points, go over the sorted list and keep track of overlaps.

You don't even need to track whether the interval was originally from Real or Predicted - you just need to keep track whether there's one or two intervals at each point.

Example:

Real      [(1,2),(4,6),(9,10)]  
Predicted [(1,2),(4,4)]  

Broken down into points and sorted (S for Start, E for End):

[(1,S),(1,S),(2,E),(2,E),(4,S),(4,S),(4,E),(6,E),(9,S),(10,E)]

Then going through the array - keep track of how many segments "are open" and count the length of total open and 2 segments open.

The result is 2 segments open/total open.

like image 96
Alex Libov Avatar answered Sep 09 '25 23:09

Alex Libov