This is a graph theory problem on a real dataset. I have a dataset of roughly 100M points, of camera sightings of cars over time in a city. They can be thought of as { camera ID, license plate ID, timestamp } tuples roughly. There are around 2000 cameras around the cities, we know their locations, and around 10M unique cars (license plates) throughout the one-month period for which we have data. I am trying to determine how many unique paths there are, empirically, between arbitrary cameras (nodes). A 'path' here is just consecutive sightings of the same license plate by different cameras close in time.
The problem is that 1/3 of all camera sightings are missing: most cameras have 80% identification accuracy, and a few have 0-10%. The main question is: how do we tell whether two paths are really different, and not just one underlying path with some missing sightings?
For instance, consider some high volume route A->C. If there is just one way to get from A->C, say via B, then you will still see a lot of cars at A then directly at C because camera B just happened not to pick them up (approx. 1/3 of the total traffic will be AC just because of this missing data points fact), and so AC will appear to be a popular path as well as ABC, when really ABC is the only real path people take from A->C. The goal is to tell these paths apart so that we can count the number of unique paths A->C at the end of the day to study robustness of the traffic network to closures of roads.
One approach is to compare average times taken through routes, but two different routes through space can take very similar amounts of time. Here's how I solved the problem for a specific case: consider two routes from A to C: ABC and AB'C, where there is no edge B->B'. We know that this cannot in fact be one underlying route ABB'C because then we'd see an edge B->B', which we don't. I wonder if this heuristic can be generalized (but of course we still want to tell paths apart even if BB' was an edge, since in that case we are not guaranteed that ABB'C is the only true path taken).
I'm also open to using external APIs like Google Maps for more information but not quite sure how that would make the problem easier. I am aiming to solve this problem for arbitrarily long paths between two arbitrarily far cameras, but I think most such questions can be reduced to a version of the ABC vs AC problem outlined above.
Here is a pictorial example of an instance of the problem. Bottom left can be thought of as node A and top right as C. Then it becomes an AC vs ABC vs ABDC problem. The data shows that AC was taken 19 times, ABC 46 times and ABDC 86 times. The question is to figure out how many of these three paths is a "real path" that people travel through. Of course, on a map it is easier to count by hand based on the road network, but the question is doing this algorithmically based on the data at hand for all points without plotting.

Here is an example of the data. We split the data into groups that represent real car trips. So sightings of the same license plate by different cameras sufficiently close in time (on the same day of course). Here is an example of one group (unrelated to picture above), like which there are about 10M others.
index camera_ID encoded_plate date time_seconds velocity
9200 480301111 660.0 2021-03-11 62000.0 54
9201 480321111 660.0 2021-03-11 62135.0 28
9202 480331111 660.0 2021-03-11 62235.0 5
9203 480341112 660.0 2021-03-11 62302.0 42
9204 450371112 660.0 2021-03-11 62648.0 32
Here's a possible approach: sort the recorded journey types by number of camera points, descending (so e.g. ABCD has four points, it should appear before EFG which has three). Then for each journey type, estimate how many spurious observations of each subsequence you would expect to see, and subtract that number from the actual observations of each of those subsequences (e.g. if we think there were 100 jouneys for ABCD, then we'd expect 100 * p(1-p)^3 spurious observations of each of ABC, ABD, ACD and BCD, and so on for shorter subsequences).
Also keep track of the uncertainty in each count, which is initially zero, but each time you subtract an estimate from the count of some journey type, add on the variance of that estimate to the uncertainty for that journey type. The variance can then be used to estimate the likelihood that a measurement should really be zero; if so, just ignore that journey type when you come to it, and move onto the next one.
At the end, the "real" journey types are those where a count of zero isn't likely based on the estimated count and variance. Don't forget to also adjust the counts of "real" journeys upwards, e.g. a path ABCD has a (1-p)^4 chance of not being recorded correctly so you should divide the estimated count by that factor to compensate. This adjustment should be made before estimating how many spurious observations to deduct from the subsequences.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With