I'm interested in a function of two word lists, which would return an order agnostic edit distance between them.
That is, the arguments would be two lists of (let's say space delimited) words and return value would be the minimum sum of the edit (or Levenshtein) distances of the words in the lists.
Distance between "cat rat bat" and "rat bat cat" would be 0. Distance between "cat rat bat" and "fat had bad" would be the same as distance between "rat bat cat" and "had fat bad", 4. In the case the number of words in the lists are not the same, the shorter list would be padded with 0-length words.
My intuition (which hasn't been nurtured with computer science classes) does not find any other solution than to use brute force:
|had|fat|bad| a solution
---+---+---+---+ +---+---+---+
cat| 2 | 1 | 2 | | | 1 | |
---+---+---+---+ +---+---+---+
rat| 2 | 1 | 2 | | 3 | | |
---+---+---+---+ +---+---+---+
bat| 2 | 1 | 1 | | | | 4 |
---+---+---+---+ +---+---+---+
Starting from the first row, pick a column and go to the next rows without ever revisiting a column you have already visited. Do this over and over again until you've tried all combinations.
To me this sounds a bit like the traveling salesman problem. Is it, and how would you solve my particular problem?
As you already showed in the grid on the left, you can start by calculating the edit distances for every pair of words. This is easily done in polynomial time (n^2 edit distance calculations).
Then your problem can be described as a "minimum weighted bipartite matching", or equivalently, a "maximum weighted bipartite matching". This can also be done in polynomial time (faster than traveling salesman). See http://en.wikipedia.org/wiki/Matching_%28graph_theory%29#Maximum_matchings_in_bipartite_graphs
This looks like a perfect oportunity to break out the Levenshtein distance algorithm. This algorithm will do exactly what you are looking for (the minimal distance between two strings) and it is pretty efficient too. As far as it being a variation of traveling salesman that would be a no because this can be solved in polynomial time due to the structure of the problem. Hope this helps.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With