I would like to know how to match postal addresses when their formats differ or when one of them is misspelled.
So far I've found different solutions, but I think they are quite old and not very efficient. I'm sure better methods exist, so if you have references for me to read, I'm sure it is a subject that may interest several people.
The solutions I found (examples are in R):
Levenshtein distance, which is the minimum number of characters you have to insert, delete or change to transform one word into another.
agrep("acusait", c("accusait", "abusait"), max = 2, value = TRUE)
## [1] "accusait" "abusait"
Phonetic comparison (Soundex codes)
library(RecordLinkage)
# words with the same Soundex code are considered phonetically equivalent
soundex(c("accusait", "acusait", "abusait"))
## [1] "A223" "A223" "A123"
The use of a spelling corrector (possibly a Bayesian one like Peter Norvig's), but I guess it would not be very effective on addresses.
I thought about using the suggestions of Google Suggest, but likewise, it is not very effective on personal postal addresses.
You could imagine using a supervised machine-learning approach, but to do so you need to have stored users' misspelled requests, which is not an option for me.
I look at this as a spelling-correction problem, where you need to find the nearest-matching word in some sort of dictionary. By "near" I mean Levenshtein distance, except that restricting it to the smallest number of single-character insertions, deletions, and replacements is too limiting. Other kinds of "spelling mistakes" are also possible, for example transposing two adjacent characters.
I've done this a few times, but not lately. The most recent case had to do with concomitant medications for clinical trials. You would be amazed how many ways there are to misspell "acetylsalicylic".
Here is an outline in C++ of how it is done.
Briefly, the dictionary is stored as a trie, and you are presented with a possibly misspelled word, which you try to look up in the trie. As you search, you try the word as it is, and at each point you also try all possible alterations of it. As you go, you have an integer budget of how many alterations you can tolerate, which you decrement every time you apply an alteration. If you exhaust the budget, you allow no further alterations.
Now there is a top-level loop, where you call the search. On the first iteration, you call it with a budget of 0. When the budget is 0, it will allow no alterations, so it is simply a direct lookup. If it fails to find the word with a budget of 0, you call it again with a budget of 1, so it will allow a single alteration. If that fails, try a budget of 2, and so on.
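Here is a minimal, self-contained sketch of what those two pieces might look like. The answer only outlines the idea, so the names (TrieNode, search, lookup) and the exact set of alterations (substitution, insertion, deletion, transposition) are my own choices:

#include <algorithm>
#include <iostream>
#include <map>
#include <memory>
#include <string>
#include <vector>

struct TrieNode {
    bool isWord = false;
    std::map<char, std::unique_ptr<TrieNode>> children;
};

void insert(TrieNode* root, const std::string& word) {
    TrieNode* node = root;
    for (char c : word) {
        auto& child = node->children[c];
        if (!child) child = std::make_unique<TrieNode>();
        node = child.get();
    }
    node->isWord = true;
}

// Walk the trie, matching `word` from position `pos` onward. `budget` is the
// number of alterations still allowed; each alteration costs 1. `sofar` is the
// dictionary prefix reached so far; complete matches are appended to `out`.
void search(const TrieNode* node, const std::string& word, size_t pos,
            int budget, std::string& sofar, std::vector<std::string>& out) {
    if (pos == word.size() && node->isWord)
        out.push_back(sofar);
    // The character as written, at no cost.
    if (pos < word.size()) {
        auto it = node->children.find(word[pos]);
        if (it != node->children.end()) {
            sofar.push_back(word[pos]);
            search(it->second.get(), word, pos + 1, budget, sofar, out);
            sofar.pop_back();
        }
    }
    if (budget == 0) return;  // budget exhausted: no further alterations
    // Deletion: skip one character of the query.
    if (pos < word.size())
        search(node, word, pos + 1, budget - 1, sofar, out);
    for (const auto& [c, child] : node->children) {
        // Substitution: consume a query character but follow a different edge.
        if (pos < word.size() && c != word[pos]) {
            sofar.push_back(c);
            search(child.get(), word, pos + 1, budget - 1, sofar, out);
            sofar.pop_back();
        }
        // Insertion: follow an edge without consuming a query character.
        sofar.push_back(c);
        search(child.get(), word, pos, budget - 1, sofar, out);
        sofar.pop_back();
    }
    // Transposition: swap the next two query characters.
    if (pos + 1 < word.size() && word[pos] != word[pos + 1]) {
        std::string swapped = word;
        std::swap(swapped[pos], swapped[pos + 1]);
        search(node, swapped, pos, budget - 1, sofar, out);
    }
}

// The top-level loop: budget 0 is a plain lookup; if it finds nothing,
// retry with budget 1, then 2, ... up to maxBudget.
std::vector<std::string> lookup(const TrieNode* root, const std::string& word,
                                int maxBudget) {
    for (int budget = 0; budget <= maxBudget; ++budget) {
        std::vector<std::string> out;
        std::string sofar;
        search(root, word, 0, budget, sofar, out);
        if (!out.empty()) {
            // The same word can be reached by several edit paths; deduplicate.
            std::sort(out.begin(), out.end());
            out.erase(std::unique(out.begin(), out.end()), out.end());
            return out;
        }
    }
    return {};
}

int main() {
    TrieNode root;
    for (const char* w : {"accusait", "abusait", "acetylsalicylic"})
        insert(&root, w);
    for (const std::string& m : lookup(&root, "acusait", 2))
        std::cout << m << '\n';  // prints abusait and accusait, both one alteration away
}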
Something I have not tried is a fractional budget. For example, suppose a typical alteration reduces the budget by 2, not 1, and the budget goes 0, 2, 4, etc. Then some alterations could be considered "cheaper". For example, a vowel replacement might only decrement the budget by 1, so for the cost of one consonant replacement you could do two vowel replacements.
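In the sketch above, that could look something like the following; substitutionCost is a hypothetical helper, and the weights are just the example numbers from the paragraph above:

#include <string>

// Hypothetical cost function for the fractional-budget idea: a typical
// alteration costs 2, so a vowel-for-vowel substitution at cost 1 is half price.
int substitutionCost(char from, char to) {
    const std::string vowels = "aeiou";
    bool bothVowels = vowels.find(from) != std::string::npos &&
                      vowels.find(to) != std::string::npos;
    return bothVowels ? 1 : 2;
}
// In search(), a substitution would then spend substitutionCost(word[pos], c)
// from the budget instead of a flat 1, and the top-level loop would try
// budgets 0, 2, 4, ...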
If the word is not misspelled, the time it takes is proportional to the number of letters in the word. In general, the time it takes is exponential in the number of alterations in the word.
If you are working in R (as in the examples above), I would have it call out to the C++ program, because you need the speed of a compiled language for this.
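The answer does not say how to make that call, but one common bridge is Rcpp. A minimal sketch, assuming the TrieNode, insert, and lookup definitions from the sketch above live in the same file:

// fuzzy_lookup.cpp -- load from R with Rcpp::sourceCpp("fuzzy_lookup.cpp"),
// then call fuzzy_match(c("accusait", "abusait"), "acusait", 2).
#include <Rcpp.h>

// [[Rcpp::export]]
std::vector<std::string> fuzzy_match(std::vector<std::string> dictionary,
                                     std::string word, int maxBudget) {
    TrieNode root;
    for (const std::string& w : dictionary)
        insert(&root, w);  // rebuilding per call is wasteful; cache it in real use
    return lookup(&root, word, maxBudget);
}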