Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

prefix similarity search

I am trying to find a way to build a fuzzy search where both the text database and the queries may have spelling variants. In particular, the text database is material collected from the web and likely would not benefit from full text engine's prep phase (word stemming) I could imagine using pg_trgm as a starting point and then validate hits by Levenshtein. However, people tend to do prefix queries E.g, in the realm of music, I would expect "beetho symphony" to be a reasonable search term. So, is someone were typing "betho symphony", is there a reasonable way (using postgresql with perhaps tcl or perl scripting) to discover that the "betho" part should be compared with "beetho" (returning an edit distance of 1)

like image 335
user1938139 Avatar asked Nov 29 '25 04:11

user1938139


1 Answers

What I ended up is a simple modification of the common algorithm: normally I would just pick up the last value from the matrix or vector pair. Referring to the "iterative" algorithm in http://en.wikipedia.org/wiki/Levenshtein_distance I put the strings to be probed as first argument and the query string as second one. Now, when the algorithm finishes, the minimum value in the result column gives the proper result

Sample results: query "fantas", words in database "fantasy", "fantastic" => 0 query "fantas", wor in database "fan" => 3

The input to edit distance are words selected from a "most words" list based on trigram similarity

like image 104
user1938139 Avatar answered Nov 30 '25 23:11

user1938139



Donate For Us

If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!