prefix similarity search

Question

I am trying to find a way to build a fuzzy search where both the text database and the queries may have spelling variants. In particular, the text database is material collected from the web and likely would not benefit from full text engine's prep phase (word stemming) I could imagine using pg_trgm as a starting point and then validate hits by Levenshtein. However, people tend to do prefix queries E.g, in the realm of music, I would expect "beetho symphony" to be a reasonable search term. So, is someone were typing "betho symphony", is there a reasonable way (using postgresql with perhaps tcl or perl scripting) to discover that the "betho" part should be compared with "beetho" (returning an edit distance of 1)

user1938139 · Accepted Answer

What I ended up is a simple modification of the common algorithm: normally I would just pick up the last value from the matrix or vector pair. Referring to the "iterative" algorithm in http://en.wikipedia.org/wiki/Levenshtein_distance I put the strings to be probed as first argument and the query string as second one. Now, when the algorithm finishes, the minimum value in the result column gives the proper result

Sample results: query "fantas", words in database "fantasy", "fantastic" => 0 query "fantas", wor in database "fan" => 3

The input to edit distance are words selected from a "most words" list based on trigram similarity

Sample results: query "fantas", words in database "fantasy", "fantastic" => 0 query "fantas", wor in database "fan" => 3

The input to edit distance are words selected from a "most words" list based on trigram similarity

prefix similarity search

Tags:

algorithm

postgresql

fuzzy-search

user1938139

1 Answers

user1938139

Recent Activity

Donate For Us

prefix similarity search

Tags:

algorithm

postgresql

fuzzy-search

user1938139

1 Answers

user1938139

Related questions

Recent Activity

Donate For Us