I am using Whoosh to index and search a large number of documents, and many of the things I need to search on are hyphenated. Whoosh seems to treat hyphens as a special character of some kind, but for the life of me I can't figure out its behavior.
Can anyone advise on how Whoosh treats hyphens while indexing and searching?
With the default analyzer, Whoosh treats a hyphen, like most punctuation, as a space. Assuming a default AND search, the query dual-scale thermometer is equivalent to dual AND scale AND thermometer. This will find a document containing dual-scale digital thermometer, but it will also find dual purpose bathroom scale with thermometer.
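To see why both documents match, here is a minimal sketch in plain Python that imitates only the splitting step. It assumes Whoosh's default token pattern is \w+(\.?\w+)* and ignores stop words and the rest of the analyzer; tokenize and matches_all are illustrative helper names, not Whoosh functions.

```python
import re

# Assumption: this mirrors the default token pattern of Whoosh's RegexTokenizer.
DEFAULT_PATTERN = re.compile(r"\w+(\.?\w+)*")

def tokenize(text):
    """Split text into lowercase tokens; hyphens and spaces both act as separators."""
    return [m.group(0).lower() for m in DEFAULT_PATTERN.finditer(text)]

def matches_all(query, document):
    """Imitate a default AND search: every query token must occur in the document."""
    return set(tokenize(query)) <= set(tokenize(document))

query = "dual-scale thermometer"
print(tokenize(query))  # ['dual', 'scale', 'thermometer']
print(matches_all(query, "dual-scale digital thermometer"))  # True
print(matches_all(query, "dual purpose bathroom scale with thermometer"))  # True
```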
One way to avoid this is to turn the hyphenated words in your query into phrases: "dual-scale" thermometer, which is equivalent to "dual scale" AND thermometer.
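If the queries come from users, the quoting could be done automatically before the string is handed to the query parser. The following is a rough sketch under simple assumptions: quote_hyphenated is a hypothetical helper (not part of Whoosh), terms are separated by whitespace, and hyphenated words that already sit inside a multi-word quoted phrase are not handled.

```python
import re

# A bare hyphenated word: word characters joined by one or more single hyphens.
HYPHENATED = re.compile(r"^\w+(?:-\w+)+$")

def quote_hyphenated(query):
    """Wrap each bare hyphenated term in double quotes so it is parsed as a phrase."""
    parts = []
    for term in query.split():
        # Terms that are already quoted do not match the pattern and pass through.
        parts.append(f'"{term}"' if HYPHENATED.match(term) else term)
    return " ".join(parts)

print(quote_hyphenated("dual-scale thermometer"))  # "dual-scale" thermometer
```

The rewritten string can then be parsed as usual, so the hyphenated term is searched as a phrase while the index still holds the individual words.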
You could also force Whoosh to accept hyphens as part of a word. You do this by overriding the RegexTokenizer expression in the StandardAnalyzer with a regular expression that accepts hyphens as a valid part of a token.
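from whoosh import analysis, fields
# Add "-" to the token character class so a hyphenated word stays a single token.
myanalyzer = analysis.StandardAnalyzer(expression=r'[\w-]+(\.?\w+)*')
# The field's analyzer is applied both when indexing and when parsing queries on this field.
schema = fields.Schema(myfield=fields.TEXT(analyzer=myanalyzer))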
Now a search for dual-scale thermometer is equivalent to dual-scale AND thermometer and will find dual-scale digital thermometer but not dual purpose bathroom scale with thermometer.
However, you won't be able to search for the parts of a hyphenated word independently. If your document contained high-quality components, a search for quality would not match it; only high-quality would, because that is now a single token. Because of this side effect, unless your content is strictly constrained in its use of hyphens to truly atomic hyphenated words, I would recommend the phrase approach.
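The side effect is easy to reproduce outside Whoosh. This sketch applies the same expression as the analyzer above with Python's re module; it only imitates the tokenizing and lowercasing steps, and tokenize is an illustrative helper, not a Whoosh function.

```python
import re

# Same expression as the custom analyzer above, applied with plain re.
HYPHEN_PATTERN = re.compile(r"[\w-]+(\.?\w+)*")

def tokenize(text):
    """Split text into lowercase tokens, keeping hyphenated words whole."""
    return [m.group(0).lower() for m in HYPHEN_PATTERN.finditer(text)]

doc_tokens = tokenize("high-quality components")
print(doc_tokens)                    # ['high-quality', 'components']
print("quality" in doc_tokens)       # False
print("high-quality" in doc_tokens)  # True
```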