I'm using Elasticsearch 5.3.1 and I'm evaluating BM25 and Classic TF/IDF.
I came across the discount_overlaps property, which is optional. Its documentation says:
Determines whether overlap tokens (Tokens with 0 position increment) are ignored when computing norm. By default this is true, meaning overlap tokens do not count when computing norms.
Can someone explain what the above means, with an example if possible?
First off, the norm is calculated as boost / √length, and this value is stored at index time. This causes matches on shorter fields to get a higher score (because 1 in 10 is generally a better match than 1 in 1000).
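To put rough numbers on that formula (the field lengths below are made up, and Lucene actually squeezes the stored norm into a single byte, so the real values are a bit coarser), here's a quick sketch:

```python
import math

boost = 1.0  # default field boost

# Hypothetical field lengths: a 10-term field vs a 1000-term field
for length in (10, 1000):
    norm = boost / math.sqrt(length)
    print(f"{length:>4} terms -> norm = {norm:.4f}")

# Prints roughly:
#   10 terms -> norm = 0.3162
# 1000 terms -> norm = 0.0316
```

The norm multiplies into the score at query time, so a match in the 10-term field ends up weighted about ten times higher than the same match in the 1000-term field.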
As an example, let's say we have a synonym filter on our analyzer that is going to index a bunch of synonyms into the indexed form of our field. Then we index this text:
The man threw a frisbee
Once the analyzer adds all the synonyms to the field, it looks something like this:

The   man      threw     a   frisbee
      dude     pitched       disc
Now when we search for "The dude pitched a disc", we'll get a match.
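That match works because the synonyms were indexed as overlap tokens: they sit at the same positions as the original words, i.e. they have a position increment of 0. If you want to see that for yourself, here's a sketch using the _analyze API with an inline synonym filter (the synonym list is made up to match the example, and it assumes an Elasticsearch 5.x node on localhost:9200):

```python
import requests

# Inline, throwaway analysis chain: standard tokenizer + lowercase + synonyms.
body = {
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        {"type": "synonym",
         "synonyms": ["man, dude", "threw, pitched", "frisbee, disc"]},
    ],
    "text": "The man threw a frisbee",
}

resp = requests.get("http://localhost:9200/_analyze", json=body)
for tok in resp.json()["tokens"]:
    # Each synonym comes back with the same "position" as the word it
    # overlaps -- a position increment of 0. These are the overlap tokens.
    print(tok["position"], tok["token"])
```

In the output, "dude" shares a position with "man", "pitched" with "threw", and "disc" with "frisbee".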
The question is, for the purposes of the norm calculation above, what is the length? With discount_overlaps set to true (the default), the overlap tokens are ignored, so the length is 5, exactly as if no synonyms had been indexed, and the norm is 1/√5 ≈ 0.45. With discount_overlaps set to false, the three synonyms count as well, the length becomes 8, and the norm drops to 1/√8 ≈ 0.35, so the document would be scored lower simply because synonyms were added to it at index time.
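As for where the property actually lives: it's a setting on a similarity in the index settings, which you then attach to a field in the mapping. Here's a sketch that flips it to false (index, analyzer, and field names are made up; assumes Elasticsearch 5.x on localhost:9200):

```python
import requests

settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_synonyms": {
                    "type": "synonym",
                    "synonyms": ["man, dude", "threw, pitched", "frisbee, disc"],
                }
            },
            "analyzer": {
                "synonyms_at_index_time": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_synonyms"],
                }
            },
        },
        "similarity": {
            "bm25_counting_overlaps": {
                "type": "BM25",
                "discount_overlaps": False,  # default is True
            }
        },
    },
    "mappings": {
        "doc": {
            "properties": {
                "body": {
                    "type": "text",
                    "analyzer": "synonyms_at_index_time",
                    "similarity": "bm25_counting_overlaps",
                }
            }
        }
    },
}

requests.put("http://localhost:9200/frisbee_test", json=settings)
```

The classic similarity ("type": "classic") accepts the same discount_overlaps option, so the behaviour is the same whichever of the two you're evaluating. With the default of true you don't need a custom similarity at all.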