I'm using Elasticsearch 5.3.1 and I'm evaluating BM25 and Classic TF/IDF.
I came across the discount_overlaps property, which is optional. Its documentation says:
Determines whether overlap tokens (Tokens with 0 position increment) are ignored when computing norm. By default this is true, meaning overlap tokens do not count when computing norms.
Can someone explain what the above means, with an example if possible?
First off, the norm is calculated as boost / √length, and this value is stored at index time. This causes matches on shorter fields to get a higher score (because 1 in 10 is generally a better match than 1 in 1000).
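To put rough numbers on that formula (the field lengths below are made up, and Lucene actually squeezes the stored norm into a single byte, so the real values are a bit coarser), here's a quick sketch:

```python
import math

boost = 1.0  # default field boost

# Hypothetical field lengths: a 10-term field vs a 1000-term field
for length in (10, 1000):
    norm = boost / math.sqrt(length)
    print(f"{length:>4} terms -> norm = {norm:.4f}")

# Prints roughly:
#   10 terms -> norm = 0.3162
# 1000 terms -> norm = 0.0316
```

The norm multiplies into the score at query time, so a match in the 10-term field ends up weighted about ten times higher than the same match in the 1000-term field.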
As an example, let's say we have a synonym filter on our analyzer that is going to index a bunch of synonyms into the indexed form of our field. Then we index this text:
The man threw a frisbee
Once the analyzer adds all the synonyms to the field, it looks something like this:

The   man      threw     a   frisbee
      dude     pitched       disc
Now when we search for "The dude pitched a disc", we'll get a match.
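That match works because the synonyms were indexed as overlap tokens: they sit at the same positions as the original words, i.e. they have a position increment of 0. If you want to see that for yourself, here's a sketch using the _analyze API with an inline synonym filter (the synonym list is made up to match the example, and it assumes an Elasticsearch 5.x node on localhost:9200):

```python
import requests

# Inline, throwaway analysis chain: standard tokenizer + lowercase + synonyms.
body = {
    "tokenizer": "standard",
    "filter": [
        "lowercase",
        {"type": "synonym",
         "synonyms": ["man, dude", "threw, pitched", "frisbee, disc"]},
    ],
    "text": "The man threw a frisbee",
}

resp = requests.get("http://localhost:9200/_analyze", json=body)
for tok in resp.json()["tokens"]:
    # Each synonym comes back with the same "position" as the word it
    # overlaps -- a position increment of 0. These are the overlap tokens.
    print(tok["position"], tok["token"])
```

In the output, "dude" shares a position with "man", "pitched" with "threw", and "disc" with "frisbee".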
The question is, for the purposes of the norm calculation above, what is the length? With discount_overlaps set to true (the default), the overlap tokens are ignored, so the length is 5, exactly as if no synonyms had been indexed, and the norm is 1/√5 ≈ 0.45. With discount_overlaps set to false, the three synonyms count as well, the length becomes 8, and the norm drops to 1/√8 ≈ 0.35, so the document would be scored lower simply because synonyms were added to it at index time.
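As for where the property actually lives: it's a setting on a similarity in the index settings, which you then attach to a field in the mapping. Here's a sketch that flips it to false (index, analyzer, and field names are made up; assumes Elasticsearch 5.x on localhost:9200):

```python
import requests

settings = {
    "settings": {
        "analysis": {
            "filter": {
                "my_synonyms": {
                    "type": "synonym",
                    "synonyms": ["man, dude", "threw, pitched", "frisbee, disc"],
                }
            },
            "analyzer": {
                "synonyms_at_index_time": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_synonyms"],
                }
            },
        },
        "similarity": {
            "bm25_counting_overlaps": {
                "type": "BM25",
                "discount_overlaps": False,  # default is True
            }
        },
    },
    "mappings": {
        "doc": {
            "properties": {
                "body": {
                    "type": "text",
                    "analyzer": "synonyms_at_index_time",
                    "similarity": "bm25_counting_overlaps",
                }
            }
        }
    },
}

requests.put("http://localhost:9200/frisbee_test", json=settings)
```

The classic similarity ("type": "classic") accepts the same discount_overlaps option, so the behaviour is the same whichever of the two you're evaluating. With the default of true you don't need a custom similarity at all.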