
Lucene Scoring Function - bias towards shorter documents

I want the Lucene scoring function to have no bias based on document length. This is a follow-up to the question "Calculate the score only based on the documents have more occurance of term in lucene".

How does Field.setOmitNorms(true) work? As far as I can see, two factors give short documents a higher score:

  1. the "boost" applied to shorter posts, via doc.getBoost()
  2. the "lengthNorm" factor in the definition of norm(t,d)

Here is the documentation

If I want no bias towards shorter documents, is Field.setOmitNorms(true) enough?

vir asked Jan 26 '26 19:01



1 Answer

With BM25Similarity you can set one of its two parameters to 0f. The first is b, which controls document-length normalization:

@param b Controls to what degree document length normalizes tf values

or

@param k1 Controls non-linear term frequency normalization (saturation).

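Both parameters affect the computed SimWeight. Setting b to 0f removes the length effect while keeping term frequency; setting k1 to 0f also discards term frequency, so every matching term contributes the same weight regardless of how often it occurs. For scoring by number of occurrences, b is the one to zero out: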

indexSearcher.setSimilarity(new BM25Similarity(1.2f,0f));
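
To see why b = 0 removes the length effect, here is a minimal standalone sketch of the textbook BM25 term-frequency component. It is not Lucene's internal code (Lucene's implementation differs in details such as how norms are encoded), and the class and method names are made up for illustration:

```java
public class Bm25LengthDemo {

    /**
     * Textbook BM25 term-frequency component (an illustration, not Lucene's code).
     *
     * @param tf        number of occurrences of the term in the document
     * @param docLen    length of the document in terms
     * @param avgDocLen average document length in the collection
     * @param k1        saturation parameter
     * @param b         degree of length normalization (0 = none, 1 = full)
     */
    static double tfNorm(double tf, double docLen, double avgDocLen, double k1, double b) {
        // With b = 0 this factor is exactly 1, so docLen drops out of the score.
        double lengthFactor = 1.0 - b + b * (docLen / avgDocLen);
        return tf * (k1 + 1.0) / (tf + k1 * lengthFactor);
    }

    public static void main(String[] args) {
        double tf = 2.0;          // hypothetical values chosen for the demo
        double avgDocLen = 100.0;
        double k1 = 1.2;

        for (double b : new double[] {0.75, 0.0}) {
            for (double docLen : new double[] {50.0, 500.0}) {
                System.out.printf("b=%.2f docLen=%.0f tfNorm=%.4f%n",
                        b, docLen, tfNorm(tf, docLen, avgDocLen, k1, b));
            }
        }
    }
}
```

With b = 0.75 the short document gets the larger value; with b = 0 both documents get the same value for the same term frequency.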

More explanation can be found here: http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/

Guillaume Malartre answered Jan 28 '26 11:01



