Lucene Scoring Function - bias towards shorter documents

Question

I want the Lucene scoring function to have no bias based on the length of the document. This is a follow-up to the question "Calculate the score only based on the documents have more occurance of term in lucene".

How does Field.setOmitNorms(true) work? As far as I can see, two factors give short documents a higher score:

the "boost", which favors shorter posts (read via doc.getBoost())
the "lengthNorm" in the definition of norm(t,d)

Here is the documentation

If I want no bias towards shorter documents, is Field.setOmitNorms(true) enough?

Guillaume Malartre · Accepted Answer

With BM25Similarity you can set one of the following constructor parameters to 0f:

@param b Controls to what degree document length normalizes tf values

or

@param k1 Controls non-linear term frequency normalization (saturation).

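Both parameters affect the SimWeight. Setting b to 0f switches off document-length normalization while term frequency still counts; setting k1 to 0f makes the score ignore term frequency as well, so b is usually the one to change: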

indexSearcher.setSimilarity(new BM25Similarity(1.2f,0f));
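To see why b = 0f removes the length bias, here is a minimal sketch of the textbook BM25 term-frequency component. It is not Lucene's source code (constant factors differ between Lucene versions), and the class name, method name and sample numbers are made up for illustration.

```java
// Sketch of the textbook BM25 term-frequency component, NOT Lucene's actual code.
// Bm25LengthDemo, tfComponent and all sample values are hypothetical.
final class Bm25LengthDemo {

    /**
     * freq * (k1 + 1) / (freq + k1 * (1 - b + b * docLen / avgDocLen))
     *
     * b scales how strongly the ratio docLen / avgDocLen enters the denominator;
     * k1 scales how quickly repeated occurrences saturate.
     */
    static double tfComponent(double freq, double k1, double b,
                              double docLen, double avgDocLen) {
        double lengthFactor = 1.0 - b + b * (docLen / avgDocLen);
        return freq * (k1 + 1.0) / (freq + k1 * lengthFactor);
    }

    public static void main(String[] args) {
        double avgDocLen = 100.0; // assumed average field length
        double freq = 3.0;        // the term occurs three times in both documents

        // b = 0.75: the 50-token document gets a higher value than the 500-token one.
        System.out.println(tfComponent(freq, 1.2, 0.75, 50.0, avgDocLen));
        System.out.println(tfComponent(freq, 1.2, 0.75, 500.0, avgDocLen));

        // b = 0: the length factor is always 1, so both documents get the same value.
        System.out.println(tfComponent(freq, 1.2, 0.0, 50.0, avgDocLen));
        System.out.println(tfComponent(freq, 1.2, 0.0, 500.0, avgDocLen));
    }
}
```

With b = 0 the document length drops out of the formula entirely, which is the effect the BM25Similarity(1.2f, 0f) call above is after.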

More explanation can be found here: http://opensourceconnections.com/blog/2015/10/16/bm25-the-next-generation-of-lucene-relevation/
