I am studying the Okapi BM25 model. I understand everything except two points of confusion, which come up when calculating document length (dl) and average document length (avdl). I found the document length defined as:

So it is the sum of my keywords/terms in a particular document. But when I look at Wikipedia's definition:

So |D| is the length of the document D in words (i.e. the total word count). Now, the first question: what is dl actually?
Second question: how do I calculate avdl? Is it just (doc1 + doc2 + ... + docN) / N, where N is the total number of documents in my collection? And is avdl fixed for the whole collection?
According to Joaquín Pérez-Iglesias in "Integrating the Probabilistic Model BM25/BM25F into Lucene", the score function R should be defined as follows:

where
occurs_t^d is the term frequency of t in d,
l_d is the length of document d,
avl_d is the average document length across the collection,
k_1 is a free parameter (usually 2), and
b is a free parameter in [0,1] (usually 0.75).
Setting b to 0 disables length normalisation, so the document length does not affect the final score. Setting b to 1 applies full length normalisation.
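Written out with the symbols above, and assuming the usual BM25 saturation form, the per-query score would look roughly like the following. This is a reconstruction from the definitions rather than a quotation, and some formulations additionally multiply the numerator by (k_1 + 1). The last two lines show what the denominator reduces to for b = 0 and b = 1.

```latex
R(q, d) = \sum_{t \in q}
  \frac{occurs_t^d}
       {k_1 \left( (1 - b) + b \, \frac{l_d}{avl_d} \right) + occurs_t^d}

% b = 0: no length normalisation, l_d drops out
k_1 \left( (1 - 0) + 0 \cdot \frac{l_d}{avl_d} \right) + occurs_t^d = k_1 + occurs_t^d

% b = 1: full length normalisation
k_1 \left( (1 - 1) + 1 \cdot \frac{l_d}{avl_d} \right) + occurs_t^d = k_1 \, \frac{l_d}{avl_d} + occurs_t^d
```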

In the IDF weight applied to each term, N is the number of documents in the collection and df is the number of documents in which the term t appears.
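To make the two lengths concrete, here is a minimal Python sketch, not the Lucene implementation from the paper. It assumes each document is already tokenized into a list of terms, that l_d is simply the number of tokens in the document, and that the IDF weight has the common form log((N - df + 0.5) / (df + 0.5)). The function name `bm25_scores` and the toy documents are made up for illustration.

```python
import math
from collections import Counter


def bm25_scores(query_terms, docs, k1=2.0, b=0.75):
    """Return (avdl, scores) for tokenized documents `docs`.

    Assumptions (not taken from the paper's code): l_d is the token count
    of a document, and IDF is log((N - df + 0.5) / (df + 0.5)).
    """
    n_docs = len(docs)
    doc_lens = [len(d) for d in docs]      # l_d for every document
    avdl = sum(doc_lens) / n_docs          # one average for the whole collection

    # df[t]: number of documents in which term t appears at least once
    df = Counter(t for d in docs for t in set(d))

    scores = []
    for d, l_d in zip(docs, doc_lens):
        tf = Counter(d)                    # occurs_t^d for this document
        norm = k1 * ((1 - b) + b * l_d / avdl)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue                   # term absent: contributes nothing
            idf = math.log((n_docs - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] / (norm + tf[t])
        scores.append(score)
    return avdl, scores


if __name__ == "__main__":
    toy_docs = [["a", "b", "a"], ["b", "c"], ["c", "d", "e", "f"]]
    avdl, scores = bm25_scores(["a"], toy_docs)
    print("avdl:", avdl)                   # (3 + 2 + 4) / 3
    print("scores:", scores)
```

In this sketch avdl is computed once over the collection and reused for every document, so it only changes when documents are added or removed. Note that this IDF form can turn negative for terms that appear in more than half of the documents.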