beside the correct language ID langid.py returns a certain value - "The value returned is a score for the language. It is not a probability esimate, as it is not normalized by the document probability since this is unnecessary for classification." But what does the value mean??
I'm actually the author of langid.py. Unfortunately, I've only just spotted this question now, almost a year after it was asked. I've tidied up the handling of the normalization since this question was asked, so all the README examples have been updated to show actual probabilities.
The value that you see there (and that you can still get by turning normalization off) is the un-normalized log-probability of the document. Because log/exp are monotonic, we don't actually need to compute the probability to decide the most likely class. The actual value of this log-prob is not actually of any use to the user. I should probably have never included it, and I may remove its output in the future.
I think this is the important chunk of langid.py code:
def nb_classify(fv):
# compute the log-factorial of each element of the vector
logfv = logfac(fv).astype(float)
# compute the probability of the document given each class
pdc = np.dot(fv,nb_ptc) - logfv.sum()
# compute the probability of the document in each class
pd = pdc + nb_pc
# select the most likely class
cl = np.argmax(pd)
# turn the pd into a probability distribution
pd /= pd.sum()
return cl, pd[cl]
It looks to me that the author is calculating something like the multinomial log-posterior of the data for each of the possible languages. logfv calculates the logarithm of the denominator of the PMF (x_1!...x_k!). np.dot(fv,nb_ptc) calculates the
logarithm of the p_1^x_1...p_k^x_k term. So, pdc looks like the list of language conditional log-likelihoods (except that it's missing the n! term). nb_pc looks like the prior probabilities, so pd would be the log-posteriors. The normalization line, pd /= pd.sum() confuses me, since one usually normalizes probability-like values (not log-probability values); also, the examples in the documentation (('en', -55.106250761034801)) don't look like they've been normalized---maybe they were generated before the normalization line was added?
Anyway, the short answer is that this value, pd[cl] is a confidence score. My understanding based on the current code is that they should be values between 0 and 1/97 (since there are 97 languages), with a smaller value indicating higher confidence.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With