What exactly does the "returned value" in langid.py mean?

Question

Besides the correct language ID, langid.py returns a certain value. The README says: "The value returned is a score for the language. It is not a probability estimate, as it is not normalized by the document probability since this is unnecessary for classification." But what does the value actually mean?

saffsd · Accepted Answer

I'm actually the author of langid.py. Unfortunately, I've only just spotted this question now, almost a year after it was asked. I've tidied up the handling of the normalization since this question was asked, so all the README examples have been updated to show actual probabilities.

The value that you see there (and that you can still get by turning normalization off) is the un-normalized log-probability of the document. Because log/exp are monotonic, we don't need to compute the probability to decide the most likely class. The value of this log-prob is of no real use to the user. I should probably never have included it, and I may remove its output in the future.
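To illustrate the monotonicity point, here is a minimal sketch in plain Python. It is not langid.py's own code: the function name `normalize_log_scores` and the three log-scores are hypothetical, chosen only to show that the class with the highest un-normalized log-score is also the class with the highest normalized probability.

```python
import math

def normalize_log_scores(log_scores):
    """Turn un-normalized log-probabilities into probabilities that sum to 1."""
    # Subtract the maximum before exponentiating so exp() does not underflow.
    m = max(log_scores)
    exps = [math.exp(s - m) for s in log_scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical un-normalized log-scores for three languages (illustration only).
log_scores = {"en": -55.1, "fr": -60.3, "de": -72.8}
langs = list(log_scores)
probs = normalize_log_scores([log_scores[l] for l in langs])

# Picking the best class from the raw log-scores...
best_by_log = max(langs, key=lambda l: log_scores[l])
# ...gives the same answer as picking it from the normalized probabilities.
best_by_prob = langs[probs.index(max(probs))]
```

The normalization changes the scale of the numbers but not their order, which is why a classifier can skip it when only the winning class is needed.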

jrennie · Answer

I think this is the important chunk of langid.py code:

def nb_classify(fv):
  # compute the log-factorial of each element of the vector
  logfv = logfac(fv).astype(float)
  # compute the probability of the document given each class
  pdc = np.dot(fv,nb_ptc) - logfv.sum()
  # compute the probability of the document in each class
  pd = pdc + nb_pc
  # select the most likely class
  cl = np.argmax(pd)
  # turn the pd into a probability distribution
  pd /= pd.sum()
  return cl, pd[cl]

It looks to me like the author is calculating something like the multinomial log-posterior of the data for each of the possible languages. logfv holds the log-factorial of each count, so logfv.sum() is the logarithm of the denominator of the PMF (x_1! ... x_k!). np.dot(fv,nb_ptc) calculates the logarithm of the p_1^x_1 ... p_k^x_k term. So pdc looks like the list of language-conditional log-likelihoods (except that it's missing the n! term, which is the same for every language). nb_pc looks like the log prior probabilities, so pd would be the log-posteriors, up to a constant shared by all languages.

The normalization line, pd /= pd.sum(), confuses me, since one usually normalizes probability-like values (not log-probability values). Also, the examples in the documentation (('en', -55.106250761034801)) don't look like they've been normalized; maybe they were generated before the normalization line was added?
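The multinomial reading above can be checked with a small sketch. This is an assumption-laden toy, not langid.py's implementation: the helper names, the two "languages", their term probabilities, and the counts are all made up for illustration. It computes the log-posterior with and without the log n! term and shows that the term only shifts every class by the same constant.

```python
import math

def log_factorial(n):
    # lgamma(n + 1) equals log(n!) for non-negative integers.
    return math.lgamma(n + 1)

def log_posteriors(counts, log_p_per_class, log_priors, include_n_factorial=False):
    """Multinomial log-posterior per class, up to the document normalizer."""
    # log(x_1! ... x_k!), the denominator of the multinomial PMF.
    log_denominator = sum(log_factorial(x) for x in counts)
    # log(n!), identical for every class, so optional for classification.
    log_n_fact = log_factorial(sum(counts)) if include_n_factorial else 0.0
    scores = []
    for log_p, log_prior in zip(log_p_per_class, log_priors):
        # log(p_1^x_1 ... p_k^x_k) is a dot product of counts and log-probs.
        log_lik = sum(x * lp for x, lp in zip(counts, log_p))
        scores.append(log_lik - log_denominator + log_n_fact + log_prior)
    return scores

# Hypothetical toy model: 3 features, 2 classes, uniform prior.
counts = [3, 0, 1]
log_p_per_class = [
    [math.log(p) for p in (0.7, 0.2, 0.1)],
    [math.log(p) for p in (0.2, 0.5, 0.3)],
]
log_priors = [math.log(0.5), math.log(0.5)]

without_n = log_posteriors(counts, log_p_per_class, log_priors)
with_n = log_posteriors(counts, log_p_per_class, log_priors, include_n_factorial=True)
best_without = without_n.index(max(without_n))
best_with = with_n.index(max(with_n))
```

Both variants select the same class, which is consistent with the code being able to drop the n! term without affecting the classification.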

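Anyway, the short answer is that this value, pd[cl], is a confidence score. The log-posteriors are negative, so dividing them by their (negative) sum gives positive values that add up to 1, and the selected class, having the log-posterior closest to zero, ends up with the smallest of them. My understanding based on the current code is therefore that the values should fall between 0 and 1/97 (since there are 97 languages), with a smaller value indicating higher confidence.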
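A quick sketch of that behaviour, assuming all log-posteriors are negative and using made-up numbers for three classes instead of 97. The function name `divide_by_sum` and the values are hypothetical; the only thing taken from the quoted code is the order of operations (pick the argmax first, then divide by the sum).

```python
def divide_by_sum(log_posteriors):
    """Mimic `pd /= pd.sum()` applied directly to log-posteriors."""
    total = sum(log_posteriors)
    return [lp / total for lp in log_posteriors]

def score_of_best(log_posteriors):
    # Select the most likely class before rescaling, as the quoted code does.
    best = max(range(len(log_posteriors)), key=lambda i: log_posteriors[i])
    scaled = divide_by_sum(log_posteriors)
    return best, scaled[best], scaled

# Hypothetical log-posteriors: a close call and a clear win for class 0.
close_call = [-55.1, -60.3, -72.8]
clear_win = [-10.0, -60.3, -72.8]

best_a, score_a, scaled_a = score_of_best(close_call)
best_b, score_b, scaled_b = score_of_best(clear_win)
```

With these inputs the winning class receives the smallest rescaled value, it never exceeds 1/K for K classes, and the clearer win produces the smaller score.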
