I'm performing POS tagging with the Stanford POS Tagger. The tagger only returns one possible tagging for the input sentence. For instance, when provided with the input sentence "The clown weeps.", the POS tagger produces the (erroneous) "The_DT clown_NN weeps_NNS ._.".
However, my application will try to parse the result, and may reject a POS tagging because there is no way to parse it. Hence, in this example, it would reject "The_DT clown_NN weeps_NNS ._." but would accept "The_DT clown_NN weeps_VBZ ._.", which I assume is a lower-confidence tagging for the tagger.
I would therefore like the POS tagger to provide multiple hypotheses for the tagging of each word, annotated by some kind of confidence value. In this way, my application could choose the POS tagging with highest confidence that achieves a valid parsing for its purposes.
I have found no way to ask the Stanford POS Tagger to produce multiple (n-best) tagging hypotheses for each word (or even for the whole sentence). Is there a way to do this? (Alternatively, I am also OK with using another POS tagger with comparable performance that would have support for this.)
POS-tagging algorithms fall into two distinct groups: rule-based and stochastic. E. Brill's tagger, one of the first and most widely used English POS taggers, employs rule-based algorithms.
Stochastic POS tagging amounts to finding the sequence of tags that is most likely to have generated a given word sequence. We can model this tagging process with a Hidden Markov Model (HMM), where the tags are the hidden states that produced the observable output, i.e., the words.
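To make the HMM view concrete, here is a minimal sketch that scores every candidate tag sequence for a three-word sentence and ranks them, which also yields an n-best list. All probabilities are invented for illustration and are not taken from any trained tagger. It enumerates sequences exhaustively, which is only feasible for tiny inputs; a real tagger would use Viterbi or beam search instead.

```python
from itertools import product

# Toy HMM parameters, invented for illustration (not estimated from a corpus).
TAGS = ["DT", "NN", "NNS", "VBZ"]
START = {"DT": 0.7, "NN": 0.1, "NNS": 0.1, "VBZ": 0.1}   # P(first tag)
TRANS = {                                                 # P(next tag | tag)
    "DT":  {"DT": 0.01, "NN": 0.60, "NNS": 0.38, "VBZ": 0.01},
    "NN":  {"DT": 0.05, "NN": 0.20, "NNS": 0.15, "VBZ": 0.60},
    "NNS": {"DT": 0.10, "NN": 0.20, "NNS": 0.20, "VBZ": 0.50},
    "VBZ": {"DT": 0.50, "NN": 0.20, "NNS": 0.20, "VBZ": 0.10},
}
EMIT = {                                                  # P(word | tag), partial vocabulary
    "DT":  {"the": 0.9},
    "NN":  {"clown": 0.5},
    "NNS": {"weeps": 0.3},
    "VBZ": {"weeps": 0.2},
}


def sequence_probability(words, tags):
    """Joint probability P(words, tags) under the toy HMM."""
    prob = START[tags[0]] * EMIT[tags[0]].get(words[0], 0.0)
    for i in range(1, len(words)):
        prob *= TRANS[tags[i - 1]][tags[i]] * EMIT[tags[i]].get(words[i], 0.0)
    return prob


def n_best(words, n):
    """Rank all tag sequences with non-zero probability and keep the top n."""
    scored = [
        (sequence_probability(words, tags), tags)
        for tags in product(TAGS, repeat=len(words))
    ]
    scored = [item for item in scored if item[0] > 0.0]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:n]


ranked = n_best(["the", "clown", "weeps"], 2)
for prob, tags in ranked:
    print(tags, prob)
```

With these made-up numbers the verb reading happens to rank first; with different parameters the order could flip, which is exactly why an n-best list with scores is useful.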
OpenNLP supports n-best POS tagging:
Some applications need to retrieve the n-best pos tag sequences and not only the best sequence. The topKSequences method is capable of returning the top sequences. It can be called in a similar way as tag.
Sequence topSequences[] = tagger.topKSequences(sent);
Each Sequence object contains one sequence. The sequence can be retrieved via Sequence.getOutcomes(), which returns a tags array, and Sequence.getProbs() returns the probability array for this sequence.
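Once an n-best list is available, from whichever tagger, the selection step described in the question could look roughly like the sketch below. It is written in Python for brevity and does not call OpenNLP; `first_parsable`, `has_finite_verb`, and the scores are hypothetical stand-ins, with `has_finite_verb` playing the role of the application's real parser.

```python
def first_parsable(hypotheses, is_parsable):
    """Return the highest-scoring (tags, score) pair that is_parsable accepts.

    hypotheses: iterable of (tags, score) pairs, in any order.
    Returns None when no hypothesis is accepted.
    """
    for tags, score in sorted(hypotheses, key=lambda h: h[1], reverse=True):
        if is_parsable(tags):
            return tags, score
    return None


# Hypothetical n-best output for "The clown weeps ."; the scores are made up.
hypotheses = [
    (["DT", "NN", "NNS", "."], 0.55),
    (["DT", "NN", "VBZ", "."], 0.40),
]


def has_finite_verb(tags):
    # Stand-in for a real parser: accept only taggings that contain a verb tag.
    return any(tag.startswith("VB") for tag in tags)


chosen = first_parsable(hypotheses, has_finite_verb)
print(chosen)
```

The top-scoring hypothesis is rejected because it contains no verb, so the second one is returned.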
There is also a way to make spaCy do something like this:
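# This appears to target the spaCy 2.x API used in the source issue.
import numpy
from spacy.language import Language
from spacy.pipeline import Tagger
from spacy.tokens import Doc, Token

# Per-document score matrix, plus a per-token view into it.
Doc.set_extension('tag_scores', default=None)
Token.set_extension('tag_scores', getter=lambda token: token.doc._.tag_scores[token.i])

class ProbabilityTagger(Tagger):
    def predict(self, docs):
        tokvecs = self.model.tok2vec(docs)
        scores = self.model.softmax(tokvecs)
        guesses = []
        for i, doc_scores in enumerate(scores):
            # Keep the full distribution instead of only the argmax.
            docs[i]._.tag_scores = doc_scores
            doc_guesses = doc_scores.argmax(axis=1)
            if not isinstance(doc_guesses, numpy.ndarray):
                # Non-numpy (e.g. GPU) arrays are copied back to the host.
                doc_guesses = doc_guesses.get()
            guesses.append(doc_guesses)
        return guesses, tokvecs

# Replace the default tagger factory so loaded pipelines use the subclass.
Language.factories['tagger'] = lambda nlp, **cfg: ProbabilityTagger(nlp.vocab, **cfg)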
Then each token will have tag_scores with the probabilities for each part of speech from spaCy's tag map.
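To turn such per-token scores into ranked alternatives, something along these lines might work. This is a sketch in plain Python with no spaCy dependency: `top_k_tags`, the label order, and the numbers are assumptions for illustration. In practice the rows would come from the tag_scores array, and the label order from the tagger, which depends on the spaCy version.

```python
def top_k_tags(scores, labels, k=3):
    """For each token's score row, return the k best (label, score) pairs.

    scores: one row of floats per token, aligned with labels.
    """
    result = []
    for row in scores:
        ranked = sorted(zip(labels, row), key=lambda pair: pair[1], reverse=True)
        result.append(ranked[:k])
    return result


# Hypothetical label order and made-up scores for "The clown weeps".
labels = ["DT", "NN", "NNS", "VBZ"]
scores = [
    [0.97, 0.01, 0.01, 0.01],  # "The"
    [0.02, 0.90, 0.05, 0.03],  # "clown"
    [0.01, 0.04, 0.55, 0.40],  # "weeps"
]

per_token = top_k_tags(scores, labels, k=2)
for alternatives in per_token:
    print(alternatives)
```

Note that these are independent per-token scores, not sequence probabilities, so combining the second-best tag of several tokens does not necessarily give the second-best tagging of the whole sentence.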
Source: https://github.com/explosion/spaCy/issues/2087