
NLTK RegexpParser, chunk phrase by matching exactly one item

I'm using NLTK's RegexpParser to chunk a noun phrase, which I define with a grammar as

 grammar = "NP: {<DT>?<JJ>*<NN|NNS>+}"
 cp = RegexpParser(grammar)

This works well; it matches a noun phrase as:

  • DT if it exists
  • JJ in whatever number
  • NN or NNS, at least one

Now, what if I want the same match, but with exactly one JJ instead of any number? That is, I want to match DT if it exists, one JJ, and one or more NN/NNS. If there is more than one JJ, I want to match only the one nearest to the noun (along with DT, if present, and the NN/NNS).

The grammar

grammar = "NP: {<DT>?<JJ><NN|NNS>+}"

would match only when there is just one JJ, while the grammar

grammar = "NP: {<DT>?<JJ>{1}<NN|NNS>+}"

which I thought would work given typical regular-expression syntax, raises a ValueError.

For example, in "This beautiful green skirt", I'd like to chunk "This green skirt".

So, how would I proceed?

asked Dec 06 '25 07:12 by mar tin



1 Answer

The grammar grammar = "NP: {<DT>?<JJ><NN|NNS>+}" is correct for the requirement you stated.

In the example you gave in the comments, where DT is missing from the output:

"This beautiful green skirt is for you."

Tree('S', [('This', 'DT'), ('beautiful', 'JJ'), Tree('NP', [('green','JJ'), 
('skirt', 'NN')]), ('is', 'VBZ'), ('for', 'IN'), ('you', 'PRP'), ('.', '.')])

In this example there are two consecutive JJs, which does not meet the requirement as you stated it: "I want to match DT if it exists, one JJ and 1+ NN/NNS." The DT is not adjacent to the JJ that directly precedes the noun, so it is left out of the chunk.
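To see why the DT drops out, it can help to emulate the tag pattern with Python's re module. This is only a rough sketch, not NLTK's own code: the tag string and the hand-written regex below are assumptions made for illustration, standing in for the tag pattern <DT>?<JJ><NN|NNS>+.

```python
import re

# Tags of "This beautiful green skirt is for you ." written as one string.
# This representation is an assumption made for illustration only.
tags = "<DT><JJ><JJ><NN><VBZ><IN><PRP><.>"

# Hypothetical regex equivalent of the tag pattern <DT>?<JJ><NN|NNS>+
strict = re.compile(r"(<DT>)?(<JJ>)(<NN>|<NNS>)+")

match = strict.search(tags)
matched = match.group(0)
print(matched)  # <JJ><NN>
```

The match cannot start at <DT>, because exactly one <JJ> must sit between it and the noun, so the only match starts at the second <JJ>.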


For the updated requirement: "I want to match DT if it exists, one JJ and 1+ NN/NNS. If there are more than one JJ, I want to match only one of them, the one nearest to the noun (and DT if there is, and NN/NNS)."

Here, you will need to use

grammar = "NP: {<DT>?<JJ>*<NN|NNS>+}"

and post-process the NP chunks to remove the extra JJs.

Code:

from nltk import Tree

chunk_output = Tree('S', [Tree('NP', [('This', 'DT'), ('beautiful', 'JJ'), ('green','JJ'), ('skirt', 'NN')]), ('is', 'VBZ'), ('for', 'IN'), ('you', 'PRP'), ('.', '.')])

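for child in chunk_output:
    # Only NP subtrees are chunks; plain (word, tag) tuples are skipped.
    if isinstance(child, Tree) and child.label() == 'NP':
        for num in range(len(child)):
            # Skip a JJ that is directly followed by another JJ,
            # so only the JJ nearest the noun is printed.
            is_last = num == len(child) - 1
            if not (child[num][1] == 'JJ' and not is_last and child[num+1][1] == 'JJ'):
                print(child[num][0])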

Output:

This
green
skirt
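
If you need the filtered chunk itself rather than printed words, the same rule can be wrapped in a small function. This is a minimal sketch: keep_nearest_adjective is a hypothetical helper name, not part of NLTK, and it assumes the chunk is a sequence of (word, tag) pairs, which is also how the leaves of an NP Tree are indexed in the code above.

```python
def keep_nearest_adjective(chunk):
    """Return the (word, tag) pairs of chunk, dropping every JJ
    that is immediately followed by another JJ."""
    kept = []
    for i, (word, tag) in enumerate(chunk):
        # Tag of the following token, or None at the end of the chunk.
        next_tag = chunk[i + 1][1] if i + 1 < len(chunk) else None
        if tag == 'JJ' and next_tag == 'JJ':
            continue  # a later JJ sits closer to the noun
        kept.append((word, tag))
    return kept

np_chunk = [('This', 'DT'), ('beautiful', 'JJ'), ('green', 'JJ'), ('skirt', 'NN')]
result = keep_nearest_adjective(np_chunk)
print(result)  # [('This', 'DT'), ('green', 'JJ'), ('skirt', 'NN')]
```

Returning a new list leaves the original chunk untouched, so the full NP is still available if you need it later.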
answered Dec 07 '25 20:12 by RAVI



