I'm trying to use NLTK's grammar and parsing algorithms, as they seem pretty simple to use. However, I can't find a way to match an alphanumeric string properly, something like:
import nltk
grammar = nltk.parse_cfg ("""
# Is this possible?
TEXT -> \w*  
""")
parser = nltk.RecursiveDescentParser(grammar)
print parser.parse("foo")
Is there an easy way to achieve this?
It would be very difficult to do cleanly. The base parser classes rely on exact matches against the production RHS to pop content, so it would require subclassing and rewriting large parts of the parser class. I attempted it a while ago with the feature grammar class and gave up.
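To make the "exact match" point concrete, here is a toy recognizer in plain Python. It is not NLTK's code and does not use NLTK at all; `GRAMMAR`, `terminal_matches`, `derive` and `recognizes` are names made up for this sketch. The only place a regex could be plugged in is the terminal comparison, and in the real parser classes that comparison is a plain equality test.

```python
import re

# Toy grammar: each nonterminal maps to a list of alternatives, and each
# alternative is a list of symbols. Symbols that are not keys are terminals.
GRAMMAR = {
    "S": [["TEXT"]],
    "TEXT": [["WORD", "TEXT"], ["WORD"]],
    "WORD": [[re.compile(r"\w+")]],  # a compiled regex used as a terminal
}


def terminal_matches(symbol, token):
    """Compare plain strings exactly; match compiled patterns against the whole token."""
    if isinstance(symbol, re.Pattern):
        return symbol.fullmatch(token) is not None
    return symbol == token


def derive(symbols, tokens):
    """Yield the leftover tokens for every way `symbols` matches a prefix of `tokens`."""
    if not symbols:
        yield tokens
        return
    first, rest = symbols[0], symbols[1:]
    if isinstance(first, str) and first in GRAMMAR:
        # Nonterminal: try each alternative in turn (recursive descent).
        for alternative in GRAMMAR[first]:
            yield from derive(alternative + rest, tokens)
    elif tokens and terminal_matches(first, tokens[0]):
        # Terminal: consume one token only if it matches.
        yield from derive(rest, tokens[1:])


def recognizes(text):
    """Return True if the whitespace-separated tokens of `text` derive from S."""
    tokens = text.split()
    return any(remaining == [] for remaining in derive(["S"], tokens))
```

Calling `recognizes("foo hello 123")` accepts the input because every token fully matches `\w+`, while a token such as `!!` is rejected.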
What I did instead is more of a hack: I extract the regex matches from the text first and add them to the grammar as productions. It will be very slow with a large grammar, since the grammar and the parser have to be rebuilt on every call.
import re
import nltk
from nltk.grammar import Nonterminal, Production, ContextFreeGrammar
grammar = nltk.parse_cfg ("""
S -> TEXT
TEXT -> WORD | WORD TEXT | NUMBER | NUMBER TEXT
""")
productions = grammar.productions()
def literal_production(key, rhs):
    """Return the production <key> -> rhs.

    :param key: symbol name for the left-hand side
    :param rhs: string literal for the right-hand side
    """
    lhs = Nonterminal(key)
    return Production(lhs, [rhs])
def parse(text):
    """Parse some text, adding a production for each word and number in it."""
    # extract new words and numbers
    words = set([match.group(0) for match in re.finditer(r"[a-zA-Z]+", text)])
    numbers = set([match.group(0) for match in re.finditer(r"\d+", text)])
    # Make a local copy of productions
    lproductions = list(productions)
    # Add a production for every word and number
    lproductions.extend([literal_production("WORD", word) for word in words])
    lproductions.extend([literal_production("NUMBER", number) for number in numbers])
    # Make a local copy of the grammar with extra productions
    lgrammar = ContextFreeGrammar(grammar.start(), lproductions)
    # Load grammar into a parser
    parser = nltk.RecursiveDescentParser(lgrammar)
    tokens = text.split()
    return parser.parse(tokens)
print parse("foo hello world 123 foo")
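The code above targets Python 2 and the NLTK 2 API. In NLTK 3, `nltk.parse_cfg` was replaced by `nltk.CFG.fromstring` and `ContextFreeGrammar` was renamed `CFG`. As a rough sketch of the same idea that avoids the production classes entirely, the per-input lexical rules can be generated as grammar text. The helper names `lexical_rules` and `grammar_text` are made up here, and handing the result to `nltk.CFG.fromstring` is an assumption that is not exercised by this snippet.

```python
import re

# Fixed part of the grammar; WORD and NUMBER are filled in per input.
BASE_RULES = """\
S -> TEXT
TEXT -> WORD | WORD TEXT | NUMBER | NUMBER TEXT
"""


def lexical_rules(text):
    """Return grammar lines mapping WORD and NUMBER to the literals found in `text`."""
    words = sorted(set(re.findall(r"[a-zA-Z]+", text)))
    numbers = sorted(set(re.findall(r"\d+", text)))
    lines = []
    if words:
        lines.append("WORD -> " + " | ".join('"%s"' % w for w in words))
    if numbers:
        lines.append("NUMBER -> " + " | ".join('"%s"' % n for n in numbers))
    return lines


def grammar_text(text):
    """Return the base rules followed by the lexical rules for `text`."""
    return BASE_RULES + "\n".join(lexical_rules(text)) + "\n"
```

The same caveat applies as with the original: the text is tokenized with `split()`, so a token that mixes letters and digits, or carries punctuation, will not equal any generated literal and the parse will fail.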
There's more background in this discussion on the nltk-users Google Group: https://groups.google.com/d/topic/nltk-users/4nC6J7DJcOc/discussion