I'm using lark, an excellent python parsing library.
It provides an Earley and LALR(1) parser and is defined through a custom EBNF format. (EBNF stands for Extended Backus–Naur form).
Lowercase definitions are rules, uppercase definitions are terminals. Lark also provides a weight for uppercase definitions to prioritize the matching.
I'm trying to define a grammar but I'm stuck with a behavior I can't seem to balance.
I have some rules with unnamed literals (the strings or characters between double-quotes):
directives: directive+
directive: "@" NAME arguments ?
directive_definition: description? "directive" "@" NAME arguments? "on" directive_locations
directive_locations: "SCALAR" | "OBJECT" | "ENUM"
arguments: "(" argument+ ")"
argument: NAME ":" value
union_type_definition: description? "union" NAME directives? union_member_types?
union_member_types: "=" NAME ("|" NAME)*
description: STRING | LONG_STRING    
STRING: /("(?!"").*?(?<!\\)(\\\\)*?"|'(?!'').*?(?<!\\)(\\\\)*?')/i
LONG_STRING: /(""".*?(?<!\\)(\\\\)*?"""|'''.*?(?<!\\)(\\\\)*?''')/is
NAME.2: /[_A-Za-z][_0-9A-Za-z]*/
It works well for 99% of use case. But if, in my parsed language, I use a directive which is called directive, everything breaks:
union Foo @something(test: 42) = Bar | Baz   # This works
union Foo @directive(test: 42) = Bar | Baz   # This fails
Here, the directive string is matched on the unnamed literal in the directive_definition rule when it should match the NAME.2 terminal.
How can I balance / adjust this so there is no ambiguity possible for the LALR(1) parser ?
Author of Lark here.
This misinterpretation happens because "directive" can be two different tokens: The "directive" string, or NAME. By default, Lark's LALR lexer always chooses the more specific one, namely the string.
So how can we let the lexer know that @directive is a name, and not just two constant strings?
Solution 1 - Use the Contextual Lexer
What would probably help in this situation (it's hard to be sure without the full grammar), is to use the contextual lexer, instead of the standard LALR(1) lexer.
The contextual lexer can communicate to some degree with the parser, to figure out which terminal makes more sense at each point. This is an algorithm that is unique to Lark, and you can use it like this:
parser = Lark(grammar, parser="lalr", lexer="contextual")
(This lexer can do anything the standard lexer can do and more, so in future versions it might become the default lexer.)
Solution 2 - Prefix the terminal
If the contextual lexer doesn't solve your collision, a more "classic" solution to this situation would be to define a directive token, something like:
DIRECTIVE: "@" NAME
Unlike your directive rule, this leaves no ambiguity to the lexer. There is a clear distinction between a directive, and the "directive" string (or NAME terminal).
And if all else fails, you can always use the Earley parser, which at the price of performance, will work with any grammar you give it, regardless of how many collisions there might be.
Hope this helps!
Edit: I'd just like to point out the the contextual lexer is the default for LALR now, so it's enough to call:
parser = Lark(grammar, parser="lalr")
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With