I'm using Instaparse to parse expressions like:
$(foo bar baz $(frob))
into something like:
[:expr "foo" "bar" "baz" [:expr "frob"]]
I've almost got it, but I'm having trouble with ambiguity. Here's a simplified version of my grammar that reproduces the problem; it attempts to rely on negative lookahead.
(def simple
  (insta/parser
    "expr = <dollar> <lparen> word (<space> word)* <rparen>
     <word> = !(dollar lparen) #'.+' !(rparen)
     <space> = #'\\s+'
     <dollar> = <'$'>
     <lparen> = <'('>
     <rparen> = <')'>"))

(simple "$(foo bar)")
which errors:
Parse error at line 1, column 11:
$(foo bar)
          ^
Expected one of:
")"
#"\s+"
Here I've said a word can be any char, in order to support expressions like:
$(foo () `bar` b-a-z)
etc. Note that a word can contain () but it cannot contain $(). I'm not sure how to express this in the grammar. It seems the problem is that <word> is too greedy, consuming the last ) instead of leaving it for expr.
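The greediness can be seen without Instaparse at all. As far as I understand it, a regex terminal is matched once at the current position and the parser takes whatever that single match returns; it does not retry the regex with a shorter match. The sketch below imitates that in plain Clojure. prefix-match is a hypothetical helper (not part of Instaparse) that uses Matcher.lookingAt to match only at the start of the remaining input:

```clojure
;; Hypothetical helper, not part of Instaparse: match `re` only at the
;; start of `s` and return the matched text, or nil if there is no match.
(defn prefix-match [re s]
  (let [^java.util.regex.Matcher m (re-matcher re s)]
    (when (.lookingAt m)
      (.group m))))

;; Remaining input once "$(" has been consumed from "$(foo bar)".
(def remaining "foo bar)")

;; `.+` takes everything, including the space and the closing paren.
(prefix-match #".+" remaining)        ; => "foo bar)"

;; A class that excludes the delimiters stops before the space.
(prefix-match #"[^ $()]+" remaining)  ; => "foo"
```

By the time !(rparen) is checked, the ) has already been consumed, so the lookahead succeeds trivially and expr has nothing left to match against its own <rparen>.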
Update: removed whitespace from word:
(def simple2
  (insta/parser
    "expr = <dollar> <lparen> word (<space> word)* <rparen>
     <word> = !(dollar lparen) #'[^ ]+' !(rparen)
     <space> = #'\\s+'
     <dollar> = <'$'>
     <lparen> = <'('>
     <rparen> = <')'>"))

(simple2 "$(foo bar)")
; Parse error at line 1, column 11:
; $(foo bar)
;           ^
; Expected one of:
; ")"
; #"\s+"

(simple2 "$(foo () bar)")
; Parse error at line 1, column 14:
; $(foo () bar)
;              ^
; Expected one of:
; ")"
; #"\s+"
Update 2: more test cases
(simple2 "$(foo bar ())")
(simple2 "$((foo bar baz))")
Update 3: full working parser
For anyone curious, the full working parser, which was outside the scope of this question, is:
(def parse
  "expr - the top-level expression made up of cmds and sub-exprs. When multiple
   cmds are present, it implies they should be successively piped.
   cmd - a single command consisting of words.
   sub-expr - a backticked or $(..)-style sub-expression to be evaluated inline.
   parened - a grouping of words wrapped in parentheses, explicitly tokenized to
   allow parentheses in cmds and disambiguate between sub-expression
   syntax."
  (insta/parser
    "expr = cmd (<space> <pipe> <space> cmd)*
     cmd = words
     <sub-expr> = <backtick> expr <backtick> | nestable-sub-expr
     <nestable-sub-expr> = <dollar> <lparen> expr <rparen>
     words = word (<space>* word)*
     <word> = sub-expr | parened | word-chars
     <word-chars> = #'[^ `$()|]+'
     parened = lparen words rparen
     <space> = #'[ ]+'
     <pipe> = #'[|]'
     <dollar> = <'$'>
     <lparen> = '('
     <rparen> = ')'
     <backtick> = <'`'>"))
Example usage:
(parse "foo bar (qux) $(clj (map (partial * $(js 45 * 2)) (range 10))) `frob`")
Parses to:
[:expr [:cmd [:words "foo" "bar" [:parened "(" [:words "qux"] ")"] [:expr [:cmd [:words "clj" [:parened "(" [:words "map" [:parened "(" [:words "partial" "*" [:expr [:cmd [:words "js" "45" "*" "2"]]]] ")"] [:parened "(" [:words "range" "10"] ")"]] ")"]]]] [:expr [:cmd [:words "frob"]]]]]]
This is a parser for a chatbot I wrote, yetibot. It replaces the previous mess of regex-based, by-hand parsing.
I don't really know Instaparse, so I just read enough documentation to give me a false sense of security. I also didn't test this, and I don't really know what your requirements are.
In particular, I don't know:
1) Whether $() can nest (your grammar makes that impossible, I think, but it seems odd to me)
2) Whether () can contain whitespace without being parsed as more than one word
3) Whether () can contain $()
You'll need to be clear on things like this in order to write the grammar (or, as it happens, to ask for advice).
Update: Revised the grammar based on comments. I removed the productions for $ ( and ) because they seemed unnecessary, and this way the angle-brackets feel easier to deal with.
The following is based on answering the above questions "yes, no, yes" and some random assumptions about regex format. (I'm not totally clear on how angle-brackets work, but I don't think it will be easy to make parentheses output the way you want; I settled for just outputting them as single elements. If I figure out something, I'll edit it.)
<sequence> = element (<space> element)*
<element> = expr | paren_sequence | word
expr = <'$'> <'('> sequence <')'>
<word> = !('$'? '(') #'([^ $()]|\\$[^(])+'
<paren_sequence> = '(' sequence ')'
<space> = #'\\s+'
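To see what the word regex accepts on its own, here is a small sketch in plain Clojure. prefix-match is a hypothetical helper (not an Instaparse function) that matches only at the start of the string, and the pattern is written with a non-capturing group so the whole match is returned; otherwise it is the same regex as in the word rule. word-re-lookahead is an assumed alternative, not something from the grammar above. None of this has been run through Instaparse itself.

```clojure
;; Hypothetical helper: match `re` only at the start of `s`.
(defn prefix-match [re s]
  (let [^java.util.regex.Matcher m (re-matcher re s)]
    (when (.lookingAt m)
      (.group m))))

;; The word regex from the grammar, with a non-capturing group.
(def word-re #"(?:[^ $()]|\$[^(])+")

(prefix-match word-re "frob)")   ; => "frob"  stops before the closing paren
(prefix-match word-re "a$b c")   ; => "a$b"   a lone $ is allowed inside a word
(prefix-match word-re "a$(b)")   ; => "a"     stops before a nested $(
(prefix-match word-re "a$)")     ; => "a$)"   the char after $ is consumed too

;; Assumed alternative: check the char after $ without consuming it.
(def word-re-lookahead #"(?:[^ $()]|\$(?!\())+")

(prefix-match word-re-lookahead "a$)")    ; => "a$"
(prefix-match word-re-lookahead "a$(b)")  ; => "a"
```

The fourth case is a caveat worth knowing about: \$[^(] consumes whatever follows the dollar sign, so a word ending in $ swallows the closing parenthesis (or a following space). If that matters for your input, the lookahead form may be the safer choice.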
Hope that helps a bit.