
Using a grammar to parse a string without lookahead?

Tags: grammar, raku

Got this text:

Want this || Not this

The line may also look like this:

Want this | Not this

with a single pipe.

I'm using this grammar to parse it:

    grammar HC {
       token TOP {  <pre> <divider> <post> }
       token pre { \N*? <?before <divider>> }
       token divider { <[|]> ** 1..2 } 
       token post { \N* }
    } 

Is there a better way to do this? I'd love to be able to do something more like this:

    grammar HC {
       token TOP {  <pre> <divider> <post> }
       token pre { \N*? }
       token divider { <[|]> ** 1..2 }
       token post { \N* }
    } 

But this does not work. And if I do this:

    grammar HC {
       token TOP {  <pre>* <divider> <post> }
       token pre { \N }
       token divider { <[|]> ** 1..2 }
       token post { \N* }
    } 

Each character before divider gets its own <pre> capture. Thanks.

StevieD asked Oct 17 '25 13:10


2 Answers

As always, TIMTOWTDI.

> I'd love to be able to do something more like this

You can. Just switch the first two rule declarations from token to regex:

    grammar HC {
       regex TOP {  <pre> <divider> <post> }
       regex pre { \N*? }
       token divider { <[|]> ** 1..2 }
       token post { \N* }
    }

This works because regex does not enable :ratchet, so the engine is allowed to backtrack (unlike token and rule, which do enable it).

(Explaining why you need to switch it off for both rules is beyond my paygrade, certainly for tonight, and quite possibly till someone else explains why to me so I can pretend I knew all along.)
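A minimal sketch for checking the regex version against both sample lines from the question. My tentative reading of why both declarations have to change: pre must be a regex so that its frugal \N*? can be asked for a longer match later, and TOP must be a regex so that it actually goes back into pre after divider fails. Treat that as an assumption rather than a definitive explanation.

```raku
# Same grammar as above: TOP and pre are regexes, so backtracking is allowed.
grammar HC {
    regex TOP     { <pre> <divider> <post> }
    regex pre     { \N*? }                  # frugal: starts empty, grows on demand
    token divider { <[|]> ** 1..2 }
    token post    { \N* }
}

# Sample lines taken from the question (no trailing newline).
for 'Want this || Not this', 'Want this | Not this' -> $line {
    my $m = HC.parse($line);
    say "pre='{$m<pre>}' divider='{$m<divider>}' post='{$m<post>}'";
}
```

Both lines should parse, with the same pre and post text and the divider captured as || or | respectively.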

> if I do this ... each character gets its own <pre> capture

By default, "calling a named regex installs a named capture with the same name" [... couple sentences later:] "If no capture is desired, a leading dot or ampersand will suppress it". So change <pre> to <.pre>.

Next, you can manually add a named capture by wrapping a pattern in $<name>=[pattern]. So to capture the whole string matched by consecutive calls of the pre rule, wrap the non-capturing pattern <.pre>*? in $<pre>=[...]:

    grammar HC {
       token TOP { $<pre>=[<.pre>*?] <divider> <post> }
       token pre { \N }
       token divider { <[|]> ** 1..2 }
       token post { \N* }
    }
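To see the two capture mechanisms in isolation, here is a small sketch on a made-up grammar. The Demo grammar, its rule names and the 'abc=42' input are hypothetical and not part of the question. The leading dot keeps letter and sep out of the match tree, while $<name>=[...] collects everything the repeated <.letter> calls consumed into one capture.

```raku
# Hypothetical grammar, used only to show <.rule> and $<name>=[...] together.
grammar Demo {
    token TOP    { $<name>=[ <.letter>+ ] <.sep> <value> }
    token letter { <[a..z]> }
    token sep    { '=' }
    token value  { \d+ }
}

my $m = Demo.parse('abc=42');
say $m<name>.Str;            # abc
say $m<value>.Str;           # 42
say $m.hash.keys.sort;       # only the named captures: name and value
```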
raiph answered Oct 21 '25 04:10


OK, I tried use Grammar::Tracer; (our best friend!) and got the trace below from both your original grammar and the regex version in the first answer. Both seemed wrong to me:

TOP
|  pre
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * FAIL
|  |  divider
|  |  * MATCH "|"
|  * MATCH "Want this "
|  divider
|  * MATCH "|"
|  post
|  * MATCH " Not this"
* MATCH "Want this | Not this"
「Want this | Not this」
 pre => 「Want this 」
 divider => 「|」
 post => 「 Not this」

This gives me the feeling that your combination of pre and divider is not converging. So I altered the code to this (with a more definitive definition of pre):

    use Grammar::Tracer;

    grammar HC {
       token TOP {  <pre> <divider> <post> }
       token pre {  <-[|]>* }
       token divider { <[|]> ** 1..2 }
       token post { \N* }
    }

and got this...

TOP
|  pre
|  * MATCH "Want this "
|  divider
|  * MATCH "|"
|  post
|  * MATCH " Not this"
* MATCH "Want this | Not this"
「Want this | Not this」
 pre => 「Want this 」
 divider => 「|」
 post => 「 Not this」

Sooo, I conclude that (i) using Grammar::Tracer to inspect the operation of grammars is a must-do, (ii) a loose definition like the original, which makes the parser test for the divider at every character boundary, should be avoided, and (iii) this matters especially when the divider is hard to pin down.

I have the wider feeling that a Grammar (parser) may not be well suited to the underlying raw data structure and that a set of regexes may be a better approach.
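As a rough illustration of the regex-only route, the sketch below matches a whole line with one regex and three named captures. The variable name $line-re is my own choice, and I am assuming each input is a single line without a trailing newline.

```raku
# One regex with three named captures; no grammar involved.
my $line-re = / ^ $<pre>=[ <-[|]>* ] $<divider>=[ '|' ** 1..2 ] $<post>=[ \N* ] $ /;

for 'Want this || Not this', 'Want this | Not this' -> $line {
    # .match returns Nil on failure, so `with` skips lines that do not fit.
    with $line.match($line-re) -> $m {
        say "pre='{$m<pre>}' divider='{$m<divider>}' post='{$m<post>}'";
    }
}
```

Because pre is defined as "anything that is not a pipe", nothing here depends on backtracking, which is the same idea as the tightened grammar above.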

I failed to work out how to use <.ws> or an equivalent to trim the surrounding whitespace from the captured results.
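One possible workaround, sketched here as an assumption rather than as the <.ws> solution: leave the grammar alone and attach trimmed copies of the captures through an actions class. The class name HC-Trim is hypothetical.

```raku
grammar HC {
    token TOP     { <pre> <divider> <post> }
    token pre     { <-[|]>* }
    token divider { <[|]> ** 1..2 }
    token post    { \N* }
}

# Hypothetical actions class: store a trimmed copy of each capture via make.
class HC-Trim {
    method pre($/)  { make $/.Str.trim }
    method post($/) { make $/.Str.trim }
}

my $m = HC.parse('Want this | Not this', actions => HC-Trim);
say $m<pre>.made;    # Want this
say $m<post>.made;   # Not this
```

The raw captures ($m<pre>, $m<post>) still contain the spaces; only the .made values are trimmed. Calling .Str.trim on the captures after parsing would do the same without an actions class.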

p6steve answered Oct 21 '25 04:10


