Repeated elements in regexes of literals with alternation in Perl 6

Question

To explain what I mean, I'll give an example from German. Here is a sample grammar that can parse several present-tense verb forms.

grammar Verb {
    token TOP {
        <base>
        <ending>
    }
    token base {
        geh   |
        spiel |
        mach
    }
    token ending {
        e     |  # 1sg
        st    |  # 2sg
        t     |  # 3sg
        en    |  # 1pl
        t     |  # 2pl
        en       # 3pl
    }
}

my @verbs = <gehe spielst machen>;
for @verbs -> $verb {
  my $match = Verb.parse($verb);
  say $match;
}

The endings for 1pl and 3pl (en) are the same, but for the sake of clarity it's more convenient to list them both in the token (in my real-life data the inflexional paradigms are much more complex, and it's easy to get lost). The token ending works as expected, but I understand that if I listed en only once, the program would run a bit faster (I've run tests with regexes consisting of many such repeated elements, and yes, performance suffers greatly). My data contains lots of such repetitions, so I wonder: what is the best way to handle them?

Of course, I could put the endings in an array outside the grammar, make this array .unique and then just pass the values:

my @endings = < ... >;
@endings .= unique;
...
token ending { @endings }

But taking the data out of the grammar would make it less clear. Also, in some cases it might be necessary to make each ending a separate token (token ending {<ending_1sg> | <ending_2sg> ... <ending_3pl>}), which would be impossible if the endings were defined outside the grammar.
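For reference, the array-based variant described above might look like the following minimal sketch. The contents of @endings are only an illustrative assumption (the six endings from the example grammar); an array interpolated into a regex is treated like a | alternation of its elements.

```raku
# Illustrative list of endings, repeats included (an assumption for this sketch).
my @endings = <e st t en t en>.unique;   # de-duplicated: e st t en

grammar Verb {
    token TOP    { <base> <ending> }
    token base   { geh | spiel | mach }
    # An interpolated array behaves like a | alternation of its elements.
    token ending { @endings }
}

for <gehe spielst machen> -> $verb {
    my $match = Verb.parse($verb);
    say "$verb = {$match<base>} + {$match<ending>}";
}
```

The de-duplication happens once, before the grammar is used, so the token itself never sees the repeated elements.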

piojo · Accepted Answer

If I understand you correctly, for the sake of clarity you want to repeat regex terms, each with a comment noting that it's a separate concept? Then just comment the repeated line out.

By the way, since an empty first branch is ignored in this case, it's okay to begin each line with the branch operator instead of putting it at the end. It makes things easier, especially when you need to add and remove lines. So I suggest something like this:

grammar Verb {
    # ...
    token ending {
        | e       # 1sg
        | st      # 2sg
        | t       # 3sg
        | en      # 1pl
        #| t       # 2pl
        #| en      # 3pl
    }
}

After all, the repetition is there purely for the human reader, not for the parser. If, however, you want the different endings to go into different named captures, so that you could access the ending as either $<_3sg> or $<_2pl> (named regexes, both of which would succeed), you can't comment anything out, and you'll have to make the computer do the extra work. You would also need to use :exhaustive or :overlap. Instead, I would suggest a single named regex that represents both 3sg and 2pl, defined as I did above: write both lines but comment one out.
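A minimal sketch of that last suggestion, assuming one named token per distinct form. The token names (ending_3sg_2pl and so on) and the %readings lookup table are hypothetical, introduced only for illustration:

```raku
grammar Verb {
    token TOP  { <base> <ending> }
    token base { geh | spiel | mach }
    token ending {
        | <ending_1sg>
        | <ending_2sg>
        | <ending_3sg_2pl>
        | <ending_1pl_3pl>
    }
    # One token per distinct form; the name records every reading it covers.
    token ending_1sg     { e  }
    token ending_2sg     { st }
    token ending_3sg_2pl { t  }
    token ending_1pl_3pl { en }
}

# Hypothetical table mapping each token name to its grammatical readings.
my %readings =
    ending_1sg     => ['1sg'],
    ending_2sg     => ['2sg'],
    ending_3sg_2pl => ['3sg', '2pl'],
    ending_1pl_3pl => ['1pl', '3pl'];

for <gehe spielst macht machen> -> $verb {
    my $match = Verb.parse($verb);
    # The single named capture inside <ending> says which token matched.
    my $name  = $match<ending>.hash.keys.head;
    say "$verb = {%readings{$name}.join(' or ')}";
}
```

Each form is matched only once, and the ambiguity between readings is resolved after parsing rather than inside the regex.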

Tags:

raku

Eugene Barsky
