What is the best way to handle repeated alternatives in regexes like abc | cde | abc | cde | cde | abc or <regex1> | <regex2> | <regex3> | <regex4> | <regex5> | <regex6>, where many of the regexN are the same literal?
To explain what I mean, here is an example from German: a sample grammar that can parse several present-tense verb forms.
grammar Verb {
    token TOP {
        <base>
        <ending>
    }
    token base {
        geh |
        spiel |
        mach
    }
    token ending {
        e |  # 1sg
        st | # 2sg
        t |  # 3sg
        en | # 1pl
        t |  # 2pl
        en   # 3pl
    }
}
my @verbs = <gehe spielst machen>;
for @verbs -> $verb {
    my $match = Verb.parse($verb);
    say $match;
}
The endings for 1pl and 3pl (en) are the same, but for the sake of clarity it's more convenient to put them both into the token (in my real-life data the inflexional paradigms are much more complex, and it's easy to get lost). The token ending works as expected, but I understand that if I listed en only once, the program would run a bit faster (I've run tests with regexes consisting of many such repeated elements, and yes, the performance suffers greatly). My data contains lots of such repetitions, so I wonder what the best way to handle them is.
Of course, I could put the endings in an array outside the grammar, make this array .unique and then just pass the values:
my @endings = < ... >;
@endings .= unique;
...
token ending { @endings }
But taking the data out of the grammar would make it less clear. Also, in some cases it might be necessary to make each ending a separate token (token ending {<ending_1sg> | <ending_2sg> ... <ending_3pl>}), which would be impossible if the endings are defined outside the grammar.
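For reference, the array-based variant described above could look roughly like the following sketch. It assumes the six endings from the sample grammar and relies on an array interpolated into a regex being treated as an alternation of its elements; it is an illustration of the idea, not a recommendation.

```raku
# One entry per paradigm slot (1sg 2sg 3sg 1pl 2pl 3pl); duplicates are dropped.
my @endings = <e st t en t en>.unique;

grammar Verb {
    token TOP    { <base> <ending> }
    token base   { geh | spiel | mach }
    token ending { @endings }   # the array is matched as an alternation
}

for <gehe spielst machen> -> $verb {
    say Verb.parse($verb);
}
```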
If I understand you correctly, for the sake of clarity you want to repeat regex terms, each with a comment noting that it is a separate concept? Just comment the duplicate line out.
By the way, since an empty first branch is ignored in an alternation, it's okay to begin each line with the branch operator instead of putting it at the end. That makes things easier, especially when you need to add and remove lines. So I suggest something like this:
grammar Verb {
    # ...
    token ending {
        | e   # 1sg
        | st  # 2sg
        | t   # 3sg
        | en  # 1pl
        #| t  # 2pl
        #| en # 3pl
    }
}
This works because the repeated lines are written exclusively for the human reader, not for the parser. Now, if you wanted the different regexes to go into different parse matches, so that you could access the ending as either $<_3sg> or $<_2pl> (named regexes, so that both would succeed), you can't comment anything out, and you'll have to make the computer do the extra work. You would also need to use :exhaustive or :overlap. Instead, I would suggest a single named regex that represents both 3sg and 2pl, defined the same way as above: write both lines but comment one out.
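As a rough sketch of that last suggestion, the grammar below gives each distinct ending its own named token, with the shared forms merged into one token each. The token names (ending_1sg, ending_3sg_2pl and so on) are made up for illustration, and the base and TOP tokens are copied from the question.

```raku
grammar Verb {
    token TOP    { <base> <ending> }
    token base   { geh | spiel | mach }
    token ending {
        | <ending_1sg>
        | <ending_2sg>
        | <ending_3sg_2pl>
        | <ending_1pl_3pl>
    }
    token ending_1sg { e }
    token ending_2sg { st }
    token ending_3sg_2pl {
        | t      # 3sg
        # | t    # 2pl (same literal, kept only as a note for the reader)
    }
    token ending_1pl_3pl {
        | en     # 1pl
        # | en   # 3pl (same literal, kept only as a note for the reader)
    }
}

for <gehe spielst machen> -> $verb {
    my $match = Verb.parse($verb);
    # Show which named ending token matched for each verb.
    say $verb, ' -> ', $match<ending>.hash.keys.join(', ');
}
```

The matched category is then available under a single name, for example $match<ending><ending_3sg_2pl>, without the parser having to try the same literal twice.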