I am writing lexer rules for a custom description language using pyLR1, which should include time literals such as:
10h30m # meaning 10 hours + 30 minutes
5m30s # meaning 5 minutes + 30 seconds
10h20m15s # meaning 10 hours + 20 minutes + 15 seconds
15.6s # meaning 15.6 seconds
The order of specification for hour, minute and second parts shall be fixed to h, m, s. To specify this in detail, I want the following valid combinations hms, hm, h, ms, m and s (with numbers between the different segments of course).
As a bonus, the regex should check for decimal (i.e. non-integer) numbers in the segments and allow them only in the least significant segment.
So I have for all but the last group a number match like:
([0-9]+)
And for the last group even:
([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?) # to allow for .5 and 0.5 and 5.0 and 5
Going through all the combinations of h, m and s, a cute little Python script gives me the following regex:
(([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)h|([0-9]+)h([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m|([0-9]+)h([0-9]+)m([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s|([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)m|([0-9]+)m([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s|([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)s)
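For readers who want to reproduce this, here is a minimal sketch of how such a generator script might look. The original script is not shown in the question, so the names INT, DEC and build_time_regex are assumptions made purely for illustration.

```python
import re

# Sub-patterns taken from the question: integers for all leading segments,
# an integer-or-decimal number for the last (least significant) segment.
INT = r"([0-9]+)"
DEC = r"([0-9]*\.[0-9]+|[0-9]+(\.[0-9]*)?)"


def build_time_regex(units="hms"):
    """Hypothetical helper: one alternative per contiguous run of units."""
    alternatives = []
    for start in range(len(units)):
        for end in range(start + 1, len(units) + 1):
            run = units[start:end]  # "h", "hm", "hms", "m", "ms", "s"
            # All segments but the last are integers; the last may be decimal.
            parts = [INT + unit for unit in run[:-1]] + [DEC + run[-1]]
            alternatives.append("".join(parts))
    return "(" + "|".join(alternatives) + ")"


TIME_RE = re.compile(build_time_regex())

for text in ("10h30m", "5m30s", "10h20m15s", "15.6s", "10h30s", "10.5h30m"):
    print(text, bool(TIME_RE.fullmatch(text)))
```

Using fullmatch here stands in for the lexer anchoring the token; with search or match a prefix such as the 10h in 10h30s would still be found.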
Obviously, this expression is a bit of a horror. Is there any way to simplify it? The answer must work with Python's re module; I will also accept answers that do not work with pyLR1 if that is due to its restricted subset of regular expressions.
You can factorise your regular expression. Using the notation h, m, s to denote each of the subregexes, the most basic version is:
h|hm|hms|ms|m|s
which is what you have currently. You can break this into:
(h|hm|hms)|(ms|m)|s
and then pulling out h from the first expression and m from the second we get (using (x|) == x?):
h(m|ms)?|ms?|s
Continuing on we get to
h(ms?)?|ms?|s
which is probably simpler (and probably the simplest).
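The factorisation can be sanity-checked by brute force. The sketch below treats h, m and s as literal letters standing in for the subregexes (an assumption made only for the check) and compares the set of strings each form accepts; the helper name accepted is hypothetical.

```python
import re
from itertools import product

FLAT = re.compile(r"h|hm|hms|ms|m|s")
FACTORED = re.compile(r"h(ms?)?|ms?|s")


def accepted(pattern, alphabet="hms", max_len=4):
    """Return every string over the alphabet (up to max_len) the pattern fully matches."""
    words = (
        "".join(letters)
        for length in range(max_len + 1)
        for letters in product(alphabet, repeat=length)
    )
    return {word for word in words if pattern.fullmatch(word)}


# Both forms accept exactly the same six combinations.
print(sorted(accepted(FLAT)))
print(accepted(FLAT) == accepted(FACTORED))
```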
Adding in the regex d to denote decimals (as in \.[0-9]+), this could be written as
h(d|m(d|sd?)?)?|m(d|sd?)?|sd?
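(i.e. at each stage, optionally have either decimals or a continuation to the next of h, m or s. Note that d belongs to the number of its segment, so in the actual regex it sits before the unit letter.)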
This would result in something like (for just hours and minutes):
[0-9]+((\.[0-9]+)?h|h[0-9]+(\.[0-9]+)?m)|[0-9]+(\.[0-9]+)?m
Looking at this, it might not be possible to get it into a form amenable to pyLR1, so parsing with decimals allowed in every position and then doing a secondary check might be the best approach.
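A minimal sketch of that two-step approach in plain Python follows. It is not pyLR1 code: the names NUM, LOOSE and parse_time are assumptions for illustration, and the non-capturing groups may need rewriting for a more restricted regex dialect.

```python
import re

# Step 1: a permissive pattern that allows a decimal number in every segment,
# following the h(ms?)?|ms?|s shape.
NUM = r"(?:[0-9]*\.[0-9]+|[0-9]+(?:\.[0-9]*)?)"
H, M, S = NUM + "h", NUM + "m", NUM + "s"
LOOSE = re.compile(f"{H}(?:{M}(?:{S})?)?|{M}(?:{S})?|{S}")


def parse_time(text):
    """Return [(value, unit), ...] or None if the literal is invalid."""
    if not LOOSE.fullmatch(text):
        return None
    segments = re.findall(r"([0-9.]+)([hms])", text)
    # Step 2: the secondary check. Only the last segment may be a decimal.
    if any("." in number for number, _ in segments[:-1]):
        return None
    return [(float(number), unit) for number, unit in segments]


print(parse_time("10h30m"))
print(parse_time("10.5h30m"))
```

The regex stays small because it no longer has to encode where decimals are allowed; that rule moves into ordinary code, where it is also easier to report a helpful error.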
the representation below should be understandable. I don't know the exact regex syntax you're using, so you will have to "translate" it to the valid syntax yourself.
your hours
[0-9]{1,2}h
your minutes
[0-9]{1,2}m
your seconds
[0-9]{1,2}(\.[0-9]{1,3})?s
you want all those in order, and able to omit any of those (wrap with ?)
([0-9]{1,2}h)?([0-9]{1,2}m)?([0-9]{1,2}(\.[0-9]{1,3})?s)?
this however matches things like: 10h30s
that is, the valid combinations become hms, hm, hs, h, ms, m and s
in other words, minutes can be omitted while hours and seconds are still present.
the other problem is that the empty string is matched, because all three groups are optional. so you have to work around this somehow. hmm
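Both problems are easy to reproduce in Python. This is only a quick check of the all-optional pattern above; the name ALL_OPTIONAL is an assumption for illustration.

```python
import re

ALL_OPTIONAL = re.compile(
    r"([0-9]{1,2}h)?([0-9]{1,2}m)?([0-9]{1,2}(\.[0-9]{1,3})?s)?"
)

# A wanted literal, the unwanted hs combination, and the empty string:
# the pattern fully matches all three.
for text in ("10h20m15s", "10h30s", ""):
    print(repr(text), bool(ALL_OPTIONAL.fullmatch(text)))
```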
looking at @dbaupp's h(ms?)?|ms?|s, you can take the patterns above and substitute:
h: [0-9]{1,2}h
m: [0-9]{1,2}m
s: [0-9]{1,2}(\.[0-9]{1,3})?s
so you get to:
h(ms?)?: [0-9]{1,2}h([0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?)?
ms? : [0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?
s : [0-9]{1,2}(\.[0-9]{1,3})?s
all those OR'd together give you a big but easy to break down regex:
[0-9]{1,2}h([0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?)?|[0-9]{1,2}m([0-9]{1,2}(\.[0-9]{1,3})?s)?|[0-9]{1,2}(\.[0-9]{1,3})?s
which gets rid of both the empty string problem and the match of hs.
looking at @Donal Fellows' comment on @dbaupp's answer, I'll also do (h?m)?S|h?M|H
(h?m)?s: (([0-9]{1,2}h)?[0-9]{1,2}m)?[0-9]{1,2}(\.[0-9]{1,3})?s
h?m : ([0-9]{1,2}h)?[0-9]{1,2}m
h : [0-9]{1,2}h
and merged together, you end up with something smaller than the above:
(([0-9]{1,2}h)?[0-9]{1,2}m)?[0-9]{1,2}(\.[0-9]{1,3})?s|([0-9]{1,2}h)?[0-9]{1,2}m|[0-9]{1,2}h
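To avoid assembling this by hand, the merged pattern can be built from the three pieces and checked in Python. A small sketch, where the names H, M, S and MERGED are assumptions for illustration:

```python
import re

H = r"[0-9]{1,2}h"
M = r"[0-9]{1,2}m"
S = r"[0-9]{1,2}(\.[0-9]{1,3})?s"

# (h?m)?s | h?m | h, with each letter replaced by its sub-pattern.
MERGED = re.compile(rf"(({H})?{M})?{S}|({H})?{M}|{H}")

for text in ("10h20m15s", "10h30m", "5m30s", "15.6s", "10h", "10h30s", "", ".5s"):
    print(repr(text), bool(MERGED.fullmatch(text)))
```

The last input shows the remaining gap: a literal such as .5s is still rejected because the seconds pattern requires at least one digit before the dot.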
now we have to find a way to match the .xx decimal representation
Here is a short Python expression that works:
(\d+h)?(\d+m)?(\d*\.\d+|\d+(\.\d*)?)(?(2)s|(?(1)m|[hms]))
Inspired by Cameron Martin's answer based on conditionals.
(\d+h)? # optional int "h" (capture 1)
(\d+m)? # optional int "m" (capture 2)
(\d*\.\d+|\d+(\.\d*)?) # int or decimal
(?(2) # if "m" (capture 2) was matched:
s # "s"
| (?(1) # else if "h" (capture 1) was matched:
m # "m"
| # else (nothing matched):
[hms])) # any of the "h", "m" or "s"
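The commented form can be used directly with re.VERBOSE. The sketch below is one way to compile and exercise it; the name TIME_RE and the choice of fullmatch (to anchor the whole literal) are assumptions, not part of the original answer.

```python
import re

TIME_RE = re.compile(
    r"""
    (\d+h)?                     # optional int "h" (capture 1)
    (\d+m)?                     # optional int "m" (capture 2)
    (\d*\.\d+|\d+(\.\d*)?)      # int or decimal
    (?(2)                       # if "m" (capture 2) was matched:
        s                       #   "s"
    |   (?(1)                   # else if "h" (capture 1) was matched:
            m                   #   "m"
        |                       # else (nothing matched):
            [hms]))             #   any of "h", "m" or "s"
    """,
    re.VERBOSE,
)

for text in ("10h30m", "5m30s", "10h20m15s", "15.6s", "10h", "10h30s", "10.5h30m"):
    print(text, bool(TIME_RE.fullmatch(text)))
```

Conditional groups of the form (?(id)yes|no) are a feature of Python's re module that many other regex dialects lack, so this form is unlikely to carry over to a restricted lexer generator.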