This arose from a discussion on formalizing regular expression syntax. I've seen this behavior with several regular expression parsers, hence I tagged it language-agnostic.
Take the following expression (adjust it for your favorite language):
replace("input", "(.*)*", "$1")
It returns an empty string. Why?
Even more curiously, the expression replace("input", "(.*)*", "A$1B") returns the string ABAB. Why the double empty match?
Disclaimer: I know about backtracking and greedy matches, but the rules laid out by Jeffrey Friedl seem to dictate that .* matches everything and that no further backtracking or matching is done. Then why is $1 empty?
Note: compare with (.+)*, which returns the input string. However, http://regexhero.com shows that there are still two matches, which seems odd for the same reasons as above.
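For anyone who wants to try this, here is one way the calls above might look in Python. It is only a sketch of the generic replace(...) from the question: Python's re.sub writes the group reference as \1 rather than $1, the variable names are made up for illustration, and Python 3.7 or later is assumed, because older versions do not replace an empty match that directly follows a non-empty one.

```python
import re

text = "input"

# Generic replace("input", "(.*)*", "$1"); the question reports an empty string.
star_plain = re.sub(r"(.*)*", r"\1", text)

# Generic replace("input", "(.*)*", "A$1B"); the question reports ABAB.
star_wrapped = re.sub(r"(.*)*", r"A\1B", text)

# Generic replace("input", "(.+)*", "$1"); the question reports the input string.
plus_plain = re.sub(r"(.+)*", r"\1", text)

print(repr(star_plain))
print(repr(star_wrapped))
print(repr(plus_plain))
```

Other engines may differ in how they treat an empty match at the end of the string, so the output is worth checking in whichever language you adapt this to.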
Let's see what happens:
1. (.*) matches "input".
2. "input" is captured into group 1.
3. (.*) is repeated, so another match attempt is made:
4. (.*) matches the empty string after "input".
5. The empty string is captured into group 1, overwriting "input".
6. $1 now contains the empty string.
A good question from the comments:
Then why does
replace("input", "(input)*", "A$1B")
return "AinputBAB"?
1. (input)* matches "input". It is replaced by "AinputB".
2. (input)* matches the empty string. It is replaced by "AB" ($1 is empty because it didn't participate in the match).
3. Result: "AinputBAB"
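The two walkthroughs can be checked by listing every match together with what group 1 holds when it finishes. The sketch below uses Python's re.finditer for that; the helper name trace is hypothetical, and other engines may report group state differently.

```python
import re


def trace(pattern, text):
    """Return (start, end, group 1) for each match found scanning left to right."""
    return [(m.start(), m.end(), m.group(1)) for m in re.finditer(pattern, text)]


# Two matches: the whole string, then the empty string at the end.
# For (.*)* group 1 was overwritten by the final empty iteration.
print(trace(r"(.*)*", "input"))

# For (input)* the second match never enters the group at all,
# which Python reports as None rather than as an empty string.
print(trace(r"(input)*", "input"))
```

In the first case group 1 is an empty string in both matches, which is why the replacement comes out as ABAB. In the second case the first match keeps "input" in group 1, and the trailing empty match leaves the group unset.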