As far I understood, on match failure, regex backtracks to the previous closest choice, chooses the next alternative and continues matching. Is it true that regex avoids backtracking, in case there is no possibility of a match ?
Here is a scenario, I am unable to explain to myself this regex101 debugger output.
Regex: ^(?<Major>0|[1-9][0-9]*)\.(?<Minor>0|[1-9][0-9]*)\.(?<Patch>0|[1-9][0-9]*)$
Input string: 1.222222222222.33333333333.0
Why does the above regex backtrack at the end, when there is no chance that $
will match an avoided [0-9]
? After the patch group is exhausted, will it avoid backtracking the ([1-9][0-9]*)
part for Minor and Major? If so, how and why?
If anyone has a better example to showcase how regex decides to backtrack or not, it would be appreciated.
Edit 1: why did Regex assumed that $ could be found by backtracking for [0-9]* ?
is this dependent on type of regex engine? if yes what should be best practice to avoid this kind of backtracking.
In theory, this is how the match occurs:
00000000001111111111222222222
01234567890123456789012345678
1.222222222222.33333333333.0
^
matches 0 chars at position 0.
0
doesn't match at position 0. ⇒ Backtrack[1-9]
matches 1 char at position 0.
[0-9]*
matches 0 chars at position 1.
\.
matches 1 chars at position 1.
0
doesn't match at position 2. ⇒ Backtrack[1-9]
matches 1 char at position 2.
[0-9]*
matches 11 chars at position 3.
\.
matches 1 chars at position 14.
0
doesn't match at position 15. ⇒ Backtrack[1-9]
matches 1 char at position 15.
[0-9]*
matches 10 chars at position 16.
$
doesn't match at position 26. ⇒ Backtrack[0-9]*
matches 9 chars at position 16.
$
doesn't match at position 25. ⇒ Backtrack[0-9]*
matches 1 char at position 16.
$
doesn't match at position 17. ⇒ Backtrack[0-9]*
matches 0 chars at position 16.
$
doesn't match at position 16. ⇒ Backtrack[0-9]*
matches 10 chars at position 3.
\.
doesn't match at position 13. ⇒ Backtrack[0-9]*
matches 9 chars at position 3.
\.
doesn't match at position 12. ⇒ Backtrack[0-9]*
matches 1 char at position 3.
\.
doesn't match at position 4. ⇒ Backtrack[0-9]*
matches 0 chars at position 3.
\.
doesn't match at position 3. ⇒ Backtrack^
doesn't match at position 1. ⇒ Backtrack^
doesn't match at position 2. ⇒ Backtrack^
doesn't match at position 27. ⇒ Backtrack^
doesn't match at position 28. ⇒ BacktrackIn practice, there can be optimizations as long as the result is the same.
For example, an optimizer could analyze the pattern and see that the string must contains exactly two .
to match, and start the matching process by counting the number of .
in the string.
Perl's engine doesn't go that far — the cost of performing such an analysis could eclipse the gains — but it does look for a .
before even starting to match.
$ perl -Mre=debug -e'
"1.222222222222.33333333333.0" =~
/^(?<Major>0|[1-9][0-9]*)\.(?<Minor>0|[1-9][0-9]*)\.(?<Patch>0|[1-9][0-9]*)$/
'
…
Matching REx "^(?<Major>0|[1-9][0-9]*)\.(?<Minor>0|[1-9][0-9]*)\.(?<Patch>"... against "1.222222222222.33333333333.0"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [1..25] gave 1
Found floating substr "." at offset 1 (rx_origin now 0)...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
0 <> <1.22222222> | 0| 1:SBOL /^/(2)
0 <> <1.22222222> | 0| 2:OPEN1 'Major'(4)
0 <> <1.22222222> | 0| 4:BRANCH (buf:1/1)(8)
0 <> <1.22222222> | 1| 6:EXACT <0>(14)
| 1| failed...
0 <> <1.22222222> | 0| 8:BRANCH (buf:1/1)(14)
0 <> <1.22222222> | 1| 10:ANYOFR[1-9](12)
1 <1> <.222222222> | 1| 12:STAR(14)
| 1| POSIXA[\d] can match 0 times out of 2147483647...
1 <1> <.222222222> | 2| 14:CLOSE1 'Major'(16)
1 <1> <.222222222> | 2| 16:EXACT <.>(18)
2 <1.> <2222222222> | 2| 18:OPEN2 'Minor'(20)
2 <1.> <2222222222> | 2| 20:BRANCH (buf:2/2)(24)
2 <1.> <2222222222> | 3| 22:EXACT <0>(30)
| 3| failed...
2 <1.> <2222222222> | 2| 24:BRANCH (buf:2/2)(30)
2 <1.> <2222222222> | 3| 26:ANYOFR[1-9](28)
3 <1.2> <2222222222> | 3| 28:STAR(30)
| 3| POSIXA[\d] can match 11 times out of 2147483647...
14 <22222> <.333333333> | 4| 30:CLOSE2 'Minor'(32)
14 <22222> <.333333333> | 4| 32:EXACT <.>(34)
15 <2222.> <3333333333> | 4| 34:OPEN3 'Patch'(36)
15 <2222.> <3333333333> | 4| 36:BRANCH (buf:3/3)(40)
15 <2222.> <3333333333> | 5| 38:EXACT <0>(46)
| 5| failed...
15 <2222.> <3333333333> | 4| 40:BRANCH (buf:3/3)(46)
15 <2222.> <3333333333> | 5| 42:ANYOFR[1-9](44)
16 <222.3> <3333333333> | 5| 44:STAR(46)
| 5| POSIXA[\d] can match 10 times out of 2147483647...
26 <3333333333> <.0> | 6| 46:CLOSE3 'Patch'(48)
26 <3333333333> <.0> | 6| 48:SEOL(49)
| 6| failed...
25 <333333333> <3.0> | 6| 46:CLOSE3 'Patch'(48)
25 <333333333> <3.0> | 6| 48:SEOL(49)
| 6| failed...
24 <33333333> <33.0> | 6| 46:CLOSE3 'Patch'(48)
24 <33333333> <33.0> | 6| 48:SEOL(49)
| 6| failed...
23 <3333333> <333.0> | 6| 46:CLOSE3 'Patch'(48)
23 <3333333> <333.0> | 6| 48:SEOL(49)
| 6| failed...
22 <333333> <3333.0> | 6| 46:CLOSE3 'Patch'(48)
22 <333333> <3333.0> | 6| 48:SEOL(49)
| 6| failed...
21 <33333> <33333.0> | 6| 46:CLOSE3 'Patch'(48)
21 <33333> <33333.0> | 6| 48:SEOL(49)
| 6| failed...
20 <33333> <333333.0> | 6| 46:CLOSE3 'Patch'(48)
20 <33333> <333333.0> | 6| 48:SEOL(49)
| 6| failed...
19 <.3333> <3333333.0> | 6| 46:CLOSE3 'Patch'(48)
19 <.3333> <3333333.0> | 6| 48:SEOL(49)
| 6| failed...
18 <2.333> <33333333.0> | 6| 46:CLOSE3 'Patch'(48)
18 <2.333> <33333333.0> | 6| 48:SEOL(49)
| 6| failed...
17 <22.33> <333333333.> | 6| 46:CLOSE3 'Patch'(48)
17 <22.33> <333333333.> | 6| 48:SEOL(49)
| 6| failed...
16 <222.3> <3333333333> | 6| 46:CLOSE3 'Patch'(48)
16 <222.3> <3333333333> | 6| 48:SEOL(49)
| 6| failed...
| 5| failed...
| 4| BRANCH failed...
| 3| failed...
| 2| BRANCH failed...
| 1| failed...
| 0| BRANCH failed...
Match failed
…
Also note how it results in less backtracking than the unoptimized approach above.
(Note that you used PCRE2 and not Perl on regex101.)
Is it true that regex avoids backtracking, in case there is no possibility of a match ?
The depends entirely on the engine, the string and the pattern.
why did Regex assumed that $ could be found by backtracking for [0-9]* ?
That question makes no sense. You actually want to know the following:
why didn't the Regex realize that $ couldn't be found by backtracking for [0-9]* ?
Still an invalid question. Failing to match doesn't backtrack through [0-9]*
. It backtracks through (?<Patch>A|B)
. If it was just [0-9]*
before $
, Perl's engine would actually be smart enough to skip the backtracking.
$ perl -Mre=debug -e'
"1.222222222222.33333333333.0" =~
/^(?<Major>0|[1-9][0-9]*)\.(?<Minor>0|[1-9][0-9]*)\.(?<Patch>0$|[1-9][0-9]*$)/
'
…
Matching REx "^(?<Major>0|[1-9][0-9]*)\.(?<Minor>0|[1-9][0-9]*)\.(?<Patch>"... against "1.222222222222.33333333333.0"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [1..25] gave 1
Found floating substr "." at offset 1 (rx_origin now 0)...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
0 <> <1.22222222> | 0| 1:SBOL /^/(2)
0 <> <1.22222222> | 0| 2:OPEN1 'Major'(4)
0 <> <1.22222222> | 0| 4:BRANCH (buf:1/1)(8)
0 <> <1.22222222> | 1| 6:EXACT <0>(14)
| 1| failed...
0 <> <1.22222222> | 0| 8:BRANCH (buf:1/1)(14)
0 <> <1.22222222> | 1| 10:ANYOFR[1-9](12)
1 <1> <.222222222> | 1| 12:STAR(14)
| 1| POSIXA[\d] can match 0 times out of 2147483647...
1 <1> <.222222222> | 2| 14:CLOSE1 'Major'(16)
1 <1> <.222222222> | 2| 16:EXACT <.>(18)
2 <1.> <2222222222> | 2| 18:OPEN2 'Minor'(20)
2 <1.> <2222222222> | 2| 20:BRANCH (buf:2/2)(24)
2 <1.> <2222222222> | 3| 22:EXACT <0>(30)
| 3| failed...
2 <1.> <2222222222> | 2| 24:BRANCH (buf:2/2)(30)
2 <1.> <2222222222> | 3| 26:ANYOFR[1-9](28)
3 <1.2> <2222222222> | 3| 28:STAR(30)
| 3| POSIXA[\d] can match 11 times out of 2147483647...
14 <22222> <.333333333> | 4| 30:CLOSE2 'Minor'(32)
14 <22222> <.333333333> | 4| 32:EXACT <.>(34)
15 <2222.> <3333333333> | 4| 34:OPEN3 'Patch'(36)
15 <2222.> <3333333333> | 4| 36:BRANCH (buf:3/3)(41)
15 <2222.> <3333333333> | 5| 38:EXACT <0>(40)
| 5| failed...
15 <2222.> <3333333333> | 4| 41:BRANCH (buf:3/3)(48)
15 <2222.> <3333333333> | 5| 43:ANYOFR[1-9](45)
16 <222.3> <3333333333> | 5| 45:STAR(47)
| 5| POSIXA[\d] can match 10 times out of 2147483647...
26 <3333333333> <.0> | 6| 47:SEOL(48)
| 6| failed...
| 5| failed...
| 4| BRANCH failed...
| 3| failed...
| 2| BRANCH failed...
| 1| failed...
| 0| BRANCH failed...
Match failed
…
is this dependent on type of regex engine?
Yes.
if yes what should be best practice to avoid this kind of backtracking.
Formulate the pattern so that it only backtracks on failure (except trivially to determine the correct alternate), and prevent backtracking. In Perl, preventing the backtracking can be achieved using the "possessive quantifier" (+
) or an "independent subexpression" ((?>PATTERN)
or (*atomic:PATTERN)
).
^
(?<Major> 0 | [1-9][0-9]*+ )
\.
(?<Minor> 0 | [1-9][0-9]*+ )
\.
(?<Patch> 0 | [1-9][0-9]*+ )
\n?+
\z
$ perl -Mre=debug -e'
"1.222222222222.33333333333.0" =~
/^(?<Major>0|[1-9][0-9]*+)\.(?<Minor>0|[1-9][0-9]*+)\.(?<Patch>0|[1-9][0-9]*+)\n?+\z/
'
…
Matching REx "^(?<Major>0|[1-9][0-9]*+)\.(?<Minor>0|[1-9][0-9]*+)\.(?<Patc"... against "1.222222222222.33333333333.0"
Intuit: trying to determine minimum start position...
doing 'check' fbm scan, [1..25] gave 1
Found floating substr "." at offset 1 (rx_origin now 0)...
(multiline anchor test skipped)
Intuit: Successfully guessed: match at offset 0
0 <> <1.22222222> | 0| 1:SBOL /^/(2)
0 <> <1.22222222> | 0| 2:OPEN1 'Major'(4)
0 <> <1.22222222> | 0| 4:BRANCH (buf:1/1)(8)
0 <> <1.22222222> | 1| 6:EXACT <0>(18)
| 1| failed...
0 <> <1.22222222> | 0| 8:BRANCH (buf:1/1)(18)
0 <> <1.22222222> | 1| 10:ANYOFR[1-9](12)
1 <1> <.222222222> | 1| 12:SUSPEND(18)
1 <1> <.222222222> | 2| 14:STAR(16)
| 2| POSIXA[\d] can match 0 times out of 2147483647...
1 <1> <.222222222> | 3| 16:SUCCEED(0)
| 3| SUCCEED: subpattern success...
1 <1> <.222222222> | 1| 18:CLOSE1 'Major'(20)
1 <1> <.222222222> | 1| 20:EXACT <.>(22)
2 <1.> <2222222222> | 1| 22:OPEN2 'Minor'(24)
2 <1.> <2222222222> | 1| 24:BRANCH (buf:2/2)(28)
2 <1.> <2222222222> | 2| 26:EXACT <0>(38)
| 2| failed...
2 <1.> <2222222222> | 1| 28:BRANCH (buf:2/2)(38)
2 <1.> <2222222222> | 2| 30:ANYOFR[1-9](32)
3 <1.2> <2222222222> | 2| 32:SUSPEND(38)
3 <1.2> <2222222222> | 3| 34:STAR(36)
| 3| POSIXA[\d] can match 11 times out of 2147483647...
14 <22222> <.333333333> | 4| 36:SUCCEED(0)
| 4| SUCCEED: subpattern success...
14 <22222> <.333333333> | 2| 38:CLOSE2 'Minor'(40)
14 <22222> <.333333333> | 2| 40:EXACT <.>(42)
15 <2222.> <3333333333> | 2| 42:OPEN3 'Patch'(44)
15 <2222.> <3333333333> | 2| 44:BRANCH (buf:3/3)(48)
15 <2222.> <3333333333> | 3| 46:EXACT <0>(58)
| 3| failed...
15 <2222.> <3333333333> | 2| 48:BRANCH (buf:3/3)(58)
15 <2222.> <3333333333> | 3| 50:ANYOFR[1-9](52)
16 <222.3> <3333333333> | 3| 52:SUSPEND(58)
16 <222.3> <3333333333> | 4| 54:STAR(56)
| 4| POSIXA[\d] can match 10 times out of 2147483647...
26 <3333333333> <.0> | 5| 56:SUCCEED(0)
| 5| SUCCEED: subpattern success...
26 <3333333333> <.0> | 3| 58:CLOSE3 'Patch'(60)
26 <3333333333> <.0> | 3| 60:SUSPEND(70)
26 <3333333333> <.0> | 4| 62:CURLY{0,1}(68)
| 4| EXACT <\n> can match 0 times out of 1...
26 <3333333333> <.0> | 5| 68:SUCCEED(0)
| 5| SUCCEED: subpattern success...
26 <3333333333> <.0> | 3| 70:EOS(71)
| 3| failed...
| 2| BRANCH failed...
| 1| BRANCH failed...
| 0| BRANCH failed...
Match failed
…
Backtracking is sometimes needed, but most backtracking can be eliminated in this manner.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With