
Variable-length lookbehind compiles if repetition is in parentheses. Why?

Tags: regex, php, pcre

Problem

PHP uses the PCRE regex library, which requires lookbehind assertions to have a fixed length and therefore does not support variable-length repetition inside them.

If repetition is in the lookbehind (e.g., (?<=\d+)), PHP will normally issue a warning like this:

Warning: preg_match_all(): Compilation failed: lookbehind assertion is not fixed length at offset 7 in lookbehind.php on line 10

However, I have found a case where compilation does not fail when I think it should.

These fail to compile, as expected:

  • /(?<=X*)a/
  • /(?<=X+)a/
  • /(?<=(X)*)a/

However, /(?<=(X)+)a/ does compile. This should be functionally equivalent to /(?<=(X){1,})a/, which also compiles. On the other hand, if I actually add an upper bound to that range (e.g., /(?<=(X){1,2})a/), that fails to compile. I think /(?<=(X)+)a/ and /(?<=(X){1,})a/ should also fail to compile, but they do not. Why not?
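
One way to check these claims on a given installation is to try compiling each pattern and record whether PCRE rejects it. The following is a minimal sketch rather than a definitive test: the helper name compiles() is hypothetical, and the results depend on the PCRE version PHP was built against, so no expected output is shown.

```php
<?php
// Hypothetical helper: returns true if PCRE accepts the pattern.
// preg_match() returns false (and raises a warning) when compilation fails.
function compiles(string $regex): bool
{
    // Suppress the warning; only the return value is needed here.
    return @preg_match($regex, '') !== false;
}

// The six patterns discussed in the question.
$patterns = [
    '/(?<=X*)a/',
    '/(?<=X+)a/',
    '/(?<=(X)*)a/',
    '/(?<=(X)+)a/',
    '/(?<=(X){1,})a/',
    '/(?<=(X){1,2})a/',
];

$results = [];
foreach ($patterns as $regex) {
    $results[$regex] = compiles($regex);
    printf("%-20s %s\n", $regex, $results[$regex] ? 'compiles' : 'fails to compile');
}
```

Running this against the PCRE version in question should show at a glance which of the six patterns that build accepts.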

Experimentation

Here's some code:

$str = 'aXaaXXaaaXXXaaaa';

$regex = '/(?<=((?:X)+))a+/';

preg_match_all($regex, $str, $matches, PREG_OFFSET_CAPTURE|PREG_SET_ORDER);
print_r($matches);

I've complicated the pattern slightly to add a capturing group around the multiple Xs. Here are my results:

Array (
    [0] => Array (
            [0] => Array (
                    [0] => aa
                    [1] => 2
                )
            [1] => Array (
                    [0] => X
                    [1] => 1
                )
        )
    [1] => Array (
            [0] => Array (
                    [0] => aaa
                    [1] => 6
                )
            [1] => Array (
                    [0] => X
                    [1] => 5
                )
        )
    [2] => Array (
            [0] => Array (
                    [0] => aaaa
                    [1] => 12
                )
            [1] => Array (
                    [0] => X
                    [1] => 11
                )
        )
)

It clearly matches the a's that follow X's, which is correct. However, subpattern 1 appears to match only one X, not all of them. If I add an a at the beginning of the lookbehind, so that it must match every X back to the preceding a, here are my results:

$regex = '/(?<=(a(?:X)+))a+/';

Array (
    [0] => Array (
            [0] => Array (
                    [0] => aa
                    [1] => 2
                )
            [1] => Array (
                    [0] => aX
                    [1] => 0
                )
        )
)

It only matches once (where there is only one X). Effectively, (X)+ and (X){1,} are being reduced to (X){1} (which is allowable due to its fixed length).
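
The reduction described above can be checked directly: if (X)+ really is being treated as (X){1}, then explicitly fixed-length patterns should reproduce the two result sets shown earlier. A small sketch, in which the variable names $single and $prefixed are illustrative and not part of the original code:

```php
<?php
$str = 'aXaaXXaaaXXXaaaa';
$flags = PREG_OFFSET_CAPTURE | PREG_SET_ORDER;

// Fixed-length stand-in for '/(?<=((?:X)+))a+/'.
preg_match_all('/(?<=((?:X){1}))a+/', $str, $single, $flags);

// Fixed-length stand-in for '/(?<=(a(?:X)+))a+/'.
preg_match_all('/(?<=(a(?:X){1}))a+/', $str, $prefixed, $flags);

print_r($single);
print_r($prefixed);
```

On the sample string, the first pattern yields three matches with subpattern 1 equal to a single X, and the second yields one match with subpattern 1 equal to aX at offset 0. That is the same as the output reported for the + versions.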

Conclusion

I hate to cry, "Bug!" as soon as I find something that doesn't do what I expect, but it sure seems like one. The pattern isn't rejected like I expect, and then it doesn't behave as I would expect it to even if it were a valid pattern.

So I ask:

  • Is there a valid reason why it should behave this way?
  • Why does this apply to + but not *?
  • Why do parentheses matter? X+ fails, but (X)+ is allowed.

Any insight is most appreciated. Thank you.

asked Jan 28 '26 18:01 by Wiseguy


2 Answers

It's not a PHP bug. If it is a bug (and it does look like one), it is a PCRE bug and should be reported there. However, first check the PCRE version in phpinfo() and compare it with the latest release. If it is not up to date, try running the same regexes directly in the latest PCRE before posting a bug report.
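
Besides phpinfo(), the PCRE version can also be read from a script. A minimal sketch, assuming only the predefined PCRE_VERSION constant (a version number followed by a release date); the variable name $pcreVersion is illustrative:

```php
<?php
// PCRE_VERSION holds the version number, a space, and the release date.
// The part before the first space is the version number itself.
$pcreVersion = explode(' ', PCRE_VERSION)[0];

echo "PCRE version string: ", PCRE_VERSION, "\n";
echo "PCRE version number: ", $pcreVersion, "\n";
```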

answered Jan 30 '26 11:01 by CJ Dennis


PCRE version 8.32-RC1 2012-08-08

re> /(?<=(X)+)a/
Failed: lookbehind assertion is not fixed length at offset 8
re>

It was probably a bug that has since been fixed. Please update to the latest PCRE.

By the way, you can use \K to emulate a variable-length lookbehind.
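
As a rough illustration of the \K suggestion, here is a sketch using the sample string from the question. \K resets the start of the reported match, so the X's are matched normally but left out of the result. The pattern below is an assumption about what was meant, not something taken from the question:

```php
<?php
$str = 'aXaaXXaaaXXXaaaa';

// X+ is matched normally, then \K discards it from the reported match,
// so only the a's that follow are returned. Group 1 still captures the X's.
preg_match_all('/(X+)\Ka+/', $str, $matches);

print_r($matches[0]); // full matches: aa, aaa, aaaa
print_r($matches[1]); // group 1: X, XX, XXX
```

Unlike a true lookbehind, the text before \K is consumed by the match, so results can differ when matches would otherwise overlap.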

answered Jan 30 '26 10:01 by dark100


