Capturing groups and lookarounds

Tags: python, regex

I want to know how capturing (or non-capturing) groups affect lookarounds in regex. Here are two examples:

test (?:(?!<start).)+

test (?!<start).+

I would appreciate it if anybody could explain in detail how the regex engine interprets both cases.

Pablo asked Jan 23 '26 15:01


1 Answer

  1. Lookarounds are zero-width. In that respect, it doesn't make much sense to wrap one in a capturing group on its own: the group captures nothing more interesting than an empty string (much like \b vs. (\b)). Edge cases involve back-referencing an optional group, but that isn't very interesting.
  2. Positive lookarounds - (?=...) and (?<=...) - can capture groups. For example, /(?=(\b\w+\b))/ produces empty matches, each with a non-empty group 1. Similarly, /(?<=(.))\1/ matches each character that directly follows an identical character.
  3. Negative lookarounds - (?!...) and (?<!...) - cannot capture groups. That makes sense when you think about it: a negative lookaround succeeds only when its subpattern fails to match, so there is nothing to capture. They can still use capturing groups internally, though. For example, ^(?!.*(.).*\1).*$ matches a line that does not contain any repeated character. Again, how \1 behaves outside the lookaround in that case is not particularly interesting.
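
To make points 2 and 3 concrete, here is a small sketch using Python's re module (the question is tagged python). The sample strings and variable names are mine, chosen only for illustration:

```python
import re

# Point 2: the lookahead consumes nothing, but the group inside it still captures.
words = re.findall(r'(?=(\b\w+\b))', 'foo bar')

# Point 2: a group captured inside a lookbehind can be back-referenced after it.
repeats = re.findall(r'(?<=(.))\1', 'aabbbc')

# Point 3: the negative lookahead uses group 1 internally to detect a repeated character.
no_dupes = re.compile(r'^(?!.*(.).*\1).*$')
unique = [s for s in ['abc', 'abca', 'world', 'hello'] if no_dupes.match(s)]

print(words)    # ['foo', 'bar']
print(repeats)  # ['a', 'b', 'b']
print(unique)   # ['abc', 'world']
```

Every match of the first pattern is empty, which is why findall reports only the contents of group 1.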

Now, to your example. I'm reading (?!<start) as the negative lookbehind (?<!start); as written, it is a negative lookahead for the literal text "<start". With that reading, the two patterns match different texts:

  1. (?<!start).+ - Check once that the match does not begin right after the text "start", and then match all remaining characters (of the line). Examples:

    1. Input "start1234end": matches the whole input - the starting position isn't after the word "start".
    2. Input "before123startAfter": suppose the previous match was "before123start" (from a different pattern that allows that). The next match cannot start right after "start", so the engine skips one character and matches "fter".
  2. (?:(?<!start).)+ - Here, the lookbehind assertion is repeated for every character (for intuition: if the group (?:...)+ is a loop, the assertion is inside the loop). A character will not be matched if it comes directly after the string "start":

    1. Input "start1234end": the first match is "start". The engine cannot match the following '1' because it comes directly after "start", so the match stops there. The next match is "234end".
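
The examples above can be reproduced in Python. This is a sketch that assumes the lookbehind form (?<!start) was intended; the variable names are hypothetical, and the "previous match" is simulated by starting the search at the index right after "start":

```python
import re

once = re.compile(r'(?<!start).+')          # lookbehind checked only where the match begins
per_char = re.compile(r'(?:(?<!start).)+')  # lookbehind checked before every character

text = 'start1234end'
once_matches = once.findall(text)
per_char_matches = per_char.findall(text)

# Simulate resuming right after "before123start". The lookbehind still sees the
# text before that index, so the match has to begin one character later.
resume_at = len('before123start')
resumed = once.search('before123startAfter', resume_at).group()

print(once_matches)      # ['start1234end']
print(per_char_matches)  # ['start', '234end']
print(resumed)           # 'fter'
```

The '1' never appears in the second result: it is the only character that sits directly after "start".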

Kobi answered Jan 26 '26 23:01


