Suppose we want to match every occurrence of one that lies between <out>...</out> in this text (option: dot matches all):
<out>hello!</out>
<nx1>home one</nx1>
<nx2>living</nx2>
<out>one text
text one continues
and at last here ends one</out>
<m2>dog one</m2>
<out>bye!</out>
Let's say we use this pattern:
one(?=(?:(?!<out>).)*</out>)
I would really appreciate it if someone could explain how the regex engine processes that pattern step by step, and where it is (its position in the original text) at each phase of processing, something like @Tim Pietzcker's helpful accepted explanation for this question: Regex - lookahead assertion.
Many tools exist to automatically explain what your regex does, character by character.
The idea behind it, though, is that you want to check that one is followed by </out> while forbidding entry into a new out tag: if a ...</out> follows and we haven't entered a new <out>...</out> structure on the way, we know we are already inside such a structure.
So the regex will match one if it is followed by </out> and if there's no <out> between the two.
The work is done by
(?:(?!<out>).)*
Here the . matches a character only if that character is not the < that begins <out>. So we can get as far as </out> only by consuming characters that are not this < followed by out>.
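To see which occurrences survive the lookahead, here is a minimal Python sketch run against the sample text from the question. It assumes Python's re module, where re.DOTALL plays the role of the "dot matches all" option; the helper line_of is a hypothetical name introduced only for this illustration.

```python
import re

# The sample text from the question, one tag per line.
TEXT = "\n".join([
    "<out>hello!</out>",
    "<nx1>home one</nx1>",
    "<nx2>living</nx2>",
    "<out>one text",
    "text one continues",
    "and at last here ends one</out>",
    "<m2>dog one</m2>",
    "<out>bye!</out>",
])

# re.DOTALL is the "dot matches all" option: "." may also cross newlines.
PATTERN = re.compile(r"one(?=(?:(?!<out>).)*</out>)", re.DOTALL)


def line_of(pos):
    """Return the 1-based line number of a character offset in TEXT."""
    return TEXT.count("\n", 0, pos) + 1


# Every place where the literal "one" occurs (the engine tries each of them).
all_lines = [line_of(m.start()) for m in re.finditer("one", TEXT)]

# The places where the lookahead also succeeds.
matched_lines = [line_of(m.start()) for m in PATTERN.finditer(TEXT)]

print(all_lines)      # [2, 4, 5, 6, 7]
print(matched_lines)  # [4, 5, 6]
```

The occurrences on lines 2 and 7 are rejected because, scanning forward from them, the tempered dot runs into an <out> before it has seen any </out>.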
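A speed improvement would be:
one(?=(?:[^<]++|<(?!/?out>))*+</out>)
Stepping inside the negative lookahead at each character greatly increases the cost of matching that character. Here [^<]++ runs directly up to the next suspicious <, and we perform the negative lookahead check only when we have to. Because a possessive loop never gives characters back, it has to stop in front of </out> itself, which is why the check is (?!/?out>) and why the first alternative requires at least one character. This needs an engine that supports possessive quantifiers.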
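As a rough check that a possessive variant finds the same occurrences as the tempered-dot pattern, the sketch below compares the two on the sample text. Two assumptions are worth flagging: Python's re module only accepts possessive quantifiers from version 3.11 on, so the hypothetical helper match_lines returns None when the pattern cannot be compiled, and the variant used here excludes both <out> and </out> in its negative lookahead so that the possessive loop stops in front of the closing tag.

```python
import re

# The sample text from the question, one tag per line.
TEXT = "\n".join([
    "<out>hello!</out>",
    "<nx1>home one</nx1>",
    "<nx2>living</nx2>",
    "<out>one text",
    "text one continues",
    "and at last here ends one</out>",
    "<m2>dog one</m2>",
    "<out>bye!</out>",
])

TEMPERED = r"one(?=(?:(?!<out>).)*</out>)"

# Assumption: the possessive loop halts before either <out> or </out>,
# because it cannot backtrack once it has consumed a character.
POSSESSIVE = r"one(?=(?:[^<]++|<(?!/?out>))*+</out>)"


def match_lines(pattern):
    """Return the 1-based lines where `pattern` matches in TEXT,
    or None if this Python version cannot compile the pattern."""
    try:
        compiled = re.compile(pattern, re.DOTALL)
    except re.error:
        # Python older than 3.11 rejects "++" and "*+".
        return None
    return [TEXT.count("\n", 0, m.start()) + 1 for m in compiled.finditer(TEXT)]


tempered_lines = match_lines(TEMPERED)
possessive_lines = match_lines(POSSESSIVE)

print(tempered_lines)
print(possessive_lines)
```

On an interpreter that supports possessive quantifiers both lists should be identical; the difference between the patterns is only in how much work the engine does to get there.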
Here's an automatically generated explanation of the pattern:
NODE                     EXPLANATION
--------------------------------------------------------------------------------
one                      'one'
--------------------------------------------------------------------------------
(?=                      look ahead to see if there is:
--------------------------------------------------------------------------------
  (?:                      group, but do not capture (0 or more
                           times (matching the most amount
                           possible)):
--------------------------------------------------------------------------------
    (?!                      look ahead to see if there is not:
--------------------------------------------------------------------------------
      <out>                    '<out>'
--------------------------------------------------------------------------------
    )                        end of look-ahead
--------------------------------------------------------------------------------
    .                        any character except \n
--------------------------------------------------------------------------------
  )*                       end of grouping
--------------------------------------------------------------------------------
  </out>                   '</out>'
--------------------------------------------------------------------------------
)                        end of look-ahead
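The same breakdown can be written as a commented pattern. The sketch below assumes Python's re.VERBOSE flag, which ignores whitespace and # comments inside the pattern, and it also shows why the "dot matches all" option from the question matters: the table describes the default dot, which stops at a newline. The variable names are hypothetical.

```python
import re

# The sample text from the question, one tag per line.
TEXT = "\n".join([
    "<out>hello!</out>",
    "<nx1>home one</nx1>",
    "<nx2>living</nx2>",
    "<out>one text",
    "text one continues",
    "and at last here ends one</out>",
    "<m2>dog one</m2>",
    "<out>bye!</out>",
])

ANNOTATED = r"""
    one                 # the literal text 'one'
    (?=                 # look ahead to see if there is:
        (?:             #   a non-capturing group, repeated 0 or more times:
            (?!<out>)   #     '<out>' must not start at this position
            .           #     then consume one character
        )*              #   end of group (greedy)
        </out>          #   the literal text '</out>'
    )                   # end of look-ahead
"""


def match_lines(flags):
    """Return the 1-based lines where ANNOTATED matches under `flags`."""
    compiled = re.compile(ANNOTATED, re.VERBOSE | flags)
    return [TEXT.count("\n", 0, m.start()) + 1 for m in compiled.finditer(TEXT)]


default_lines = match_lines(0)         # "." stops at a newline
dotall_lines = match_lines(re.DOTALL)  # "." crosses newlines

print(default_lines)  # [6]
print(dotall_lines)   # [4, 5, 6]
```

Without the option, only the occurrence that shares a line with </out> matches, because the lookahead cannot travel past the end of the line.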