I need to parse a string like this:
text0<%code0%>text1<%code1%><%%>text3
into two arrays. Every chunk is optional, so it can be just text or <%code%> or an empty string.
Taking out code is easy (if I'm not mistaken): <%(.*?)%>, but I need help with text as it has no such markers, unlike code.
Thanks!
Since a regular expression match must be contiguous (i.e. have no gaps), no single match can cover all the text outside the tags. However, you can still do it by combining regex with C#'s string facilities, like this:
var outside = string.Join("", Regex.Split(inputString, "<%.*?%>"));
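If you want the two arrays from the question rather than one joined string, a minimal sketch along the same lines is to keep the array returned by Regex.Split for the text and collect the code with Regex.Matches. The variable names are only illustrative, top-level statements (C# 9 or later) are assumed, and so is a tag that never spans a line break, since . does not match a newline by default.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "text0<%code0%>text1<%code1%><%%>text3";

// Text chunks: whatever is left between the tags.
// Regex.Split keeps empty entries, so adjacent tags produce "".
string[] text = Regex.Split(input, "<%.*?%>");

// Code chunks: capture group 1 of every tag match.
string[] code = Regex.Matches(input, "<%(.*?)%>")
    .Cast<Match>()
    .Select(m => m.Groups[1].Value)
    .ToArray();

Console.WriteLine(string.Join("|", text));  // text0|text1||text3
Console.WriteLine(string.Join("|", code));  // code0|code1|
```

The split pattern deliberately has no capture group: Regex.Split also returns the contents of any capture groups, which would mix the code back into the text array.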
If the inside of a tag can never contain a percent character, you can optimize your regex to avoid backtracking by using this expression instead:
<%[^%]*%>
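The restriction matters: as a quick check, the sketch below runs both patterns against a tag with and without a % inside. The sample strings are made up for illustration, and top-level statements (C# 9 or later) are assumed.

```csharp
using System;
using System.Text.RegularExpressions;

string plain = "a<%x%>b";
string withPercent = "a<%x % y%>b";   // hypothetical tag that contains a '%'

// Both patterns find the tag when it contains no '%'.
bool lazyPlain = Regex.IsMatch(plain, "<%.*?%>");
bool classPlain = Regex.IsMatch(plain, "<%[^%]*%>");

// Only the lazy pattern still finds the tag once a '%' appears inside it.
bool lazyPercent = Regex.IsMatch(withPercent, "<%.*?%>");
bool classPercent = Regex.IsMatch(withPercent, "<%[^%]*%>");

Console.WriteLine($"{lazyPlain} {classPlain} {lazyPercent} {classPercent}");
// True True True False
```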
This very simple regex will do :-) :-) (I'm being ironic: the regex is correct, but it is absolutely unreadable, and even a regex expert would probably need at least 10 minutes to comprehend it fully)
var rx = new Regex("((?<1>((?!<%).)+)|<%(?<2>((?!%>).)*)%>)*", RegexOptions.ExplicitCapture);
var res2 = rx.Match("text0<%code0%>text1<%code1%><%%>text3");
string[] text = res2.Groups[1].Captures.Cast<Capture>().Select(p => p.Value).ToArray();
string[] escapes = res2.Groups[2].Captures.Cast<Capture>().Select(p => p.Value).ToArray();
Remember that it requires RegexOptions.ExplicitCapture; without it, the unnamed parentheses would also be capturing groups.
The regex captures the pieces of the string outside <% %> in group 1 and the pieces inside <% %> in group 2. Each group is composed of multiple Captures.
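Put together as a small program, the snippet above gives the following for the sample string. This is only a runnable restatement of the code above, assuming top-level statements (C# 9 or later); the variable names m and c are mine.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var rx = new Regex("((?<1>((?!<%).)+)|<%(?<2>((?!%>).)*)%>)*", RegexOptions.ExplicitCapture);
Match m = rx.Match("text0<%code0%>text1<%code1%><%%>text3");

// Group 1 collects every run of text outside the tags.
string[] text = m.Groups[1].Captures.Cast<Capture>().Select(c => c.Value).ToArray();
// Group 2 collects the contents of every tag, including the empty <%%>.
string[] escapes = m.Groups[2].Captures.Cast<Capture>().Select(c => c.Value).ToArray();

Console.WriteLine(string.Join("|", text));     // text0|text1|text3
Console.WriteLine(string.Join("|", escapes));  // code0|code1|
```

Note that group 1 records no empty entry between the two adjacent tags, because ((?!<%).)+ needs at least one character, so the two arrays are not guaranteed to line up index by index.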
The explanation:
( ... )* The outer layer. Any number of captures are possible... So any number of "outside" and "inside" are possible
(?<1>((?!<%).)+) The capturing group 1, for the "outside"
| alternatively
<% An uncaptured <%
(?<2>((?!%>).)*) The capturing group 2, for the "inside"
%> An uncaptured %>
The capturing group 1:
(?<1> ... ) The name of the group (1)
and inside:
((?!<%).)+ Any character that isn't a < followed by a % (at least one character)
The capturing group 2:
(?<2> ... ) The name of the group (2)
and inside:
((?!%>).)* Any character that isn't a % followed by a > (can be empty)
Note that this regex will break badly if there is an unclosed <%! The problem is fixable with this regex:
var rx = new Regex("((?<1>((?!<%).)+)|<%(?<2>((?!<%|%>).)*)%>|(?<3><%.*))*", RegexOptions.ExplicitCapture);
and add
string[] errors = res2.Groups[3].Captures.Cast<Capture>().Select(p => p.Value).ToArray();
If errors isn't empty, there is an unclosed <%.
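To see the difference, the sketch below runs both regexes on an input whose last tag is never closed. The input string and the names strict and tolerant are made up for this illustration, and top-level statements (C# 9 or later) are assumed.

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

string input = "text0<%code0%>text1<%oops";   // hypothetical input, last tag never closed

var strict = new Regex("((?<1>((?!<%).)+)|<%(?<2>((?!%>).)*)%>)*", RegexOptions.ExplicitCapture);
var tolerant = new Regex("((?<1>((?!<%).)+)|<%(?<2>((?!<%|%>).)*)%>|(?<3><%.*))*", RegexOptions.ExplicitCapture);

// The first regex still "succeeds", but its match stops before the unclosed tag.
Match strictMatch = strict.Match(input);
bool strictCoversAll = strictMatch.Length == input.Length;

// The second regex consumes the unclosed tag into group 3.
Match m = tolerant.Match(input);
string[] errors = m.Groups[3].Captures.Cast<Capture>().Select(c => c.Value).ToArray();

Console.WriteLine(strictCoversAll);             // False
Console.WriteLine(string.Join("|", errors));    // <%oops
```

Comparing the match length with the input length is one way to notice the problem with the first regex; group 3 of the second regex additionally tells you where the unclosed tag starts.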
Now, if you want to have the captures sorted:
var captures = res2.Groups[1].Captures.Cast<Capture>().Select(p => new { Text = true, Index = p.Index, p.Value })
.Concat(res2.Groups[2].Captures.Cast<Capture>().Select(p => new { Text = false, Index = p.Index, p.Value }))
.OrderBy(p => p.Index)
.ToArray();
Every capture now has an Index, a Text flag that is true for text and false for an escape, and a Value that holds the text of the Capture.
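As a runnable sketch of the sorted version (same regex and sample string as above, top-level statements assumed, so C# 9 or later), printing each capture in document order:

```csharp
using System;
using System.Linq;
using System.Text.RegularExpressions;

var rx = new Regex("((?<1>((?!<%).)+)|<%(?<2>((?!%>).)*)%>)*", RegexOptions.ExplicitCapture);
Match m = rx.Match("text0<%code0%>text1<%code1%><%%>text3");

// Tag each capture with its kind, then merge both groups by position.
var captures = m.Groups[1].Captures.Cast<Capture>().Select(p => new { Text = true, Index = p.Index, p.Value })
    .Concat(m.Groups[2].Captures.Cast<Capture>().Select(p => new { Text = false, Index = p.Index, p.Value }))
    .OrderBy(p => p.Index)
    .ToArray();

foreach (var c in captures)
{
    string kind = c.Text ? "text" : "code";
    Console.WriteLine($"{c.Index} {kind} [{c.Value}]");
}
// 0 text [text0]
// 7 code [code0]
// 14 text [text1]
// 21 code [code1]
// 30 code []
// 32 text [text3]
```

Both Select calls build an anonymous type with the same property names, types and order, which is what lets Concat merge the two sequences.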