While building a simple regex I've found it to have a quite strange performance profile while input size increased.
Here is another really basic regex that has a similar behavior:
a+b
I've profiled it with a simple benchmark:
Regex regex = new Regex("a+b", RegexOptions.Compiled);
const int maxInputSize = 100;
const int n = 1000;
string input = "";
Stopwatch stopwatch = new Stopwatch();
for (int inputSize = 1; inputSize <= maxInputSize; ++inputSize)
{
input += 'a';
stopwatch.Restart();
for (int i = 0; i < n; ++i)
{
regex.Match(input);
}
stopwatch.Stop();
Console.WriteLine(stopwatch.Elapsed.Ticks);
}
It runs the regex on the strings "a", "aa", "aaa", ... and measure the time it took for each length of the string to do n matches.
I'm aware of the backtracking issue (e.g. if the regex was something like (a+a+)+b), but in this case even considering backtracking I expected a linear complexity.
As an example if we want to match n times 'a' here is the workflow I naively expected:
take first 'a'
take second 'a'
...
take last 'a'
ooops nothing more to take => backtracking
release one 'a' and try to match 'b', nothing => backtracking
...
release second 'a' and retry to match 'b', nothing => backtracking
release first 'a'
ooops we're back at the beginning => no match
So it should execute something like 2n operations.
(This documentation seems to confirm that complexity should be linear: http://msdn.microsoft.com/en-us/library/dsy130b4.aspx)
But instead I observe a quadratic complexity:

So my questions are:
Thanks in advance for any input.
The Regex.Match function searches for a substring match: the engine tries to match the expression starting at any index of the string, giving an O(n²) algorithm. You can probably achieve linear performance by anchoring the regex to the start of the string:
Regex regex = new Regex("^a+b$", RegexOptions.Compiled);
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With