.Net simple regex quadratic complexity

Question

While building a simple regex I've found it to have a quite strange performance profile while input size increased.

Here is another really basic regex that has a similar behavior:

a+b

I've profiled it with a simple benchmark:

Regex regex = new Regex("a+b", RegexOptions.Compiled);

const int maxInputSize = 100;
const int n = 1000;

string input = "";
Stopwatch stopwatch = new Stopwatch();
for (int inputSize = 1; inputSize <= maxInputSize; ++inputSize)
{
    input += 'a';

    stopwatch.Restart();
    for (int i = 0; i < n; ++i)
    {
        regex.Match(input);
    }
    stopwatch.Stop();

    Console.WriteLine(stopwatch.Elapsed.Ticks);
}

It runs the regex on the strings "a", "aa", "aaa", ... and measure the time it took for each length of the string to do n matches.

I'm aware of the backtracking issue (e.g. if the regex was something like (a+a+)+b), but in this case even considering backtracking I expected a linear complexity.

As an example if we want to match n times 'a' here is the workflow I naively expected:

take first 'a'
take second 'a'
...
take last 'a'
ooops nothing more to take => backtracking
release one 'a' and try to match 'b', nothing => backtracking
...
release second 'a' and retry to match 'b', nothing => backtracking
release first 'a'
ooops we're back at the beginning => no match

So it should execute something like 2n operations.

(This documentation seems to confirm that complexity should be linear: http://msdn.microsoft.com/en-us/library/dsy130b4.aspx)

But instead I observe a quadratic complexity:

enter image description here

So my questions are:

1) were my expectations about linear complexity unreasonable?
2) if yes, what am I missing about regex matching?
3) if no, is my benchmark flawed? why?
4) if my expectations and the benchmark are correct what could be the issue(s)?

Thanks in advance for any input.

Joni · Accepted Answer

The Regex.Match function searches for a substring match: the engine tries to match the expression starting at any index of the string, giving an O(n²) algorithm. You can probably achieve linear performance by anchoring the regex to the start of the string:

Regex regex = new Regex("^a+b$", RegexOptions.Compiled);

.Net simple regex quadratic complexity

Tags:

c#

.net

regex

time-complexity

backtracking

Pragmateek

1 Answers

Joni

Recent Activity

Donate For Us

.Net simple regex quadratic complexity

Tags:

c#

.net

regex

time-complexity

backtracking

Pragmateek

1 Answers

Joni

Related questions

Recent Activity

Donate For Us