Which has better performance when matching a string: testing against the regex /regex1|regex2|regex3/, or test one pattern after the other?

Question

Suppose I have a few million (possibly long) strings, and need to know if each of them contains any of these given patterns:

regex1
regex2
regex3
...

Performance wise, would it be better to:

Test each string against the "full" regular expression /regex1|regex2|regex3|.../, or
Test each string against regex1, if didn't match then test against regex2, and so on...?

I was wondering about this and, as my knowledge about regex implementations is very limited, I have no idea if these would output similar behavior or not.

Edit: I just did a quick benchmarking. Didn't think too much, just blurted out some code. Please point out anything that might be biasing the output.

This is JavaScript, and I did the test with Node.js.

Note: I tried running with 5 million strings and 500 regexes, but the process ran out of memory, so I lowered the numbers

"use strict";

var strMinSize      = 50;
var strMaxSize      = 500;
var howManyStrings  = 100000;  // hundred thousand
var howManyRegex    = 50;      // fifty

var possible = " ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

function makestr() {
    var text = "";
    var strSize = Math.floor(Math.random() * strMaxSize) + strMinSize;
    for (var i=0; i < strSize; i++) {
      text += possible.charAt(Math.floor(Math.random() * possible.length));
    }
    return text;
}

function makeregex() {
    var regexstr = "";
    var regexSize = Math.floor(Math.random() * 50) + 5;
    for (var i=0; i < regexSize; ++i) {
      regexstr += possible.charAt(Math.floor(Math.random() * possible.length));
    }
    return regexstr;
}

var stringList = [];
for (var i=0; i < howManyStrings; ++i) {
    stringList.push(makestr());
}
var regexList = [];
var fullRegex = ""; // aux to build the disjunction
for (var i=0; i < howManyRegex; ++i) {
    var localRegex = makeregex();
    regexList.push(new RegExp(localRegex));
    fullRegex += '|' + localRegex;
}
fullRegex = new RegExp(fullRegex.substr(1));

// let's do this...

for (var kase=1; kase < 10; ++kase) {
    // Test 1: one disjunction with every regex
    var time1 = 0;
    var time2 = 0;

    var start = new Date().getTime();
    stringList.forEach( function(str) {
      fullRegex.test(str);
    });
    var end = new Date().getTime();
    time1 = end - start;

    // Test 2: one regex at a time
    start = new Date().getTime();
    stringList.forEach( function(str) {
      regexList.every( function(rx) {
        if (rx.test(str)) {
          return false;
        } else {
          return true;
        }
      });
    });
    end = new Date().getTime();
    time2 = end - start;

    console.log(time1 + ";" + time2);
}

The running times were:

+--------+---------+
| Test 1 | Test 2  |
+--------+---------+
|   813  |  1817   |
|   558  |  1750   |
|   566  |  1756   |
|   558  |  1783   |
|   560  |  1755   |
|   559  |  1736   |
|   551  |  1749   |
|   552  |  1743   |
|   558  |  1746   |
+--------+---------+

So, as I suspected, the second alternative is way worse... But why so much?

Bohemian · Accepted Answer

One regex will always be faster, because each regex test requires a pass over the input, and even though the combined regex is (slightly) more complex than the individual expressions, it is still a constant time computation.

Expressing the problem using "big O" notation:

single regex evaluation at given location in input = O(1)
combined regex evaluation at given location in input = effectively O(1)
regex match on string = O(n) (where n = string length)

From these, we can say that individual passes for each term = O(n * k) where k is the number of regexes/terms, but one regex is O(n).

This is born out from your tests, which show roughly 3 times as slow for the separate regexes.

This all hinges on the premise the the combined regex is "about as fast" as a simple one. This is the case because the regex state engine is extremely efficient, reducing the execution time to practically the same for an simple alternation as a plain pattern. It is a little slower, but no where near slow enough to warrant separate passes for separate regexes, no matter how long the list of terms became.

Which has better performance when matching a string: testing against the regex /regex1|regex2|regex3/, or test one pattern after the other?

Tags:

regex

rgoliveira

1 Answers

Bohemian

Recent Activity

Donate For Us

Which has better performance when matching a string: testing against the regex /regex1|regex2|regex3/, or test one pattern after the other?

Tags:

regex

rgoliveira

1 Answers

Bohemian

Related questions

Recent Activity

Donate For Us