Logo Questions Linux Laravel Mysql Ubuntu Git Menu
 

Which has better performance when matching a string: testing against the regex /regex1|regex2|regex3/, or test one pattern after the other?

Tags:

regex

Suppose I have a few million (possibly long) strings, and need to know if each of them contains any of these given patterns:

  • regex1
  • regex2
  • regex3
  • ...

Performance wise, would it be better to:

  1. Test each string against the "full" regular expression /regex1|regex2|regex3|.../, or

  2. Test each string against regex1, if didn't match then test against regex2, and so on...?

I was wondering about this and, as my knowledge about regex implementations is very limited, I have no idea if these would output similar behavior or not.


Edit: I just did a quick benchmarking. Didn't think too much, just blurted out some code. Please point out anything that might be biasing the output.

This is JavaScript, and I did the test with Node.js.

Note: I tried running with 5 million strings and 500 regexes, but the process ran out of memory, so I lowered the numbers

"use strict";

var strMinSize      = 50;
var strMaxSize      = 500;
var howManyStrings  = 100000;  // hundred thousand
var howManyRegex    = 50;      // fifty

var possible = " ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";

function makestr() {
    var text = "";
    var strSize = Math.floor(Math.random() * strMaxSize) + strMinSize;
    for (var i=0; i < strSize; i++) {
      text += possible.charAt(Math.floor(Math.random() * possible.length));
    }
    return text;
}

function makeregex() {
    var regexstr = "";
    var regexSize = Math.floor(Math.random() * 50) + 5;
    for (var i=0; i < regexSize; ++i) {
      regexstr += possible.charAt(Math.floor(Math.random() * possible.length));
    }
    return regexstr;
}

var stringList = [];
for (var i=0; i < howManyStrings; ++i) {
    stringList.push(makestr());
}
var regexList = [];
var fullRegex = ""; // aux to build the disjunction
for (var i=0; i < howManyRegex; ++i) {
    var localRegex = makeregex();
    regexList.push(new RegExp(localRegex));
    fullRegex += '|' + localRegex;
}
fullRegex = new RegExp(fullRegex.substr(1));

// let's do this...

for (var kase=1; kase < 10; ++kase) {
    // Test 1: one disjunction with every regex
    var time1 = 0;
    var time2 = 0;

    var start = new Date().getTime();
    stringList.forEach( function(str) {
      fullRegex.test(str);
    });
    var end = new Date().getTime();
    time1 = end - start;

    // Test 2: one regex at a time
    start = new Date().getTime();
    stringList.forEach( function(str) {
      regexList.every( function(rx) {
        if (rx.test(str)) {
          return false;
        } else {
          return true;
        }
      });
    });
    end = new Date().getTime();
    time2 = end - start;

    console.log(time1 + ";" + time2);
}

The running times were:

+--------+---------+
| Test 1 | Test 2  |
+--------+---------+
|   813  |  1817   |
|   558  |  1750   |
|   566  |  1756   |
|   558  |  1783   |
|   560  |  1755   |
|   559  |  1736   |
|   551  |  1749   |
|   552  |  1743   |
|   558  |  1746   |
+--------+---------+

So, as I suspected, the second alternative is way worse... But why so much?

like image 893
rgoliveira Avatar asked Jan 24 '26 21:01

rgoliveira


1 Answers

One regex will always be faster, because each regex test requires a pass over the input, and even though the combined regex is (slightly) more complex than the individual expressions, it is still a constant time computation.

Expressing the problem using "big O" notation:

  • single regex evaluation at given location in input = O(1)
  • combined regex evaluation at given location in input = effectively O(1)
  • regex match on string = O(n) (where n = string length)

From these, we can say that individual passes for each term = O(n * k) where k is the number of regexes/terms, but one regex is O(n).

This is born out from your tests, which show roughly 3 times as slow for the separate regexes.

This all hinges on the premise the the combined regex is "about as fast" as a simple one. This is the case because the regex state engine is extremely efficient, reducing the execution time to practically the same for an simple alternation as a plain pattern. It is a little slower, but no where near slow enough to warrant separate passes for separate regexes, no matter how long the list of terms became.

like image 178
Bohemian Avatar answered Jan 26 '26 18:01

Bohemian