Suppose I have a few million (possibly long) strings, and need to know if each of them contains any of these given patterns:
regex1regex2regex3...Performance wise, would it be better to:
Test each string against the "full" regular expression /regex1|regex2|regex3|.../, or
Test each string against regex1, if didn't match then test against regex2, and so on...?
I was wondering about this and, as my knowledge about regex implementations is very limited, I have no idea if these would output similar behavior or not.
Edit: I just did a quick benchmarking. Didn't think too much, just blurted out some code. Please point out anything that might be biasing the output.
This is JavaScript, and I did the test with Node.js.
Note: I tried running with 5 million strings and 500 regexes, but the process ran out of memory, so I lowered the numbers
"use strict";
var strMinSize = 50;
var strMaxSize = 500;
var howManyStrings = 100000; // hundred thousand
var howManyRegex = 50; // fifty
var possible = " ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789";
function makestr() {
var text = "";
var strSize = Math.floor(Math.random() * strMaxSize) + strMinSize;
for (var i=0; i < strSize; i++) {
text += possible.charAt(Math.floor(Math.random() * possible.length));
}
return text;
}
function makeregex() {
var regexstr = "";
var regexSize = Math.floor(Math.random() * 50) + 5;
for (var i=0; i < regexSize; ++i) {
regexstr += possible.charAt(Math.floor(Math.random() * possible.length));
}
return regexstr;
}
var stringList = [];
for (var i=0; i < howManyStrings; ++i) {
stringList.push(makestr());
}
var regexList = [];
var fullRegex = ""; // aux to build the disjunction
for (var i=0; i < howManyRegex; ++i) {
var localRegex = makeregex();
regexList.push(new RegExp(localRegex));
fullRegex += '|' + localRegex;
}
fullRegex = new RegExp(fullRegex.substr(1));
// let's do this...
for (var kase=1; kase < 10; ++kase) {
// Test 1: one disjunction with every regex
var time1 = 0;
var time2 = 0;
var start = new Date().getTime();
stringList.forEach( function(str) {
fullRegex.test(str);
});
var end = new Date().getTime();
time1 = end - start;
// Test 2: one regex at a time
start = new Date().getTime();
stringList.forEach( function(str) {
regexList.every( function(rx) {
if (rx.test(str)) {
return false;
} else {
return true;
}
});
});
end = new Date().getTime();
time2 = end - start;
console.log(time1 + ";" + time2);
}
The running times were:
+--------+---------+
| Test 1 | Test 2 |
+--------+---------+
| 813 | 1817 |
| 558 | 1750 |
| 566 | 1756 |
| 558 | 1783 |
| 560 | 1755 |
| 559 | 1736 |
| 551 | 1749 |
| 552 | 1743 |
| 558 | 1746 |
+--------+---------+
So, as I suspected, the second alternative is way worse... But why so much?
One regex will always be faster, because each regex test requires a pass over the input, and even though the combined regex is (slightly) more complex than the individual expressions, it is still a constant time computation.
Expressing the problem using "big O" notation:
From these, we can say that individual passes for each term = O(n * k) where k is the number of regexes/terms, but one regex is O(n).
This is born out from your tests, which show roughly 3 times as slow for the separate regexes.
This all hinges on the premise the the combined regex is "about as fast" as a simple one. This is the case because the regex state engine is extremely efficient, reducing the execution time to practically the same for an simple alternation as a plain pattern. It is a little slower, but no where near slow enough to warrant separate passes for separate regexes, no matter how long the list of terms became.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With