I want to match URLs with a common prefix (/files) but only two specific directories - images and videos - under it. I can think of two regular expressions for this:
/files/(images/.*|videos/.*)
and
/files/(image|video)s/.*
I have two questions here:
The performance matters to me since I'll be using it to match billions of string. So, any tiny bit of improvement also matters to me.
Which one is better from a performance perspective? My guess is the second one since its DFA will have a smaller number of states.
Both of the expressions have the same number of states in the minimal DFA, and their DFAs matches the same "language" (in theoretical sense).
Regardless of the number of states in the DFA, the performance is theoretically the same, since you will traverse through the states deterministically as you feed the input to the automaton.
In practice, there might be difference due to cache miss, which may happen more often when there are more states. However, unless you are working on the regex engine, I can't think of any good reason to spend time doing black-box optimization.
Is there a general purpose programming language whose built-in regex compiler will reduce the given regex into the minimal DFA?
Go (re2) and Haskell has engines that converts regex into DFA. I don't know whether the DFA is minimal or not, though.
Since POSIX ERE doesn't support back-reference (back-reference is different from reference to capturing group in replacement), it is possible to write an engine for POSIX ERE that runs on DFA or efficient NFA-simulation. However, since the standard doesn't mandate such implementation as long as the result is correct (match the left-most longest string), the implementation can exhaustively search for all strings that matches the regex on an NFA backtracking engine.
However, at least GNU egrep seems to implements POSIX ERE with DFA (based on the file name dfa.c).
For your information, there are 3 approaches to implement a regular expression matching:
For more details, this article (cited in this question) explains:
By the way, re2 engine (mentioned above) includes implementation of DFA-based and NFA-based (efficient simulation) matching algorithm.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With