I'm trying to implement a lexer for fun. I have already implemented a basic regular expression matcher(by first converting a pattern to a NFA and then to a DFA). Now I'm clueless about how to proceed. 
My lexer would be taking a list of tokens and their corresponding regexs. What is the general algorithm used to create a lexer out of this? 
I thought about (OR)ing all the regex, but then I can't identify which specific token was matched. Even if I extend my regex module to return the pattern matched when a match is successful, how do I implement lookahead in the matcher?
Assuming you have a working regex, regex_match which returns a boolean (True if a string satisfies the regex). First, you need to  have an ordered list of tokens (with regex for each) tokens_regex, the order being important as order will prescribe precedence.
One algorithm could be (this is not necessarily the only one):
next_token which takes a string, and returns the first token, its value and the remaining string (or - if an illegal/ignore character - None, the offending character and the remaining string). Note: this has to respect precedence, and should find the longest token.
lex which recursively calls next_token..
Something like this (written in Python):
tokens_regex = [ (TOKEN_NAME, TOKEN_REGEX),...] #order describes precedence
def next_token( remaining_string ):
    for t_name, t_regex in tokens_regex: # check over in order of precedence
        for i in xrange( len(remaining_string), 0, -1 ): #check longest possibilities first (there may be a more efficient method).
            if regex_match( remaining_string[:i], t_regex ):
                return t_name, remaining_string[:i], remaining_string[i:]
    return None, remaining_string[0], remaining_string[1:] #either an ignore or illegal character
def lex( string ):
    tokens_so_far = []
    remaining_string = string
    while len(remaining_string) > 0:
        t_name, t_value, string_remaining = next_token(remaining_string)
        if t_name is not None:
            tokens_so_far.append(t_name, t_value)
        #elif not regex_match(t_value,ignore_regex):
            #check against ignore regex, if not in it add to an error list/illegal characters
   return tokens_so_far
Some things to add to improve your lexer: ignore regex, error lists and locations/line numbers (for these errors or for tokens).
Have fun! And good luck making a parser :).
I've done pretty much the same thing. The way I did it was to combine all the expressions in one pretty big NFA and converted that same thing into one DFA. When doing that keep track of the states that previously were accepting states in each corresponding original DFA and their precedence.
The generated DFA will have many states that are accepting states. You run this DFA until it recieves a character that it has no corresponding transitions for. If the DFA is then in an accepting state you will then look at which of your original NFAs that had that accepting state in them. The one that has the highest precedence is the token you're going to return.
This does not handle regular expression lookaheads. These are typically not really needed for lexer work anyway. That would be the job of the parser.
Such a lexer runs in much the same speed as a single regular expression since there is basically only one DFA for it to run. You can omit converting the NFA altogether for a faster-to-construct algorithm but slower to run. The algorithm is basically the same.
The source code for the lexer I wrote is freely available on github, should you want to see how I did it.
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With