
Is there an option for flex to match whole words only?

I'm writing a lexer and I'm using Flex to generate it based on custom rules.

I want to match identifiers of sorts that start with a letter and then can have either letters or numbers. So I wrote the following pattern for them:

[[:alpha:]][[:alnum:]]*

It works, in that the generated lexer recognizes the pattern, but it does not match whole words only: it matches every occurrence of the pattern, including ones embedded in a longer run of characters.

So, for example, it matches the input "Test" and also the "Test" inside "9Test" (after reporting the initial 9 separately).

Consider the following simple lexer that accepts IDs as described above:

%{
#include <stdio.h>

#define LINE_END 1
#define ID       2

%}

/* Flex options: */
%option noinput
%option nounput
%option noyywrap
%option yylineno

/* Definitions: */
WHITESPACE  [ \t]
BLANK       {WHITESPACE}+
NEW_LINE    "\n"|"\r\n"
ID          [[:alpha:]][[:alnum:]_]*

%%

{NEW_LINE}        {printf("New line.\n"); return LINE_END;}
{BLANK}           {/* Blanks are skipped */}
{ID}              {printf("ID recognized: '%s'\n", yytext); return ID;}
.                 {fprintf(stderr, "ERROR: Invalid input in line %d: \"%s\"\n", yylineno, yytext);}

%%

int main(int argc, char **argv) {
   while (yylex() != 0);
   return 0;
}

When compiled and fed the following input, it produces the output below:

Input:

Test
9Test

Output:

Test
ID recognized: 'Test'
New line.
9Test
ERROR: Invalid input in line 2: "9"
ID recognized: 'Test'
New line.

Is there a way to make flex match only whole words (i.e. delimited by either blanks or custom delimiters like '(' ')' for example)?

Because I could write a rule that excludes IDs that start with numbers, but what about the ones that start with symbols like "$Test" or "&Test"? I don't think I can enumerate all of the possible symbols.

Following the example above, the desired output would be:

Test
ID recognized: 'Test'
New line.
9Test
ERROR: Invalid input in line 2: "9Test"
New line.

asked Dec 04 '25 19:12 by James Russell


1 Answer

You seem to be asking two questions at once.

  1. 'Whole word' isn't a recognized construct in programming languages. The lexical structure and the grammar of your language are already defined. Just implement them.

  2. The best way to handle illegal or unexpected characters in flex is not to handle them specially at all. Return them to the parser, just as you would for a special character. The parser can then deal with them and attempt recovery by discarding input.

Place this as your final rule:

. return yytext[0];

answered Dec 07 '25 17:12 by user207421
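
If the lexer itself should report "9Test" as a single invalid token, as in the desired output in the question, one possible sketch (not part of the answer above) relies on flex's matching rules: the longest match wins, and on a tie the rule listed first wins. The name WORD and its delimiter set (blanks, line breaks, '(' and ')') are assumptions made for illustration; adjust them to the real delimiters of your language. Only the changed definition and the rules section are shown, the rest of the lexer from the question stays as it is.

```lex
/* Assumption: a "word" is any run of characters that are not
   blanks, line breaks, or the delimiters '(' and ')'. */
WORD        [^ \t\r\n()]+

%%

{NEW_LINE}        { printf("New line.\n"); return LINE_END; }
{BLANK}           { /* Blanks are skipped */ }
{ID}              { printf("ID recognized: '%s'\n", yytext); return ID; }
{WORD}            { /* Reached only when the whole word is not a valid ID */
                    fprintf(stderr, "ERROR: Invalid input in line %d: \"%s\"\n", yylineno, yytext); }
.                 { /* Delimiters and stray characters go to the caller */
                    return yytext[0]; }

%%
```

For "Test", {ID} and {WORD} match the same four characters, so {ID} wins because it comes first. For "9Test" or "$Test", {ID} cannot match at the start of the word and {WORD} consumes the whole word. For "Test$", {WORD} matches five characters against four for {ID}, so the longer match wins and the entire word is reported as invalid.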


