I recently revisited an old assignment that I never got working, and I am still curious why it fails.
The assignment was to write lex regular expressions (compiled to C) that validate both expressions and assignments. Assignments validate correctly, but the lexer says every expression is invalid. I've been trying to figure out what is wrong with my code, but I hit a dead end. Can anyone help me?
Code, named a3.l:
/*regular definitions*/
id [a-zA-Z]+[a-zA-Z0-9]*[ \t]
op [-|+|"*"|"/"|%][ \t]
equ [=][ \t]
expr {id}{op}{id}({op}{id})*\n
assmt {id}{equ}({id}{op}{id}({op}{id})*)[;]\n
%%
[\n] printf("\nInvalid input\n");
{expr} printf("%sLegal expression \n", yytext);
{assmt} printf("%sLegal assignment \n", yytext);
Instructions for compiling and testing solution
lex a3.l #create the lex.yy.c file to validate expressions
gcc lex.yy.c -lfl -o a3 #compile lex.yy.c file to a program called a3
./a3 < in.txt > out.txt #execute a3 to read in.txt and throw validation results in out.txt
These are the contents of in.txt:
good = one1 + two2 - three3 / four4 ;
good = one1 / two2 * three3 ;
good = one1 * two2 + three3 ;
good = ONE + twenty - three3 ;
good = old * thirty2 / b567 ;
good * i8766e98e + bignum
good % a4 + bignum
good * one - two2 / three3
bad = = one1 + two2 - three3 / four4 ;
bad = one + two2 - three3 / four4
bad = one + - two2 - three3 / four4 ;
bad = one + two2 ? three3 / four4 ;
bad = 4 + ( one1 * two2 ) * ( three3 + four4 ;
bad = one1 + 24 - three3 ;
bad +- delta
bad / min = fourth ;
bad = a + b
bad = a ! b
bad = 2two + 3three ;
bad * 2two + 3three
good + two
bad + notgood ;
And this is the result after validating, out.txt:
good = one1 + two2 - three3 / four4 ;
Legal assignment
good = one1 / two2 * three3 ;
Legal assignment
good = one1 * two2 + three3 ;
Legal assignment
good = ONE + twenty - three3 ;
Legal assignment
good = old * thirty2 / b567 ;
Legal assignment
good * i8766e98e + bignum
Invalid input
good % a4 + bignum
Invalid input
good * one - two2 / three3
Invalid input
bad = = one1 + two2 - three3 / four4 ;
Invalid input
bad = one + two2 - three3 / four4
Invalid input
bad = one + - two2 - three3 / four4 ;
Invalid input
bad = one + two2 ? three3 / four4 ;
Invalid input
bad = 4 + ( one1 * two2 ) * ( three3 + four4 ;
Invalid input
bad = one1 + 24 - three3 ;
Invalid input
bad +- delta
Invalid input
bad / min = fourth ;
Invalid input
bad = a + b
Invalid input
bad = a ! b
Invalid input
bad = 2two + 3three ;
Invalid input
bad * 2two + 3three
Invalid input
good + two
Invalid input
bad + notgood ;
Invalid input
The whitespace handling is brittle. Each token (with the exception of ;) is expected to end with exactly one space or tab ([ \t]). Expressions end with an id, which in turn must end with a space or tab, and you don't have those in the examples. If you instead look for [ \t]* at the end of id, op, and equ, so as to accept zero or more whitespace characters, it'll work.
This doesn't show up for assignments because those end with a semicolon literal. At least, it doesn't show up in the examples. Add or remove a space somewhere and those will fail, too.
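To see the effect of that change outside of lex, here is a minimal sketch in C that approximates the expr pattern with POSIX regcomp/regexec. The macro names and the matches helper are made up for illustration, the op class is simplified to [-+*/%], and POSIX extended regular expressions are not identical to lex patterns, so treat it as an approximation rather than a drop-in replacement for a3.l.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* Original definitions: every token must end with exactly one space or tab. */
#define ID_STRICT "[a-zA-Z]+[a-zA-Z0-9]*[ \t]"
#define OP_STRICT "[-+*/%][ \t]"

/* Suggested change: zero or more trailing spaces or tabs. */
#define ID_LOOSE "[a-zA-Z]+[a-zA-Z0-9]*[ \t]*"
#define OP_LOOSE "[-+*/%][ \t]*"

/* expr = id op id (op id)* newline, anchored to the whole line. */
#define EXPR(ID, OP) "^" ID OP ID "(" OP ID ")*\n$"
#define EXPR_STRICT EXPR(ID_STRICT, OP_STRICT)
#define EXPR_LOOSE  EXPR(ID_LOOSE, OP_LOOSE)

/* Returns 1 if the whole line matches the pattern, 0 otherwise. */
int matches(const char *pattern, const char *line)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    assert(rc == 0); /* the patterns above are fixed, so this should hold */
    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}
```

With these definitions, matches(EXPR_STRICT, "good % a4 + bignum\n") is 0, because bignum is followed directly by the newline, while matches(EXPR_LOOSE, "good % a4 + bignum\n") is 1. Note that the loose version also accepts input with no spaces at all, such as "a+b\n".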
Well, firstly, you are mixing tokenization and syntax a bit here. Part of the job you are doing with lex should belong to yacc or bison. I guess you know that, but I think it is worth writing down.
Especially since it is related to your problem: that is the reason, I guess, you ended up adding spaces to your token definitions, especially at the end of the id definition.
Which kind of works for assignments.
good = one1 * two2 + three3 ;
is an assignment, since it is made of
«good »«= »«one1 »«* »«two2 »«+ »«three3 »«;»«\n»
which are id, equ, id, op, id, op, id, ; and \n, so a valid assignment.
But not for expressions, because of that same space.
good % a4 + bignum
Let's split that into tokens. (I am not really supposed to do that, because lex has no hierarchy of tokens the way a syntax has; it is "flattened". It is only because I know your question is "why is this not an expression" that I can reason with the subtokens «id», «op», etc. Nothing prevents some other token rule from making go a token and od % a another. Yet it is simple enough to "hand debug" how the line would be split into "subtokens" if it were an expression, even though the very term "subtoken" gives the wrong impression about lex.)
So, as I was saying, if you split that into tokens, you get:
«good »«% »«a4 »«+ »«bignum»«\n»
which is id, op, id, op, b, i, g, n, u, m, \n (there is no token that can describe bignum without a space after it).
Which is not an expr.
Short answer: your expressions end with an "id without an ending space", and they need to end with an id (that is, with an ending space).
Longer answer (if the short answer were the only one, this would almost be "not reproducible or caused by a typo"): you should not be analyzing syntax with lex. There is no hierarchy of tokens in lex.
Sure, you could correct this by defining an id token without the ending space, and then adding the space explicitly after each use of id in expr and assmt, except for the last id of expr. But that is really torturing yourself by using the wrong tool for the task (and ending up with a very strict language, where there must be one and only one space between words, etc.).
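For what it is worth, that workaround can be sketched in C with POSIX regcomp/regexec standing in for lex. Everything named here (the RX_ macros, the line_kind enum, classify) is hypothetical, the op class is simplified to [-+*/%], and POSIX extended regular expressions only approximate lex patterns, so this is an illustration of the idea and not a corrected a3.l.

```c
#include <assert.h>
#include <regex.h>
#include <stdio.h>

/* id no longer includes a trailing space; the space is written out
   explicitly wherever another token follows. */
#define RX_ID   "[a-zA-Z][a-zA-Z0-9]*"
#define RX_SP   "[ \t]"
#define RX_OP   "[-+*/%]" RX_SP
#define RX_EQU  "=" RX_SP
#define RX_TAIL RX_ID "(" RX_SP RX_OP RX_ID ")*"   /* id (sp op id)* */

#define RX_EXPR  "^" RX_ID RX_SP RX_OP RX_TAIL "\n$"
#define RX_ASSMT "^" RX_ID RX_SP RX_EQU RX_ID RX_SP RX_OP RX_TAIL RX_SP ";\n$"

enum line_kind { INVALID, EXPRESSION, ASSIGNMENT };

/* Returns 1 if the whole line matches the pattern, 0 otherwise. */
static int full_match(const char *pattern, const char *line)
{
    regex_t re;
    int rc = regcomp(&re, pattern, REG_EXTENDED | REG_NOSUB);
    assert(rc == 0); /* the patterns above are fixed, so this should hold */
    rc = regexec(&re, line, 0, NULL, 0);
    regfree(&re);
    return rc == 0;
}

/* Classifies one input line, newline included. */
enum line_kind classify(const char *line)
{
    if (full_match(RX_EXPR, line))
        return EXPRESSION;
    if (full_match(RX_ASSMT, line))
        return ASSIGNMENT;
    return INVALID;
}
```

Here classify("good % a4 + bignum\n") yields EXPRESSION and classify("good = one1 * two2 + three3 ;\n") yields ASSIGNMENT, but the language stays as strict as described: a line with two spaces between words, such as "good  + two\n", is INVALID.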