I am reading a book (Programming: Principles and Practice Using C++ by Bjarne Stroustrup) in which he introduces tokens:
“A token is a sequence of characters that represents something we consider a unit, such as a number or an operator. That’s the way a C++ compiler deals with its source. Actually, “tokenizing” in some form or another is the way most analysis of text starts.”
class Token {
public:
    char kind;     // what kind of token
    double value;  // for numbers: a value
};
I get what they are in general, but he never explains them in detail, and it's quite confusing to me.
Tokenizing is important to the process of figuring out what a program does. What Bjarne is referring to in relation to C++ source is how a program's meaning is affected by the tokenization rules. In particular, we must know what the tokens are and how they are determined: how can we identify a single token when it appears next to other characters, and how should we delimit tokens if there is ambiguity? For example, the five characters 1+2*3 are grouped into five tokens: the numbers 1, 2, and 3 and the operators + and *.
For instance, consider the prefix operators ++ and +. Let's assume we only had one token, +, to work with. What is the meaning of the following snippet?
int i = 1;
++i;
With + only, is the above going to apply unary + on i twice? Or is it going to increment it once? It's ambiguous, naturally. We need an additional token, and therefore introduce ++ as its own "word" in the language.
But now there is another (albeit smaller) problem. What if the programmer wants to apply unary + twice, and not increment? Token processing rules are needed. So if we determine that whitespace is always a separator for tokens, our programmer may write:
int i = 1;
+ +i;
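To see the difference concretely, here is a small complete program (my own illustration, not from the book) exercising both spellings; with the ++ token in the language, ++i increments once, while + +i applies unary plus twice and leaves the variable unchanged:

#include <iostream>

int main() {
    int i = 1;
    ++i;                      // one ++ token: increment, i becomes 2
    std::cout << i << '\n';   // prints 2

    int j = 1;
    + +j;                     // two + tokens: unary plus applied twice, j stays 1
    std::cout << j << '\n';   // prints 1
}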
Roughly speaking, a C++ implementation starts with a file full of characters, transforms them initially to a sequence of tokens ("words" with meaning in the C++ language), and then checks if the tokens appear in a "sentence" that has some valid meaning.
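As a rough sketch of that first step, here is a minimal tokenizer built around the book's Token class. The kind codes are assumptions for this sketch: '8' tags a number, echoing the book's convention, and 'i' is an arbitrary tag I chose here for ++.

#include <cctype>
#include <iostream>
#include <sstream>
#include <string>
#include <vector>

class Token {
public:
    char kind;     // what kind of token
    double value;  // for numbers: a value
};

// Group a stream of characters into tokens. Whitespace separates
// tokens; "++" is grouped into a single token before a lone '+'.
std::vector<Token> tokenize(const std::string& s)
{
    std::vector<Token> tokens;
    std::istringstream in{s};
    char c;
    while (in.get(c)) {
        if (std::isspace(static_cast<unsigned char>(c)))
            continue;                            // separator, not a token
        if (std::isdigit(static_cast<unsigned char>(c))) {
            in.putback(c);                       // re-read the whole number
            double d;
            in >> d;
            tokens.push_back(Token{'8', d});     // '8' tags a number
        } else if (c == '+' && in.peek() == '+') {
            in.get();                            // consume the second '+'
            tokens.push_back(Token{'i', 0});     // one ++ token
        } else {
            tokens.push_back(Token{c, 0});       // single-character operator
        }
    }
    return tokens;
}

int main()
{
    for (Token t : tokenize("1 ++ + 2"))
        std::cout << t.kind << ' ' << t.value << '\n';
}

Running it on "1 ++ + 2" prints four tokens: the number 1, one ++ token, one + token, and the number 2, which is exactly the grouping decision described above.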