I'm studying how tokenization works in C programming, especially for examination purposes. I want to understand how many tokens are present in the following C code and how preprocessor directives are counted as tokens.
Here’s the code I'm analyzing:
#include <stdio.h>
#define PI 3.14159
#define SQUARE(x) ((x) * (x))
int main() {
float radius = 5.0;
float area = PI * SQUARE(radius);
printf("Area of circle: %.2f\n", area);
return 0;
}
I'm confused about the following:
- Are preprocessor directives like #include and #define counted when tokenizing in the C language (when this is asked as a question in an exam paper)?
The C standard specifies the grammar and analysis as parsing the source text into preprocessing tokens, which include # and include as preprocessing tokens. Then preprocessing is performed. After that, all preprocessing tokens are converted to tokens, and the main compilation occurs. (This is a conceptual order in the C standard, not necessarily the actual order used in a compiler.)
You will have to determine whether you want to count preprocessing tokens before preprocessing or count tokens after preprocessing.
#include and #define are not tokens (neither preprocessing tokens nor tokens). The tokens are #, include, #, and define.
- If these count as tokens, is <stdio.h> considered a single token or separate tokens like <, stdio.h, >?
The tokens in #include <stdio.h> are #, include, and <stdio.h>. There is a special sub-grammar for header names that results in <stdio.h> being a single token. Outside of an #include directive and certain other places where a header name is expected, <stdio.h> would be multiple tokens: <, stdio, ., h, and >. This means any parser that counts tokens must be context-dependent.
- Should we count macro parameters like (x) and macro body tokens?
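A token is a token: the parameter list and the replacement list of a macro definition are made of preprocessing tokens like any others, so they count.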
- How many total tokens are there in this code from a C compiler's perspective?
I manually counted 53 preprocessing tokens (3, 4, and 15 in the three directives, then 31 in the definition of main), but I could easily have made a mistake.
- What is the exact rule or reference from the C standard or GCC documentation that explains this clearly?
There is no single rule. The grammar for a preprocessing token starts in C 2024 6.4.1 where it defines preprocessing-token to be one of header-name, identifier, pp-number, character-constant, string-literal, punctuator, a universal character name that cannot be one of the aforementioned, or a non-white-space character that cannot be one of the aforementioned. Definitions of those continue in other parts of the C standard. To count tokens, you will have to parse at least the preprocessing grammar of the C standard.