How to accurately count tokens in a C program? Do preprocessor directives like #include and #define count?

I'm studying how tokenization works in C, mainly for exam preparation. I want to understand how many tokens are present in the following C code and how preprocessor directives are counted as tokens.

Here’s the code I'm analyzing:

#include <stdio.h> 
#define PI 3.14159
#define SQUARE(x) ((x) * (x))

int main() {
    float radius = 5.0;
    float area = PI * SQUARE(radius);
    printf("Area of circle: %.2f\n", area);
    return 0;
}

I'm confused about the following:

  1. Are preprocessor directives like #include and #define counted when tokenizing in C (when asked as a question in an exam paper)?
  2. If these count as tokens, is <stdio.h> considered a single token or separate tokens like <, stdio.h, >?
  3. Should we count macro parameters like (x) and the macro body tokens?
  4. How many total tokens are there in this code from a C compiler's perspective?
  5. What is the exact rule or reference in the C standard or GCC documentation that explains this clearly?
asked Oct 14 '25 by rayhanrjt

1 Answer

  1. Are preprocessor directives like #include and #define counted when tokenizing in C (when asked as a question in an exam paper)?

The C standard specifies translation conceptually: the source text is first parsed into preprocessing tokens, so # and include are each preprocessing tokens. Then preprocessing is performed. After that, the remaining preprocessing tokens are converted to tokens, and the main compilation occurs. (This is a conceptual order in the C standard, not necessarily the actual order used in a compiler.)

You will have to determine whether you want to count preprocessing tokens before preprocessing or count tokens after preprocessing.
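
As a quick illustration of why that choice matters, here is one line from the question counted both ways (my own count, worth rechecking):

/* Before preprocessing: 10 preprocessing tokens.
   float area = PI * SQUARE(radius);
   float  area  =  PI  *  SQUARE  (  radius  )  ;

   After preprocessing, PI and SQUARE(radius) have been expanded, and the
   same line holds 15 tokens:
   float area = 3.14159 * ((radius) * (radius));
   float  area  =  3.14159  *  (  (  radius  )  *  (  radius  )  )  ;

   Counting after preprocessing also includes every token pulled in by the
   contents of <stdio.h>, which is usually not what an exam question wants. */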

#include and #define are not tokens (preprocessing tokens or tokens). The tokens are #, include, #, and define.

  2. If these count as tokens, is <stdio.h> considered a single token or separate tokens like <, stdio.h, >?

The tokens in #include <stdio.h> are #, include, and <stdio.h>. There is a special sub-grammar for header names that results in <stdio.h> being a single token. Outside of an #include directive and certain other places where a header name is expected, <stdio.h> would be multiple tokens: <, stdio, ., h, and >. This means any parser that counts tokens must be context-sensitive.
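
A minimal sketch of that context dependence, using a hypothetical struct and variable named so that the same characters appear in an ordinary expression:

#include <stdio.h>                /* <stdio.h> is one header-name token here */

struct pair { int h; };

int main(void) {
    struct pair stdio = { 4 };
    /* Outside an #include, the characters < stdio.h > lex as five tokens:
       <  stdio  .  h  >  -- here they are just parts of an ordinary
       expression. (Some compilers warn about the chained comparison; it is
       intentional here.) */
    int flag = 3 < stdio.h > 1;   /* (3 < 4) > 1  ->  1 > 1  ->  0 */
    printf("%d\n", flag);
    return 0;
}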

  3. Should we count macro parameters like (x) and the macro body tokens?

A token is a token: macro parameters such as x and every token in the macro's replacement list (each parenthesis, each x, and the *) count like any other preprocessing tokens.

  4. How many total tokens are there in this code from a C compiler's perspective?

I manually counted 53 preprocessing tokens but easily could have made a mistake.
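
One way to arrive at that figure is to count line by line with the splits described above (a sketch; recount before relying on it):

/* Per-line preprocessing-token counts for the code in the question:
   #include <stdio.h>                        ->  3  (#  include  <stdio.h>)
   #define PI 3.14159                        ->  4  (#  define  PI  3.14159)
   #define SQUARE(x) ((x) * (x))             -> 15  (#  define  SQUARE  (  x  )  (  (  x  )  *  (  x  )  ))
   int main() {                              ->  5  (int  main  (  )  {)
   float radius = 5.0;                       ->  5  (float  radius  =  5.0  ;)
   float area = PI * SQUARE(radius);         -> 10  (float  area  =  PI  *  SQUARE  (  radius  )  ;)
   printf("Area of circle: %.2f\n", area);   ->  7  (printf  (  "Area of circle: %.2f\n"  ,  area  )  ;)
   return 0;                                 ->  3  (return  0  ;)
   }                                         ->  1  (})
   Total                                     -> 53
*/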

  5. What is the exact rule or reference in the C standard or GCC documentation that explains this clearly?

There is no single rule. The grammar for a preprocessing token starts in C 2024 6.4.1 where it defines preprocessing-token to be one of header-name, identifier, pp-number, character-constant, string-literal, punctuator, a universal character name that cannot be one of the aforementioned, or a non-white-space character that cannot be one of the aforementioned. Definitions of those continue in other parts of the C standard. To count tokens, you will have to parse at least the preprocessing grammar of the C standard.
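
As a small illustration of those categories applied to the question's code (my own classification, worth checking against the standard's definitions):

/* Preprocessing-token categories for the code in the question (6.4.1):
   header-name     : <stdio.h>   (only inside the #include directive)
   identifier      : include, define, PI, SQUARE, x, int, main, float,
                     radius, area, printf, return
   pp-number       : 3.14159, 5.0, 0
   string-literal  : "Area of circle: %.2f\n"
   punctuator      : #, (, ), {, }, =, *, ,, ;
   Note that keywords such as int and return are still identifiers at this
   stage; they only become keyword tokens when preprocessing tokens are
   converted to tokens. */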

answered Oct 17 '25 by Eric Postpischil