Why does C99 have such an odd restriction on universal character names?

Question

6.4.3 Universal character names

A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.

Besides the fact that it is no longer "universal" with restrictions like this, I can't think of a good reason for such a restriction. Does anyone know the backstory?

rici · Accepted Answer

D800 through DFFF inclusive do not identify characters; they are the high and low surrogates, which appear only in pairs in the UTF-16 encoding, where they represent code points outside the Basic Multilingual Plane.
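To make the surrogate point concrete, here is a minimal sketch of how UTF-16 splits a code point above FFFF into a pair. The helper name `utf16_surrogates` is made up for illustration, and it assumes the caller passes a value in the range 10000 through 10FFFF.

```c
#include <stdint.h>

/* Hypothetical helper, for illustration only.
 * Splits a code point in the range 0x10000..0x10FFFF into its UTF-16
 * surrogate pair. The caller is assumed to have checked the range. */
static void utf16_surrogates(uint32_t cp, uint16_t *high, uint16_t *low)
{
    uint32_t v = cp - 0x10000;                 /* 20-bit offset past the BMP */
    *high = (uint16_t)(0xD800 + (v >> 10));    /* top 10 bits:    D800..DBFF */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));  /* bottom 10 bits: DC00..DFFF */
}
```

Every value the function produces lands in D800 through DFFF, which is why that range is set aside and cannot name a character by itself.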

The other restriction keeps a universal character name from colliding with a character that could be represented in the basic C character set, for the benefit of compilers which don't bother resolving universal character names into their Unicode equivalents. So the compiler is under no obligation to recognize a + written as \u002B or to know that a and \u0061 represent the same name. ($, @ and ` are not part of the basic source character set and are not valid in a C program outside of comments, character constants and string literals, so they do not require any special attention from the lexer.)

The range of code points less than A0 also includes control characters and whitespace. (C does not consider \u00A0 to be whitespace.)
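Put together, the quoted constraint amounts to a small predicate. This is only a sketch of the rule as worded in 6.4.3 above, not how any particular compiler implements it, and the function name `ucn_allowed_c99` is invented for the example.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical predicate, for illustration only.
 * Returns true if the C99 6.4.3 constraint quoted above permits writing
 * code point cp as a universal character name. It checks only that
 * constraint and does not validate cp against ISO/IEC 10646 otherwise. */
static bool ucn_allowed_c99(uint32_t cp)
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;                          /* surrogate range */
    if (cp < 0x00A0)                           /* basic characters, controls, whitespace */
        return cp == 0x0024 || cp == 0x0040 || cp == 0x0060;  /* $, @, ` */
    return true;
}
```

Under this reading, \u00E9 is accepted, while \u0061 (a), \u002B (+) and \uD800 are all rejected.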

Tags:

c

unicode

an0
