
Why does C99 have such an odd restriction on universal character names?

Tags:

c

unicode

6.4.3 Universal character names

A universal character name shall not specify a character whose short identifier is less than 00A0 other than 0024 ($), 0040 (@), or 0060 (`), nor one in the range D800 through DFFF inclusive.

Besides the fact that the names are no longer "universal" with restrictions like this, I can't think of a good reason for such a restriction. Does anyone know the backstory?

an0 asked Oct 19 '25 04:10


1 Answer

D800 through DFFF inclusive do not designate characters; they are the high and low surrogate code points, which are only meaningful in pairs in the UTF-16 encoding, where a pair represents one code point outside the Basic Multilingual Plane.
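To make the surrogate mechanism concrete, here is a minimal sketch of how UTF-16 splits a code point above U+FFFF into a high/low pair. The function name `to_surrogates` is made up for illustration; it is not part of any standard library.

```c
#include <stdint.h>

/* Hypothetical helper (illustration only): split a code point in the
 * range U+10000..U+10FFFF into a UTF-16 surrogate pair.
 * Returns 0 on success, -1 if cp is outside that range. */
int to_surrogates(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    if (cp < 0x10000 || cp > 0x10FFFF)
        return -1;

    uint32_t v = cp - 0x10000;              /* 20-bit offset */
    *hi = (uint16_t)(0xD800 + (v >> 10));   /* top 10 bits: D800..DBFF */
    *lo = (uint16_t)(0xDC00 + (v & 0x3FF)); /* low 10 bits: DC00..DFFF */
    return 0;
}
```

For example, U+1F600 comes out as the pair D83D DE00. Neither half names a character on its own, which is why a lone `\uD83D` would be meaningless as a universal character name.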

The other restriction keeps a universal character name from colliding with a character that can already be written in the basic C character set, for the benefit of compilers which don't bother resolving universal character names into their Unicode equivalents. So the compiler is under no obligation to recognize a + written as \u002B, or to know that a and \u0061 represent the same name. ($, @ and ` are not part of the basic character set and are not valid in a C program outside of comments, character constants and string literals, so they do not require any special attention from the lexer.)

The range of code points less than A0 also includes control characters and whitespace. (C does not consider \u00A0 to be whitespace.)
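Putting the two constraints together, the check a C99 lexer would apply can be sketched roughly like this. `ucn_allowed_c99` is a made-up name for illustration, not something from the standard or from any particular compiler, and it only encodes the 6.4.3 sentence quoted in the question.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical helper (illustration only): true if the code point cp
 * may be written as \uXXXX or \UXXXXXXXX under the C99 6.4.3
 * constraint quoted in the question. */
bool ucn_allowed_c99(uint32_t cp)
{
    /* Surrogates D800..DFFF never name a character on their own. */
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return false;

    /* Below 00A0, only $, @ and ` are permitted. */
    if (cp < 0x00A0)
        return cp == 0x0024 || cp == 0x0040 || cp == 0x0060;

    return true;
}
```

With this, `ucn_allowed_c99(0x002B)` and `ucn_allowed_c99(0x0061)` are both false, matching the + and a examples above, while `ucn_allowed_c99(0x00A0)` and `ucn_allowed_c99(0x0024)` are true.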

rici answered Oct 21 '25 19:10