I'm reading the ECMAScript Specification, 5th edition, but there is a point that is not entirely clear to me.
In Section 6 - Source Text, the specification defines a source character as follows:
SourceCharacter::
any Unicode code unit
and right after that it says:
Throughout the rest of this document, the phrase "code unit" and the word "character" will be used to refer to a 16-bit unsigned value used to represent a single 16-bit unit of text. The phrase "Unicode character" will be used to refer to the abstract linguistic or typographical unit represented by a single Unicode scalar value (which may be longer than 16 bits and thus may be represented by more than one code unit).
I think this sentence is a bit ambiguous, because someone (as I did initially) might think that the only allowed characters are those between 0 and 65535 in the Unicode table.
So, is the sentence ambiguous, or can only characters in the range 0-65535 be used?
It is intentional: they're telling you that any code unit is allowed, and then clarifying that, after the definition of SourceCharacter, the word "character" typically means a code unit rather than a Unicode character.
Note that in UTF-16 a code unit is different from a code point.
Every code unit is 16 bits, but a single code point may be represented by more than one code unit.
For example, "💩" is a single Unicode code point (U+1F4A9), but in UTF-16 it is encoded as two code units (a surrogate pair):
"💩".charCodeAt(0) // 55357
"💩".charCodeAt(1) // 56589