I've recently been reading up on the UTF-8 variable-width encoding, and I found it strange that UTF-8 specifies the first two bits of every continuation byte to be 10.
 Range           |  Encoding
-----------------+-----------------
     0 - 7f      |  0xxxxxxx
    80 - 7ff     |  110xxxxx 10xxxxxx
   800 - ffff    |  1110xxxx 10xxxxxx 10xxxxxx
 10000 - 10ffff  |  11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
I was playing around with other possible variable-width encodings, and found that with the following scheme, at most 3 bytes are necessary to store all of Unicode. If the first bit of a byte is 1, the character continues in at least one more byte (keep reading until you reach a byte whose first bit is 0).
 Range           |  Encoding
-----------------+-----------------
     0 - 7f      |  0xxxxxxx
    80 - 407f    |  1xxxxxxx 0xxxxxxx
  4080 - 20407f  |  1xxxxxxx 1xxxxxxx 0xxxxxxx
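To make the proposed scheme concrete, here is a minimal decoder sketch in Python. The function name `decode_proposed` is made up for illustration, and the offsets 0x80 and 0x4080 are an assumption read off the ranges in the table (each length starts where the shorter one ends).

```python
def decode_proposed(data: bytes) -> list:
    """Decode the proposed scheme: a set high bit means another byte follows."""
    # Assumed offsets, taken from the range column of the table above.
    offsets = {1: 0x0, 2: 0x80, 3: 0x4080}
    code_points = []
    value = 0
    length = 0
    for byte in data:
        value = (value << 7) | (byte & 0x7F)  # 7 payload bits per byte
        length += 1
        if length > 3:
            raise ValueError("sequence longer than 3 bytes")
        if (byte & 0x80) == 0:  # high bit clear: last byte of this character
            code_points.append(value + offsets[length])
            value = 0
            length = 0
    if length:
        raise ValueError("truncated sequence")
    return code_points
```

Note that the decoder has to inspect every byte to find where a character ends, and a byte with the high bit clear may be either a complete ASCII character or the tail of a longer sequence.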
Are the continuation bits in UTF-8 really that important? The second encoding seems much more efficient.
UTF-8 is self-validating, fast to step forward through, and easier to step backward through.
Self-validating: Since the first byte in the sequence specifies the length, the next X bytes must fit 10xxxxxx, or you have an invalid sequence. Seeing a 10xxxxxx byte by itself is immediately recognizable as invalid.
Your suggested encoding has no validation built-in.
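As a rough illustration of that check, the sketch below verifies only the lead-byte/continuation-byte structure. The function name `has_valid_structure` is hypothetical, and a real validator would also have to reject overlong forms, surrogates, and values above U+10FFFF, which this does not do.

```python
def has_valid_structure(data: bytes) -> bool:
    """Check that every lead byte is followed by the right number of 10xxxxxx bytes."""
    i = 0
    while i < len(data):
        lead = data[i]
        if lead < 0x80:             # 0xxxxxxx: single byte
            n = 1
        elif 0xC0 <= lead < 0xE0:   # 110xxxxx: one continuation byte
            n = 2
        elif 0xE0 <= lead < 0xF0:   # 1110xxxx: two continuation bytes
            n = 3
        elif 0xF0 <= lead < 0xF8:   # 11110xxx: three continuation bytes
            n = 4
        else:
            return False            # stray 10xxxxxx byte or invalid lead byte
        tail = data[i + 1:i + n]
        if len(tail) != n - 1 or any((b & 0xC0) != 0x80 for b in tail):
            return False            # truncated, or a non-continuation byte
        i += n
    return True
```

A lone continuation byte such as `b"\xa2"` or a truncated sequence such as `b"\xc2"` is rejected without any knowledge of what the text was supposed to say.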
Fast on step forward: If you have to skip the character, you can immediately skip X bytes as determined by the first byte, without having to examine each intermediate byte.
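A small sketch of that skipping, assuming well-formed input (the names `utf8_length` and `char_offsets` are invented for this example):

```python
def utf8_length(lead: int) -> int:
    """Return the sequence length announced by a UTF-8 lead byte."""
    if lead < 0x80:
        return 1
    if lead >> 5 == 0b110:
        return 2
    if lead >> 4 == 0b1110:
        return 3
    if lead >> 3 == 0b11110:
        return 4
    raise ValueError("not a lead byte")


def char_offsets(data: bytes) -> list:
    """Collect the start offset of each character, reading lead bytes only."""
    offsets = []
    i = 0
    while i < len(data):
        offsets.append(i)
        i += utf8_length(data[i])  # jump over the continuation bytes unread
    return offsets
```

For `"a¢€😀".encode("utf-8")` the loop touches only four of the ten bytes.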
Easier to step backward: If you have to read the bytes backwards, you can immediately recognize a continuation byte by its 10xxxxxx pattern. You can then scan backwards past the 10xxxxxx bytes until you hit the 11xxxxxx lead byte, without ever having to look at anything before the lead byte.
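Stepping backward might look roughly like this. `prev_char_start` is a hypothetical helper and it assumes the input is well-formed UTF-8:

```python
def prev_char_start(data: bytes, index: int) -> int:
    """Return the start offset of the character that ends just before `index`."""
    if index <= 0 or index > len(data):
        raise ValueError("index out of range")
    i = index - 1
    # Continuation bytes all look like 10xxxxxx; walk back over them.
    while i > 0 and (data[i] & 0xC0) == 0x80:
        i -= 1
    return i  # now at a lead byte or an ASCII byte
```

The loop stops on the lead byte itself. In the proposed scheme you would instead have to read one byte further back, into the previous character, to learn that the current one has ended.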
See UTF-8 Invalid sequences and error handling on Wikipedia.
Apart from ease of iteration as already mentioned: UTF-8 aims to be safe for ASCII-based (and other UTF-8-unaware) tools to process through such common manipulations as searching, concatenation, replacing, and escaping.
The advantages of ASCII-compatibility for interop and security outweigh the cost of using an extra byte for characters U+0800 to U+407F.
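To see what "safe for ASCII-based tools" means in practice, here is a quick check using Python's built-in UTF-8 codec. The helper name and the sample string are just for illustration.

```python
def non_ascii_chars_avoid_ascii_bytes(text: str) -> bool:
    """True if no non-ASCII character produces a byte in the ASCII range."""
    for ch in text:
        if ord(ch) > 0x7F and any(b < 0x80 for b in ch.encode("utf-8")):
            return False
    return True


sample = 'say "¢" or "€"'
encoded = sample.encode("utf-8")

# A byte-oriented tool counting 0x22 bytes sees exactly the real quote marks.
quote_bytes = encoded.count(b'"')
quote_chars = sample.count('"')
```

Because every byte of a multi-byte character has its high bit set, a byte-level search for an ASCII delimiter cannot match in the middle of a character.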
80 - 407f | 1xxxxxxx 0xxxxxxx
So there were a few East Asian multibyte encodings that did it like that, with some unfortunate results which UTF-8 was specifically trying to avoid.
In this proposed scheme the continuation bytes now overlap with ASCII, and many ASCII characters have special meanings to different languages and tools. So if you want to say ¢ (U+00A2), that's 0x80,0x22, and the second byte of that looks like a " to any tool that manipulates byte strings without supporting the proposed encoding and knowing that this data uses it.
Cue security holes in everything that combines user input into control flow. SQL injection in queries, HTML injection on web pages, command injection in shell scripts and so on.
(The East Asian multibyte encodings weren't quite as bad as this encoding here, as they didn't reuse the ASCII control codes as continuation bytes. As proposed, text using this encoding can't be stored in a C null-terminated string, for example. Still, Shift-JIS and friends caused a whole bunch of security holes and we are all very glad to be rid of them.)
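Here is a sketch of an encoder for the proposed scheme that shows both problems. `encode_proposed` is a made-up name, and the offsets 0x80 and 0x4080 are an assumption taken from the ranges in the question's table.

```python
def encode_proposed(cp: int) -> bytes:
    """Encode one code point in the question's proposed scheme."""
    if cp < 0:
        raise ValueError("negative code point")
    if cp < 0x80:
        return bytes([cp])
    if cp < 0x4080:
        v = cp - 0x80  # assumed offset: 2-byte range starts at 0x80
        return bytes([0x80 | (v >> 7), v & 0x7F])
    if cp < 0x204080:
        v = cp - 0x4080  # assumed offset: 3-byte range starts at 0x4080
        return bytes([0x80 | (v >> 14), 0x80 | ((v >> 7) & 0x7F), v & 0x7F])
    raise ValueError("code point too large for 3 bytes")


cent = encode_proposed(0x00A2)  # bytes 0x80 0x22: the tail is an ASCII quote
pad = encode_proposed(0x0080)   # bytes 0x80 0x00: contains a NUL byte

cent_has_quote = b'"' in cent
pad_has_nul = b"\x00" in pad
utf8_cent_has_quote = b'"' in "¢".encode("utf-8")  # UTF-8 gives 0xC2 0xA2
```

Any byte-oriented tool that escapes or splits on `"` would cut the ¢ in half, and a C string holding U+0080 would end after the first byte.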