How many characters can UTF-8 encode?

People also ask

What is the range of UTF-8?

UTF-8 Basics. UTF-8 (Unicode Transformation Format, 8-bit) is an encoding defined in ISO/IEC 10646 and in the Unicode Standard. Its four-byte form has room for 2,097,152 values (2^21), more than enough to cover the 1,112,064 valid Unicode code points.

Can UTF-8 represent all characters?

Each UTF can represent any Unicode character that you need to represent. UTF-8 is based on 8-bit code units. Each character is encoded as 1 to 4 bytes. The first 128 Unicode code points are encoded as 1 byte in UTF-8.

How many characters can be encoded 8 bits?

Eight bits are called a byte. A one-byte character set can contain at most 256 characters. The current standard, though, is Unicode, which assigns every character in every writing system a code point in a single set; its encodings use between one and four bytes per character rather than a fixed two.

Does UTF-16 have more characters than UTF-8?

UTF-16 can be more compact where ASCII is not predominant, since it uses 2 bytes for every character in the Basic Multilingual Plane. UTF-8 needs 3 bytes for BMP characters from U+0800 upward, where UTF-16 stays at 2 bytes. Both use 4 bytes outside the BMP, and both encode exactly the same set of characters.

UTF-8 does not use one byte all the time; it uses 1 to 4 bytes.

The first 128 characters (US-ASCII) need one byte.

The next 1,920 characters need two bytes to encode. This covers the remainder of almost all Latin alphabets, and also Greek, Cyrillic, Coptic, Armenian, Hebrew, Arabic, Syriac and Tāna alphabets, as well as Combining Diacritical Marks.

Three bytes are needed for characters in the rest of the Basic Multilingual Plane, which contains virtually all characters in common use[12] including most Chinese, Japanese and Korean [CJK] characters.

Four bytes are needed for characters in the other planes of Unicode, which include less common CJK characters, various historic scripts, mathematical symbols, and emoji (pictographic symbols).

source: Wikipedia
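The byte-length ranges quoted above can be checked with a short Python sketch. The helper name `utf8_length` and the sample characters are illustrative choices, not part of the quoted text.

```python
def utf8_length(ch: str) -> int:
    """Return how many bytes Python's UTF-8 encoder uses for one character."""
    return len(ch.encode("utf-8"))


# One sample character from each length class described above.
samples = ["A", "é", "€", "😀"]  # U+0041, U+00E9, U+20AC, U+1F600

for ch in samples:
    print(f"U+{ord(ch):04X} -> {utf8_length(ch)} byte(s)")

# The boundaries between the classes fall at U+0080, U+0800 and U+10000.
for cp in (0x7F, 0x80, 0x7FF, 0x800, 0xFFFF, 0x10000):
    print(f"U+{cp:04X} -> {utf8_length(chr(cp))} byte(s)")
```

The two-byte class runs from U+0080 to U+07FF, which is where the figure of 1,920 characters comes from (0x800 - 0x80).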

UTF-8 uses 1-4 bytes per character: one byte for ASCII characters (the first 128 Unicode values are the same as ASCII), which requires only 7 bits. If the highest ("sign") bit is set, the byte is part of a multi-byte sequence. In the first byte of such a sequence, the number of consecutive high bits set indicates the number of bytes; these are followed by a 0, and the remaining bits contribute to the value. In each of the other bytes, the highest two bits are 1 and 0, and the remaining 6 bits are for the value.

So a four-byte sequence begins with 11110... (where ... = three bits for the value), followed by three bytes with 6 bits each for the value, yielding a 21-bit value (3 + 3 × 6). 2^21 exceeds the number of Unicode characters, so all of Unicode can be expressed in UTF-8.
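The bit layout described above can be written out by hand. The following is a minimal sketch, not a production encoder: `encode_utf8` is a hypothetical helper name, and it deliberately accepts the full 21-bit range (it does not reject surrogates or values above U+10FFFF), because the point is only to show where the bits go.

```python
def encode_utf8(cp: int) -> bytes:
    """Pack a code point into 1-4 bytes using the UTF-8 bit layout.

    Illustration only: accepts any value that fits in 21 bits.
    """
    if cp < 0 or cp > 0x1FFFFF:
        raise ValueError("value does not fit in a 4-byte sequence")
    if cp < 0x80:
        return bytes([cp])                 # 0xxxxxxx
    if cp < 0x800:
        lead, n_cont = 0b11000000, 1       # 110xxxxx + 1 continuation byte
    elif cp < 0x10000:
        lead, n_cont = 0b11100000, 2       # 1110xxxx + 2 continuation bytes
    else:
        lead, n_cont = 0b11110000, 3       # 11110xxx + 3 continuation bytes

    out = []
    for _ in range(n_cont):
        # Each continuation byte is 10xxxxxx: take the low 6 bits.
        out.append(0b10000000 | (cp & 0b00111111))
        cp >>= 6
    out.append(lead | cp)                  # the bits left over go in the lead byte
    return bytes(reversed(out))


for ch in ("A", "é", "€", "😀"):
    manual = encode_utf8(ord(ch))
    builtin = ch.encode("utf-8")
    print(f"U+{ord(ch):04X}: {manual.hex(' ')} (matches built-in: {manual == builtin})")
```

For valid code points the result agrees with Python's own encoder; for example, U+20AC comes out as `e2 82 ac`.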

Unicode vs UTF-8

Unicode maps code points to characters. UTF-8 is a storage mechanism for Unicode. Unicode has a spec, and UTF-8 has a spec. The two have different limits: UTF-8's upper bound is not the same as Unicode's.

Unicode

Unicode is divided into "planes." Each plane carries 2¹⁶ code points, and there are 17 planes, for a total of 17 * 2^16 = 1,114,112 code points. The first plane, plane 0 or the BMP (Basic Multilingual Plane), is special because it carries most of the characters in common use.

Rather than explain all the nuances, let me just quote the above article on planes.

The 17 planes can accommodate 1,114,112 code points. Of these, 2,048 are surrogates, 66 are non-characters, and 137,468 are reserved for private use, leaving 974,530 for public assignment.
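The figures in that quote can be reproduced with a few lines of arithmetic. The variable names below are illustrative; the ranges used for surrogates, noncharacters and private use are the standard Unicode ones.

```python
PLANES = 17
CODE_POINTS_PER_PLANE = 2 ** 16

total = PLANES * CODE_POINTS_PER_PLANE

# Surrogates: U+D800..U+DFFF
surrogates = 0xDFFF - 0xD800 + 1

# Noncharacters: U+FDD0..U+FDEF (32) plus the last two code points of each plane
noncharacters = 32 + 2 * PLANES

# Private use: U+E000..U+F8FF in the BMP, plus planes 15 and 16
# (each minus its two trailing noncharacters)
private_use = (0xF8FF - 0xE000 + 1) + 2 * (CODE_POINTS_PER_PLANE - 2)

public = total - surrogates - noncharacters - private_use

print(f"total code points: {total:,}")          # 1,114,112
print(f"surrogates:        {surrogates:,}")     # 2,048
print(f"noncharacters:     {noncharacters:,}")  # 66
print(f"private use:       {private_use:,}")    # 137,468
print(f"public assignment: {public:,}")         # 974,530
```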

UTF-8

Now let's go back to the article linked above,

The encoding scheme used by UTF-8 was designed with a much larger limit of 2³¹ code points (32,768 planes), and can encode 2²¹ code points (32 planes) even if limited to 4 bytes.[3] Since Unicode limits the code points to the 17 planes that can be encoded by UTF-16, code points above 0x10FFFF are invalid in UTF-8 and UTF-32.

So you can see that you can put stuff into UTF-8 that isn't valid Unicode. Why? Because UTF-8 accommodates code points that Unicode doesn't even support.

UTF-8, even with a four-byte limitation, supports 2²¹ code points, which is far more than 17 * 2^16.
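To see the gap between what the four-byte bit pattern can hold and what Unicode allows, here is a small sketch using Python's built-in decoder. The helper name `decodes_as_utf8` is hypothetical; the byte sequences are the four-byte patterns for U+10FFFF and for the value 0x110000, one past the Unicode limit.

```python
def decodes_as_utf8(data: bytes) -> bool:
    """Return True if Python's strict UTF-8 decoder accepts the bytes."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


last_valid = bytes([0xF4, 0x8F, 0xBF, 0xBF])     # U+10FFFF, the highest code point
first_invalid = bytes([0xF4, 0x90, 0x80, 0x80])  # bit pattern for 0x110000

print(decodes_as_utf8(last_valid))     # True
print(decodes_as_utf8(first_invalid))  # False

# Python's str type has the same ceiling: chr() refuses values above 0x10FFFF.
try:
    chr(0x110000)
except ValueError as exc:
    print("chr(0x110000) failed:", exc)
```

The second sequence is well-formed as a bit pattern, but a conforming decoder rejects it because the value lies outside the 17 planes.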

According to this table* UTF-8 should support:

2³¹ = 2,147,483,648 characters

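However, RFC 3629 restricted the possible values, so now we're capped at 4 bytes, which leaves room for

2²¹ = 2,097,152 values

RFC 3629 also stops at U+10FFFF and excludes the 2,048 surrogates, so only 1,112,064 of those are valid code points.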

Note that a good chunk of those characters are "reserved" for custom use, which is actually pretty handy for icon-fonts.

* Wikipedia used to show a table with 6 bytes -- they've since updated the article.

2017-07-11: Corrected for double-counting the same code point encoded with multiple bytes

2,164,864 “characters” can be potentially coded by UTF-8.

This number is 2⁷ + 2¹¹ + 2¹⁶ + 2²¹, which comes from the way the encoding works:

1-byte chars have 7 bits for encoding 0xxxxxxx (0x00-0x7F)
2-byte chars have 11 bits for encoding 110xxxxx 10xxxxxx (0xC0-0xDF for the first byte; 0x80-0xBF for the second)
3-byte chars have 16 bits for encoding 1110xxxx 10xxxxxx 10xxxxxx (0xE0-0xEF for the first byte; 0x80-0xBF for continuation bytes)
4-byte chars have 21 bits for encoding 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx (0xF0-0xF7 for the first byte; 0x80-0xBF for continuation bytes)

As you can see this is significantly larger than current Unicode (1,112,064 characters).
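The first-byte ranges listed above are enough to tell how long a sequence is before reading the rest of it. A minimal sketch, assuming a hypothetical helper named `sequence_length` that classifies a byte purely by those ranges:

```python
def sequence_length(first_byte: int) -> int:
    """Return the sequence length implied by a lead byte's high bits."""
    if 0x00 <= first_byte <= 0x7F:
        return 1    # 0xxxxxxx
    if 0xC0 <= first_byte <= 0xDF:
        return 2    # 110xxxxx
    if 0xE0 <= first_byte <= 0xEF:
        return 3    # 1110xxxx
    if 0xF0 <= first_byte <= 0xF7:
        return 4    # 11110xxx
    # 0x80-0xBF are continuation bytes; 0xF8-0xFF are not used by 4-byte UTF-8.
    raise ValueError(f"0x{first_byte:02X} cannot start a sequence")


for ch in ("A", "é", "€", "😀"):
    encoded = ch.encode("utf-8")
    print(f"{encoded.hex(' ')}: lead byte says {sequence_length(encoded[0])}, "
          f"actual length {len(encoded)}")
```

This classifies by bit pattern only. A strict decoder rejects a few lead bytes inside these ranges as well (0xC0, 0xC1 and 0xF5-0xF7), since they can only produce overlong forms or values above U+10FFFF.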

UPDATE

My initial calculation is wrong because it doesn't consider additional rules. See comments to this answer for more details.
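The comments mentioned here are not reproduced on this page, so the following is a sketch under an assumption: that the "additional rules" are the ones in RFC 3629, namely that each code point must use its shortest form, that surrogates U+D800..U+DFFF are excluded, and that nothing above U+10FFFF is allowed. Under those rules the count works out as follows (variable names are illustrative).

```python
# Naive count: every bit pattern of every length, as in the sum above.
naive = 2**7 + 2**11 + 2**16 + 2**21

# Shortest-form rule: each length only covers values the shorter lengths cannot.
one_byte = 0x80                        # U+0000..U+007F
two_byte = 0x800 - 0x80                # U+0080..U+07FF
three_byte = 0x10000 - 0x800 - 0x800   # U+0800..U+FFFF minus 2,048 surrogates
four_byte = 0x110000 - 0x10000         # U+10000..U+10FFFF (capped by RFC 3629)

valid = one_byte + two_byte + three_byte + four_byte

print(f"naive bit patterns: {naive:,}")       # 2,164,864
print(f"1-byte: {one_byte:,}")                # 128
print(f"2-byte: {two_byte:,}")                # 1,920
print(f"3-byte: {three_byte:,}")              # 61,440
print(f"4-byte: {four_byte:,}")               # 1,048,576
print(f"valid code points:  {valid:,}")       # 1,112,064
```

The result equals the 1,112,064 figure quoted for Unicode, which is expected: with those rules applied, UTF-8 encodes exactly the code points Unicode allows and nothing more.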
