Let's take this table of characters with their hex encodings in Unicode and UTF-8.
Does anyone know how to convert UTF-8 bytes to a Unicode code point using only math operations?
E.g. let's take the first row. Given 227, 129, 130, how do I get 12354?
Is there a simple way to do it using only math operations?
| Unicode code point, hex (decimal) | UTF-8 bytes, hex (decimal) | Char |
|---|---|---|
| 30 42 (12354) | e3 (227) 81 (129) 82 (130) | あ |
| 30 44 (12356) | e3 (227) 81 (129) 84 (132) | い |
| 30 46 (12358) | e3 (227) 81 (129) 86 (134) | う |
* Source: https://www.utf8-chartable.de/unicode-utf8-table.pl?start=12288&unicodeinhtml=hex
This video is the perfect source (watch from 6:15), but here is a summary of it and a code sample in Go. The letters mark the bits taken from each UTF-8 byte; hopefully that makes sense. Once you understand the logic, it's easy to apply bitwise operators:
| Bytes | Char | UTF-8 bytes | Unicode code point | Explanation |
|---|---|---|---|---|
| 1-byte (ASCII) | E | Byte 1: 0xxx xxxx; example: 0100 0101 or 0x45 | 0xxx xxxx; example: 0100 0101 or U+0045 | No conversion needed: the value is the same in UTF-8 and as a Unicode code point |
| 2-byte | Ê | Byte 1: 110x xxxx; byte 2: 10yy yyyy; example: 1100 0011 1000 1010 or 0xC38A | 0xxx xxyy yyyy; example: 0000 1100 1010 or U+00CA | 1. Low 5 bits of the 1st byte 2. Low 6 bits of the 2nd byte |
| 3-byte | あ | Byte 1: 1110 xxxx; byte 2: 10yy yyyy; byte 3: 10zz zzzz; example: 1110 0011 1000 0001 1000 0010 or 0xE38182 | xxxx yyyy yyzz zzzz; example: 0011 0000 0100 0010 or U+3042 | 1. Low 4 bits of the 1st byte 2. Low 6 bits of the 2nd byte 3. Low 6 bits of the 3rd byte |
| 4-byte | 𐄟 | Byte 1: 1111 0xxx; byte 2: 10yy yyyy; byte 3: 10zz zzzz; byte 4: 10ww wwww; example: 1111 0000 1001 0000 1000 0100 1001 1111 or 0xF090849F | 000x xxyy yyyy zzzz zzww wwww; example: 0000 0001 0000 0001 0001 1111 or U+1011F | 1. Low 3 bits of the 1st byte 2. Low 6 bits of the 2nd byte 3. Low 6 bits of the 3rd byte 4. Low 6 bits of the 4th byte |
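Since the question asks for plain math rather than bitwise operators, here is a rough sketch of the 3-byte row using only subtraction, multiplication and addition. Subtracting 224 (or 128) removes the marker prefix, and multiplying by 4096 or 64 plays the role of shifting left by 12 or 6 bits. The name `codePoint3` is just something I made up for illustration, and the function assumes the input really is a well-formed 3-byte sequence:

```go
package main

import "fmt"

// codePoint3 is a hypothetical helper: it decodes a 3-byte UTF-8 sequence
// using arithmetic only. It assumes well-formed input (lead byte 224..239,
// continuation bytes 128..191) and performs no validation.
func codePoint3(b1, b2, b3 int) int {
	x := b1 - 224 // drop the 1110 prefix (224 = 1110 0000)
	y := b2 - 128 // drop the 10 prefix (128 = 1000 0000)
	z := b3 - 128 // drop the 10 prefix
	// 4096 = 2^12 and 64 = 2^6, i.e. the same as shifting by 12 and 6 bits.
	return x*4096 + y*64 + z
}

func main() {
	fmt.Println(codePoint3(227, 129, 130)) // あ: prints 12354
	fmt.Println(codePoint3(227, 129, 132)) // い: prints 12356
	fmt.Println(codePoint3(227, 129, 134)) // う: prints 12358
}
```

For the first row this works out to (227 - 224) * 4096 + (129 - 128) * 64 + (130 - 128) = 12288 + 64 + 2 = 12354.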
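```go
// All three functions assume well-formed UTF-8 input and do no validation.

// decode2 converts a 2-byte sequence (110x xxxx, 10yy yyyy) to a code point.
func decode2(byte1, byte2 byte) rune {
	int1 := rune(byte1&0b_0001_1111) << 6 // low 5 bits of the 1st byte
	int2 := rune(byte2 & 0b_0011_1111)    // low 6 bits of the 2nd byte
	return int1 + int2
}

// decode3 converts a 3-byte sequence (1110 xxxx, 10yy yyyy, 10zz zzzz).
func decode3(byte1, byte2, byte3 byte) rune {
	int1 := rune(byte1&0b_0000_1111) << 12 // low 4 bits of the 1st byte
	int2 := rune(byte2&0b_0011_1111) << 6  // low 6 bits of the 2nd byte
	int3 := rune(byte3 & 0b_0011_1111)     // low 6 bits of the 3rd byte
	return int1 + int2 + int3
}

// decode4 converts a 4-byte sequence (1111 0xxx, 10yy yyyy, 10zz zzzz, 10ww wwww).
func decode4(byte1, byte2, byte3, byte4 byte) rune {
	int1 := rune(byte1&0b_0000_0111) << 18 // low 3 bits of the 1st byte
	int2 := rune(byte2&0b_0011_1111) << 12 // low 6 bits of the 2nd byte
	int3 := rune(byte3&0b_0011_1111) << 6  // low 6 bits of the 3rd byte
	int4 := rune(byte4 & 0b_0011_1111)     // low 6 bits of the 4th byte
	return int1 + int2 + int3 + int4
}
```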
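If you don't know the sequence length up front, the lead byte tells you: the prefixes from the table (0, 110, 1110, 1111 0) are distinct. Below is a minimal sketch that picks the length from the first byte and then folds in 6 bits per continuation byte. `decode` is a hypothetical name of mine, it assumes well-formed input, and it skips the validation a real decoder needs (overlong forms, surrogates, truncated input). The `main` function compares the result with the standard library's `utf8.DecodeRuneInString` as a sanity check:

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// decode is a hypothetical helper: it reads the sequence length from the
// lead byte, then appends the low 6 bits of each continuation byte.
// It assumes b holds one complete, well-formed UTF-8 sequence.
func decode(b []byte) rune {
	var r rune // payload bits collected so far
	var n int  // number of continuation bytes
	switch {
	case b[0] < 0x80: // 0xxx xxxx: ASCII, value is the code point
		return rune(b[0])
	case b[0]>>5 == 0b110: // 110x xxxx: 2-byte sequence
		r, n = rune(b[0]&0x1F), 1
	case b[0]>>4 == 0b1110: // 1110 xxxx: 3-byte sequence
		r, n = rune(b[0]&0x0F), 2
	default: // 1111 0xxx: 4-byte sequence (assumed, not checked)
		r, n = rune(b[0]&0x07), 3
	}
	for _, c := range b[1 : n+1] {
		r = r<<6 | rune(c&0x3F) // make room for 6 bits, then add them
	}
	return r
}

func main() {
	for _, s := range []string{"E", "Ê", "あ", "𐄟"} {
		want, _ := utf8.DecodeRuneInString(s)
		got := decode([]byte(s))
		fmt.Printf("%s: U+%04X, same as unicode/utf8: %v\n", s, got, got == want)
	}
}
```

In real code you would normally just use `unicode/utf8` (or range over a string), since it also handles invalid input; the sketch is only meant to show the bit manipulation.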