
UTF Encoding for "Ü" returns 3 bytes instead of the "real" unicode

I was playing around with the code mentioned in: https://stackoverflow.com/a/21575607/2416394 as I have issues writing proper utf8 xml with TinyXML.

I need to encode LATIN CAPITAL LETTER U WITH DIAERESIS (Ü) so that it is written properly to XML.

Here is the code taken from the post above:

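// Step 1: ANSI code page (CP_ACP) -> UTF-16.
// The first call only queries the required length in wchar_t units.
std::string codepage_str = "Ü";
int size = MultiByteToWideChar( CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                                static_cast<int>( codepage_str.length() ), nullptr, 0 );
std::wstring utf16_str( size, L'\0' );
// The second call performs the actual conversion into the buffer.
MultiByteToWideChar( CP_ACP, MB_COMPOSITE, codepage_str.c_str(),
                     static_cast<int>( codepage_str.length() ), &utf16_str[ 0 ], size );

// Step 2: UTF-16 -> UTF-8, again querying the length first.
int utf8_size = WideCharToMultiByte( CP_UTF8, 0, utf16_str.c_str(),
                                     static_cast<int>( utf16_str.length() ), nullptr, 0,
                                     nullptr, nullptr );
std::string utf8_str( utf8_size, '\0' );
WideCharToMultiByte( CP_UTF8, 0, utf16_str.c_str(),
                     static_cast<int>( utf16_str.length() ), &utf8_str[ 0 ], utf8_size,
                     nullptr, nullptr );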

The result is a std::string with a size of 3 and the following bytes:

-       utf8_str    "Ü"   std::basic_string<char,std::char_traits<char>,std::allocator<char> >
        [size]  0x0000000000000003  unsigned __int64
        [capacity]  0x000000000000000f  unsigned __int64
        [0] 0x55 'U'    char
        [1] 0xcc 'Ì'    char
        [2] 0x88 'ˆ'    char

When I write it to a UTF-8 file, the hex values stay the same (0x55 0xCC 0x88) and Notepad++ shows the proper character Ü.

However, when I add another Ü to the file via Notepad++ and save it again, the newly written Ü is stored as 0xC3 0x9C (which is what I expected in the first place).

I do not understand why I get a 3-byte representation of this character and not the expected Unicode code point U+00DC.

Although Notepad++ displays both correctly, our proprietary system renders 0xC3 0x9C as Ü but breaks on 0x55 0xCC 0x88, failing to recognize it as the same character as the two-byte UTF-8 sequence.

asked Nov 29 '25 02:11 by Samuel


1 Answer

Unicode is complicated. There are at least two different ways to get Ü:

  1. LATIN CAPITAL LETTER U WITH DIAERESIS is Unicode codepoint U+00DC.

  2. LATIN CAPITAL LETTER U is Unicode codepoint U+0055, and COMBINING DIAERESIS is Unicode codepoint U+0308.

U+00DC and U+0055 U+0308 both display as Ü.

In UTF-8, Unicode codepoint U+00DC is encoded as 0xC3 0x9C, U+0055 is encoded as 0x55, and U+0308 is encoded as 0xCC 0x88.
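The byte values above can be checked by hand with a small encoder. This is only an illustrative sketch: the function name encode_utf8 is made up for this example, and it does not validate its input (surrogates and values above U+10FFFF are not rejected).

```cpp
#include <string>

// Encode a single Unicode code point as UTF-8.
// Hypothetical helper for illustration; no input validation is done.
std::string encode_utf8(char32_t cp) {
    std::string out;
    if (cp < 0x80) {
        // 1 byte: 0xxxxxxx
        out += static_cast<char>(cp);
    } else if (cp < 0x800) {
        // 2 bytes: 110xxxxx 10xxxxxx
        out += static_cast<char>(0xC0 | (cp >> 6));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else if (cp < 0x10000) {
        // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xE0 | (cp >> 12));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    } else {
        // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out += static_cast<char>(0xF0 | (cp >> 18));
        out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
        out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
        out += static_cast<char>(0x80 | (cp & 0x3F));
    }
    return out;
}
```

encode_utf8(0x00DC) gives the two bytes 0xC3 0x9C, while encode_utf8(0x0055) + encode_utf8(0x0308) gives the three bytes 0x55 0xCC 0x88 seen in the debugger output in the question.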

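Both sequences are valid UTF-8 and canonically equivalent, so your proprietary system seems to have a bug: it does not handle combining characters.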

Edit: to get what you expect, according to the MultiByteToWideChar() documentation, use MB_PRECOMPOSED instead of MB_COMPOSITE.
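A minimal sketch of that change, assuming Windows and the same CP_ACP source encoding as in the question. The function name acp_to_utf8_precomposed is hypothetical, and the block is guarded with _WIN32 because it relies on the Win32 API.

```cpp
#ifdef _WIN32  // Win32-only sketch; compiles to nothing on other platforms
#include <windows.h>
#include <string>

// Hypothetical helper: ANSI code page -> UTF-16 (precomposed) -> UTF-8.
std::string acp_to_utf8_precomposed(const std::string& codepage_str) {
    if (codepage_str.empty()) return {};
    const int src_len = static_cast<int>(codepage_str.size());

    // MB_PRECOMPOSED asks for single code points such as U+00DC
    // instead of a base letter followed by a combining mark.
    int wide_len = MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED,
                                       codepage_str.data(), src_len, nullptr, 0);
    if (wide_len <= 0) return {};
    std::wstring utf16(static_cast<size_t>(wide_len), L'\0');
    MultiByteToWideChar(CP_ACP, MB_PRECOMPOSED,
                        codepage_str.data(), src_len, &utf16[0], wide_len);

    // UTF-16 -> UTF-8; flags stay 0 as in the original code.
    int utf8_len = WideCharToMultiByte(CP_UTF8, 0, utf16.data(), wide_len,
                                       nullptr, 0, nullptr, nullptr);
    if (utf8_len <= 0) return {};
    std::string utf8(static_cast<size_t>(utf8_len), '\0');
    WideCharToMultiByte(CP_UTF8, 0, utf16.data(), wide_len,
                        &utf8[0], utf8_len, nullptr, nullptr);
    return utf8;
}
#endif
```

Assuming the ANSI code page is Windows-1252, passing the single byte 0xDC should then yield 0xC3 0x9C rather than 0x55 0xCC 0x88. The documentation describes MB_PRECOMPOSED as the default, so it is spelled out here mainly to contrast with MB_COMPOSITE.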

answered Dec 01 '25 17:12 by RemcoGerlich


