
Convert universal character name to UTF-8 in C

I need to convert universal character name (UCN) data from a database to UTF-8. Seems trivial, but I spent hours reading about unicode, UTF-8, wide strings, ... without any result.

As an example, the string D\u00c3\u00bcsseldorf needs to be converted to Düsseldorf.

What I tried:

char str[] = "\u00c3\u00bc"; // corresponds to ü
size_t str_len = strlen(str);
for (i = 0; i < str_len; i++)
    printf("%02hhx ", str[i]);
printf("- %zu - %s\n", str_len, str); // prints "c3 83 c2 bc - 4 - Ã¼"

c3 is correct, but the next 3 bytes are unexpected.
The compiler only considers the first part of the UCN (\u00c3).

wchar_t wcs[] = L"\u00c3\u00bc";
size_t wcs_len = wcslen(wcs);
for (i = 0; i < wcs_len; i++)
    printf("%02hhx ", wcs[i]);
printf("- %zu - %ls\n", wcs_len, wcs); // prints "c3 bc - 2 - Ã¼"

Looks better.
The entire UCN is considered (c3 bc), but still no ü.

char str[] = "\xc3\xbc";
size_t str_len = strlen(str);
for (i = 0; i < str_len; i++)
    printf("%02hhx ", str[i]);
printf("- %zu %s\n", str_len, str); // prints "c3 bc - 2 ü"

This prints the ü, but I modified str from UCN to hex code.

What am I missing to get from \u00c3\u00bc to ü?

--- UPDATE ---

As Rob Napier described, I have to change the initial string literal, since it was badly (double) encoded. I believe the only solution is to manually change "D\u00c3\u00bcsseldorf" to "Düsseldorf" or "D\u00fcsseldorf". Both ways require a manual change.

Changing it to "D\xc3\xbcsseldorf" produces the correct result "Düsseldorf", but only by coincidence because the byte following the second byte injection (\xbc) is non-hex (the letter s). "AAA\xc3\xbcBBB" gives "AAAû" (0x41 0x41 0x41 0xc3 0xbb). Too bad that \x in a string literal doesn't stop after 1 byte. See this.

geohei asked Nov 05 '25 08:11


2 Answers

char str[] = "\u00c3\u00bc"; // corresponds to ü

This is where you went wrong. This is not ü. This is Ã¼, just as is being output.

  • Ã: LATIN CAPITAL LETTER A WITH TILDE
  • ¼: VULGAR FRACTION ONE QUARTER

The UCN for ü is \u00fc: LATIN SMALL LETTER U WITH DIAERESIS

$ uni print c3 bc
     CPoint  Dec    UTF8        HTML       Name (Cat)
'¼'  U+00BC  188    c2 bc       &frac14;   VULGAR FRACTION ONE QUARTER (Other_Number)
'Ã'  U+00C3  195    c3 83       &Atilde;   LATIN CAPITAL LETTER A WITH TILDE (Uppercase_Letter)

$ uni id ü
     CPoint  Dec    UTF8        HTML       Name (Cat)
'ü'  U+00FC  252    c3 bc       &uuml;     LATIN SMALL LETTER U WITH DIAERESIS (Lowercase_Letter)

Unicode code points (which are what UCNs encode) assign a single number to each Unicode character. They are the identifier for the character, not the encoding.

What you've written here is the UTF-8 encoding of ü. UTF-8 is a way of writing down Unicode code points. Except for ASCII values (0-127), the UTF-8 bytes are always very different from the code point's value. (UTF-8 is possibly the most clever and useful text encoding ever devised. But it is not trivial to understand.)

If you want to hand-encode UTF-8, then the \x syntax is correct. You can inject arbitrary bytes into a C string that way. Generally you should prefer the \u00fc syntax when expressing a character, however.

The reason your first byte seemed correct is that the UTF-8 encoding of Ã is c3 83. "c3" is the first byte of the UTF-8 encoding of many modified Latin characters. Seeing a lot of c3 bytes is an easy way to detect Western European UTF-8 text.

Rob Napier answered Nov 06 '25 22:11


As correctly explained by @RobNapier, the initial encoding posted in the question is incorrect and results in double encoding if the compiler uses UTF-8 to encode unicode escapes in 8-bit strings.

To ensure UTF-8 encoding on all platforms, you should indeed use hex escape sequences as in "D\xc3\xbcsseldorf". To avoid potential problems with subsequent characters that happen to be hex digits, you should split the string literal after the hex sequence:

    char city1[] = "D\xc3\xbc""sseldorf";
    char city2[] = "Saarbr\xc3\xbc""cken";

You could also use macros to avoid typos:

#define u_umlaut  "\xc3\xbc"

    char city1[] = "D" u_umlaut "sseldorf";
    char city2[] = "Saarbr" u_umlaut "cken";

This is only necessary if the source code does not already use UTF-8, or if the compiler is broken or misconfigured and converts the source character set to a different character set at compile time. With a modern, properly configured system, the source can be made more readable as:

    char city1[] = "Düsseldorf";
    char city2[] = "Saarbrücken";

chqrlie answered Nov 06 '25 21:11


