
How do you convert a single surrogate character without a pair into its UTF-8 equivalent?

I am experimenting with wctomb to convert a wchar_t into its UTF-8 equivalent stored in a char[]. It works nicely, except for surrogate code units in the range U+D800 to U+DFFF.

#include <locale.h>
#include <stdlib.h>

int main(void)
{
    int ret;
    // null-terminated; VS warns about a possible wctomb() buffer overrun with char mb[4] = {0}
    char mb[5] = { 0 };
    setlocale(LC_ALL, "en-US.utf8");
    // Gives 0xE2 0xAA 0x96 just fine, wctomb returns 3
    ret = wctomb(mb, L'\x2A96');
    // expected 0xED 0xBA 0xA0, but wctomb returns -1, i.e. invalid character
    ret = wctomb(mb, L'\xDEA0');
    return 0;
}

Is there another way to get the UTF-8 form of the surrogate character alone?
I also tried wctomb_s (checking its errno_t result and passing &ret), but it yields the same outcome.

amegyoushi asked Dec 02 '25 01:12


1 Answer

Is there another way to get the UTF-8 form of the surrogate character alone?

No.
Single UTF-16 surrogates have no proper UTF-8 equivalent. A UTF-8-style encoding of a lone surrogate (such as 0xED 0xBA 0xA0) must be treated as an invalid byte sequence.

If the source string lacks the proper pair (high surrogate, then low surrogate), then no proper UTF-8 equivalent exists.

Rather than pass along the ill-formed data, consider detecting it and returning an error indication.
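
As a rough illustration of detecting the problem instead of encoding it, here is a minimal sketch of a hand-written UTF-16 to UTF-8 converter. The name utf16_to_utf8, its signature, and the choice of -1 as the error value are assumptions made for this example, not part of any standard library:

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper (not a standard library function): converts n UTF-16
   code units to UTF-8. Returns the number of bytes written, or -1 if the
   input contains an unpaired surrogate or the output buffer is too small. */
static long utf16_to_utf8(const uint16_t *in, size_t n,
                          unsigned char *out, size_t cap)
{
    size_t o = 0;
    for (size_t i = 0; i < n; i++) {
        uint32_t cp = in[i];
        if (cp >= 0xD800 && cp <= 0xDBFF) {
            /* High surrogate: must be followed by a low surrogate. */
            if (i + 1 >= n || in[i + 1] < 0xDC00 || in[i + 1] > 0xDFFF)
                return -1;
            cp = 0x10000 + ((cp - 0xD800) << 10) + (in[i + 1] - 0xDC00u);
            i++; /* the low surrogate has been consumed */
        } else if (cp >= 0xDC00 && cp <= 0xDFFF) {
            return -1; /* low surrogate with no preceding high surrogate */
        }

        /* Number of UTF-8 bytes needed for this scalar value. */
        size_t len = cp < 0x80 ? 1 : cp < 0x800 ? 2 : cp < 0x10000 ? 3 : 4;
        if (cap - o < len)
            return -1;

        if (len == 1) {
            out[o++] = (unsigned char)cp;
        } else if (len == 2) {
            out[o++] = (unsigned char)(0xC0 | (cp >> 6));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else if (len == 3) {
            out[o++] = (unsigned char)(0xE0 | (cp >> 12));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        } else {
            out[o++] = (unsigned char)(0xF0 | (cp >> 18));
            out[o++] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (cp & 0x3F));
        }
    }
    return (long)o;
}
```

With this sketch, the single unit 0x2A96 converts to the three bytes 0xE2 0xAA 0x96, while the lone unit 0xDEA0 makes the function return -1, so the caller can report an error or substitute U+FFFD rather than emit ill-formed bytes.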

chux - Reinstate Monica answered Dec 03 '25 14:12


