How do you convert a single surrogate character without a pair into its UTF-8 equivalent?

Question

I am experimenting with wctomb to convert a wchar_t into its UTF-8 equivalent stored in a char[]. It works nicely, except for surrogate code points in the range U+D800 to U+DFFF.

#include <locale.h>
#include <stdlib.h>

int main(void)
{
    int ret;
    // zero-initialized, so the result stays null-terminated
    // VS gives a warning on wctomb() for buffer overrunning on char mb[4]={0} for some reason ...
    char mb[5] = { 0 };
    setlocale(LC_ALL, "en-US.utf8");
    // Gives 0xE2 0xAA 0x96 just fine, wctomb returns 3
    ret = wctomb(mb, L'\x2A96');
    // expected 0xED 0xBA 0xA0, but wctomb returns -1, i.e. invalid character
    ret = wctomb(mb, L'\xDEA0');
    return 0;
}

Is there another way to get the UTF-8 form of the surrogate character alone?
I also tried wctomb_s through errno_t and &ret but it just yields the same outcome ...

chux - Reinstate Monica · Accepted Answer

Is there another way to get the UTF-8 form of the surrogate character alone?

No.
A single UTF-16 surrogate has no proper UTF-8 equivalent. A UTF-8 byte sequence that encodes a lone surrogate must be treated as an invalid byte sequence.

If the source string lacks the proper pair (high surrogate, then low surrogate), then no proper UTF-8 equivalent exists.
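As an illustration of what a proper pair looks like, here is a minimal sketch that combines a high surrogate and a low surrogate into one code point, which can then be encoded as UTF-8. The helper name combine_surrogates is hypothetical and not part of any standard library.

```c
#include <stdint.h>

/* Hypothetical helper: combine a UTF-16 surrogate pair into one code point.
   Returns 0 on success, or -1 if the input is not a high surrogate
   (U+D800..U+DBFF) followed by a low surrogate (U+DC00..U+DFFF). */
int combine_surrogates(uint16_t hi, uint16_t lo, uint32_t *cp)
{
    if (hi < 0xD800 || hi > 0xDBFF)
        return -1;                      /* first unit is not a high surrogate */
    if (lo < 0xDC00 || lo > 0xDFFF)
        return -1;                      /* second unit is not a low surrogate */

    /* Each surrogate carries 10 bits; the pair addresses U+10000..U+10FFFF. */
    *cp = 0x10000u + ((((uint32_t)hi - 0xD800u) << 10) | ((uint32_t)lo - 0xDC00u));
    return 0;
}
```

For example, the pair 0xD83D, 0xDE00 combines to U+1F600. The value 0xDEA0 from the question is a low surrogate, so without a preceding high surrogate there is nothing to combine it with.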

Rather than pass along the ill-formed data, consider detecting it and returning an error indication.
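One way to do that detection, as a rough sketch: a hand-written encoder that rejects surrogate code points up front instead of emitting bytes for them. The function name utf8_encode and its return convention are assumptions made for this example, not a standard API.

```c
#include <stdint.h>

/* Hypothetical helper: encode one Unicode code point as UTF-8.
   out must have room for 4 bytes. Returns the number of bytes written,
   or -1 if cp is a surrogate (U+D800..U+DFFF) or above U+10FFFF. */
int utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp >= 0xD800 && cp <= 0xDFFF)
        return -1;                      /* lone surrogate: report an error */
    if (cp > 0x10FFFF)
        return -1;                      /* outside the Unicode range */

    if (cp < 0x80) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}
```

With this sketch, utf8_encode(0x2A96, out) returns 3 and writes 0xE2 0xAA 0x96, while utf8_encode(0xDEA0, out) returns -1, which mirrors what wctomb reports in the question.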

Tags:

c

utf-8

c99

surrogate-pairs

amegyoushi

1 Answers

chux - Reinstate Monica