How to convert from utf-16 to utf-32 on Linux with std library?

On MSVC, converting UTF-16 to UTF-32 is easy, thanks to C++11's codecvt_utf16 locale facet. But in GCC (gcc (Debian 4.7.2-5) 4.7.2) this feature apparently hasn't been implemented yet. Is there a way to perform such a conversion on Linux without iconv (preferably using the conversion facilities of the standard library)?
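
For reference, the approach the question alludes to looks roughly like the sketch below. This is a minimal sketch, assuming a standard library that actually ships <codecvt> (libstdc++ did not at the time; the facet was later deprecated in C++17), little-endian input, and a made-up function name utf16_to_utf32.

#include <codecvt>
#include <locale>
#include <string>

std::u32string utf16_to_utf32(const std::u16string &s)
{
    // codecvt_utf16 converts between a UTF-16 *byte* sequence and UCS-4,
    // so the char16_t buffer is handed over as raw bytes.
    std::wstring_convert<
        std::codecvt_utf16<char32_t, 0x10ffff, std::little_endian>,
        char32_t> conv;
    const char *bytes = reinterpret_cast<const char *>(s.data());
    return conv.from_bytes(bytes, bytes + s.size() * sizeof(char16_t));
}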

Asked May 28 '14 by Al Berger

People also ask

Is Unicode 16-bit or 32 bit?

Unicode uses two main encoding forms, 8-bit and 16-bit, chosen according to the type of the data being encoded. The default encoding form is 16-bit, where each character is 16 bits (2 bytes) wide. Code points are usually written as U+hhhh, where hhhh is the code point in hexadecimal.

What is UTF-8 UTF-16 utf32?

UTF-8 requires 8, 16, 24 or 32 bits (one to four bytes) to encode a Unicode character, UTF-16 requires either 16 or 32 bits to encode a character, and UTF-32 always requires 32 bits to encode a character.
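
A quick concrete illustration (a small sketch, assuming a C++11 compiler with the usual 2-byte char16_t and 4-byte char32_t): the same two characters take up a different number of bytes in each form.

#include <iostream>

int main()
{
    // U+00E9 (é) and U+1F600 (an emoji); sizeof is in bytes and includes the
    // terminator (1, 2 or 4 bytes), which is subtracted away.
    std::cout << sizeof(u8"\u00E9") - 1 << " and " << sizeof(u8"\U0001F600") - 1
              << " bytes in UTF-8\n";     // prints 2 and 4
    std::cout << sizeof(u"\u00E9") - 2 << " and " << sizeof(u"\U0001F600") - 2
              << " bytes in UTF-16\n";    // prints 2 and 4
    std::cout << sizeof(U"\u00E9") - 4 << " and " << sizeof(U"\U0001F600") - 4
              << " bytes in UTF-32\n";    // prints 4 and 4
    return 0;
}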

What is difference between UTF-8 and UTF-16?

UTF-8 encodes a character into a binary string of one, two, three, or four bytes. UTF-16 encodes a Unicode character into a string of either two or four bytes. This distinction is evident from their names. In UTF-8, the smallest binary representation of a character is one byte, or eight bits.

How do you convert UTF-16 to UTF-8 in C++?

To convert UTF-8 to UTF-16, call Utf32To16(Utf8To32(str)); to convert UTF-16 to UTF-8, call Utf32To8(Utf16To32(str)).


1 Answer

Decoding UTF-16 into UTF-32 is extremely easy.

You may want to detect at compile time which standard library version you're building against, and fall back to your own conversion routine if you detect one that lacks the facilities you need.
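
For example (a rough sketch; the cutoff is an assumption based on the facet first shipping in libstdc++ around GCC 5, and a real project might prefer a configure-time probe):

// Use the library facet where it exists, otherwise the hand-rolled routine.
#if defined(__GNUC__) && !defined(__clang__) && __GNUC__ < 5
#  define HAVE_CODECVT_UTF16 0
#else
#  define HAVE_CODECVT_UTF16 1
#endif

#if HAVE_CODECVT_UTF16
#  include <codecvt>   // std::codecvt_utf16 et al.
#endif
// ...otherwise compile in convert_utf16_to_utf32() from below.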

Inputs:

  • a pointer to the source UTF-16 data (char16_t *, ushort *, ...; UTF16 * for convenience);
  • its size, in UTF-16 code units;
  • a pointer to the destination UTF-32 buffer (char32_t *, uint *, ...; UTF32 * for convenience).

The code looks like this²:

void convert_utf16_to_utf32(const UTF16 *input,
                            size_t input_size,
                            UTF32 *output)
{
    const UTF16 * const end = input + input_size;
    while (input < end) {
        const UTF16 uc = *input++;
        if (!is_surrogate(uc)) {
            *output++ = uc;
        } else {
            if (is_high_surrogate(uc) && input < end && is_low_surrogate(*input))
                *output++ = surrogate_to_utf32(uc, *input++);
            else {
                // ERROR: lone or mismatched surrogate (see below)
            }
        }
    }
}

Error handling is left open (the sketch above simply skips a lone or mismatched surrogate). You might want to insert a U+FFFD¹ into the stream and keep going, or just bail out; that's really up to you. The auxiliary functions are trivial:

// True for any UTF-16 code unit in the surrogate range U+D800..U+DFFF.
int is_surrogate(UTF16 uc) { return (uc - 0xd800u) < 2048u; }
// True for a high (leading) surrogate, U+D800..U+DBFF.
int is_high_surrogate(UTF16 uc) { return (uc & 0xfffffc00) == 0xd800; }
// True for a low (trailing) surrogate, U+DC00..U+DFFF.
int is_low_surrogate(UTF16 uc) { return (uc & 0xfffffc00) == 0xdc00; }

// Combines a surrogate pair into a code point; the magic constant folds
// together (0xd800 << 10) + 0xdc00 - 0x10000.
UTF32 surrogate_to_utf32(UTF16 high, UTF16 low) {
    return (high << 10) + low - 0x35fdc00;
}
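
Putting it together, a hypothetical usage example (the typedefs stand in for the "for convenience" aliases mentioned above; the input holds the surrogate pair D801 DC37, i.e. U+10437):

#include <stdint.h>
#include <stdio.h>

typedef uint16_t UTF16;
typedef uint32_t UTF32;

// ...is_surrogate(), surrogate_to_utf32() and convert_utf16_to_utf32()
// as defined above...

int main(void)
{
    const UTF16 in[] = { 0x0041, 0xD801, 0xDC37, 0x007A };  // 'A', U+10437, 'z'
    UTF32 out[4] = { 0 };
    convert_utf16_to_utf32(in, 4, out);
    // The 4 UTF-16 units decode to 3 code points: U+0041 U+10437 U+007A.
    for (int i = 0; i < 3; ++i)
        printf("U+%04X\n", (unsigned)out[i]);
    return 0;
}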

¹ Cf. Unicode:

  • § 3.9 Unicode Encoding Forms (Best Practices for Using U+FFFD)
  • § 5.22 Best Practice for U+FFFD Substitution

² Also consider that the !is_surrogate(uc) branch is by far the most common (as is the non-error path in the second if); you might want to optimize it with __builtin_expect or similar.
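
For instance (a sketch only: __builtin_expect is a GCC/Clang extension, the function name is made up, and C++20's [[likely]]/[[unlikely]] attributes are the portable alternative):

void convert_utf16_to_utf32_hinted(const UTF16 *input,
                                   size_t input_size,
                                   UTF32 *output)
{
    const UTF16 * const end = input + input_size;
    while (input < end) {
        const UTF16 uc = *input++;
        if (__builtin_expect(!is_surrogate(uc), 1)) {
            *output++ = uc;   // the hot path: any non-surrogate code unit
        } else if (__builtin_expect(is_high_surrogate(uc) && input < end
                                    && is_low_surrogate(*input), 1)) {
            *output++ = surrogate_to_utf32(uc, *input++);
        } else {
            // ERROR: same choices as above (emit U+FFFD, bail out, ...)
        }
    }
}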

Answered Sep 19 '22 by peppe