
Why does `wprintf` replace non-ASCII characters with question mark `?` characters?

Tags: c, unicode, utf-16

When running the following program (on Linux):

#include <stdio.h>
#include <wchar.h>

int main() {
    wprintf(L"Hej hopp: %lc\n", L'ä');
}

...the program outputs Hej hopp: ?, instead of the expected Hej hopp: ä output. Why is this?

asked Nov 18 '25 07:11 by Per Lundberg


2 Answers

The answer is hinted at in the man page for the glibc setlocale(3) function:

If locale is an empty string, "", each part of the locale that should be modified is set according to the environment variables. The details are implementation-dependent. For glibc, first (regardless of category), the environment variable LC_ALL is inspected, next the environment variable with the same name as the category (see the table above), and finally the environment variable LANG. The first existing environment variable is used.

[...]

On startup of the main program, the portable "C" locale is selected as default. A program may be made portable to all locales by calling:

setlocale(LC_ALL, "");

after program initialization [...]

In other words, if you edit the program like this:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main() {
    setlocale(LC_ALL, "");
    wprintf(L"Hej hopp: %lc\n", L'ä');
}

...it will produce the expected output: Hej hopp: ä (presuming that LANG or one of the other locale-related environment variables is set to en_US.UTF-8 or another Unicode-supporting locale, which is often the case on a modern GNU/Linux-based system).

(Credit goes to this SO answer, which helped me figure this out: https://stackoverflow.com/a/10760434/227779)


As a side note inspired by a comment from Luis Colorado, if all you want to do is print a wide character, you don't even need wprintf; the normal printf function works equally well for that (as long as setlocale has been called):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main() {
    setlocale(LC_ALL, "");
    printf("Hej hopp: %lc\n", L'ä');
}

answered Nov 19 '25 22:11 by Per Lundberg


To include non-ASCII characters in your source code, the compiler has to interpret ä as a valid character, which means it has to know the encoding of the source file. Normally the source encoding is whatever your environment uses (some Windows code page, say, or most often UTF-8 on Linux). This makes your code non-portable, because the meaning of the literal depends on the encoding assumed when compiling. Say you compile with UTF-8: the call to wprintf() then contains code point U+00E4, and the character prints correctly (it does on my system) provided that the locale used for output and your output device (the terminal you run the program in) agree with the encoding used for compiling.

You are dealing with a complex problem, because the locale used for compiling and the locale used for output once the program is compiled can differ. The terminal you print to must also support your character set. If all of these match, your output will be right.

Writing \U000000e4 in the character literal guarantees that the character is always the one intended. This still ties your code to Unicode, although the source encoding no longer matters, since a universal character name denotes the same character whatever encoding the file is saved in. It can still be a problem if your output locale is set to, e.g., ISO8859-1.

BTW, I see you tagged your question utf-16, which I presume is not the locale you use, mainly because virtually all terminals support UTF-8 but very few support UTF-16 directly.

GCC and Clang both accept UTF-8 encoded source (I've tested this), and the C library (glibc) converts from UTF-8 to wchar_t and vice versa. If you are dealing with UTF-16, though, it will be necessary to know your exact situation.

The most probable case, I guess, is that you never initialized the locale in your program, or that it defaulted to C or POSIX, or to some other locale that uses an ISO8859-something encoding.

You can check your locale (the one you are using) with:

$ locale
LANG=es_ES.UTF-8
LC_CTYPE="es_ES.UTF-8"
LC_COLLATE="es_ES.UTF-8"
LC_TIME="es_ES.UTF-8"
LC_NUMERIC="es_ES.UTF-8"
LC_MONETARY="es_ES.UTF-8"
LC_MESSAGES="es_ES.UTF-8"
LC_ALL=
$ _

This gives output like the above, which is what I get on my system.

To solve your problem, give the compiler the character values in the encoding the target supports, which makes the code independent of the compiler's source encoding. If you write \U000000e4 for the Unicode character set, compiling no longer depends on the encoding of the source file; if you write the literal character, the compiler has to understand the encoding the file was saved in. Say you compile on an IBM machine with EBCDIC encoding: if your code is meant to run with the Unicode character set, the resulting codes will differ when you express them as the visual representations of the characters rather than as the actual code points.

BTW, and lastly: your code compiles and runs fine (that is, it prints the same character that appears in the source) when compiled on FreeBSD and also run on FreeBSD. The compiler there is Clang, not GCC, and the example works fine.

IMHO, GCC receives something weird from the conversion routines because of an incorrect encoding, but it's not clear where the failure is. There are three places to look: the encoding of your source file (how the compiler decodes characters when reading the source is implementation-dependent), the stdio library, which must be Unicode-capable (this may not be the case), and the output terminal, which must support the characters.

GCC states in its documentation (Sect. 4.16 of the GCC manual) that these behaviours depend on the implementation of the C library, which is not part of GCC itself. IMHO, GCC should at least document how it interprets string and character literals, and whether it actually uses the glibc library to read and interpret them, because it neither complains when extended characters are used in literals nor specifies how those extended characters are interpreted.

answered Nov 19 '25 23:11 by Luis Colorado


