Let's assume only the C99 Standard is available and the printf library function needs to be implemented according to this standard to work with UTF-16 encoding. Could you please clarify the expected behavior of the s conversion when a precision is specified?
The C99 Standard (7.19.6.1) says the following about the s conversion:
If no l length modifier is present, the argument shall be a pointer to the initial element of an array of character type. Characters from the array are written up to (but not including) the terminating null character. If the precision is specified, no more than that many bytes are written. If the precision is not specified or is greater than the size of the array, the array shall contain a null character.
If an l length modifier is present, the argument shall be a pointer to the initial element of an array of wchar_t type. Wide characters from the array are converted to multibyte characters (each as if by a call to the wcrtomb function, with the conversion state described by an mbstate_t object initialized to zero before the first wide character is converted) up to and including a terminating null wide character. The resulting multibyte characters are written up to (but not including) the terminating null character (byte). If no precision is specified, the array shall contain a null wide character. If a precision is specified, no more than that many bytes are written (including shift sequences, if any), and the array shall contain a null wide character if, to equal the multibyte character sequence length given by the precision, the function would need to access a wide character one past the end of the array. In no case is a partial multibyte character written.
I don't quite understand this paragraph in general and the statement "If a precision is specified, no more than that many bytes are written" in particular.
For example, let's take the UTF-16 string "TEST" (byte sequence: 0x54, 0x00, 0x45, 0x00, 0x53, 0x00, 0x54, 0x00).
What is expected to be written to the output buffer for various specified precisions?
There's also the statement "Wide characters from the array are converted to multibyte characters". Does it mean UTF-16 should be converted to UTF-8 first? That seems strange when I expect to work with UTF-16 only.
Converting a comment into a slightly expanded answer.
What is the value of CHAR_BIT in your implementation?
If CHAR_BIT == 8, you can't handle UTF-16 with %s; you'd use %ls and you'd pass a wchar_t * as the corresponding argument. You'd then have to read the second paragraph of the specification.
If CHAR_BIT == 16, then you can't have an odd number of octets in the data. You then need to know about how wchar_t relates to char (are they the same size? do they have the same signedness?) and interpret both paragraphs to come up with a uniform effect — unless you decided to have wchar_t represent UTF-32.
The key point is that UTF-16 cannot be handled as a C string if CHAR_BIT == 8 because there are too many useful characters that are encoded with one byte holding zero, but those zero bytes mark the end of a null-terminated string. To handle UTF-16, either the plain char type has to be a 16-bit (or larger) type (so CHAR_BIT > 8), or you have to use wchar_t (and sizeof(wchar_t) > sizeof(char)).
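A minimal sketch of both points, assuming CHAR_BIT == 8, the little-endian byte order from the question, and a UTF-8 locale (the locale name and the resulting multibyte encoding are implementation details, not guarantees):

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Assumes CHAR_BIT == 8 and that a UTF-8 locale is available;
       both the locale name and the output encoding are assumptions. */
    setlocale(LC_ALL, "en_US.UTF-8");

    /* UTF-16LE bytes for "TEST": every other byte is zero. */
    char raw[] = { 0x54, 0x00, 0x45, 0x00, 0x53, 0x00, 0x54, 0x00 };
    printf("%s\n", raw);       /* prints only "T": the first zero byte ends the C string */

    /* The wide-character route: a wchar_t string and the %ls conversion. */
    wchar_t wide[] = L"TEST";
    printf("%ls\n", wide);     /* prints "TEST": wide chars are converted to multibyte output */

    return 0;
}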
Note that the specification expects that wide characters will be converted to a suitable multibyte representation.
If you want wide character output natively, you have to use fwprintf() and the related functions from <wchar.h>, first defined in C99. The specification there has a lot in common with the specification of fprintf(), but there are (unsurprisingly) important differences; a small sketch follows the quoted text below.
7.29.2.1 The fwprintf function
…
s
If no l length modifier is present, the argument shall be a pointer to the initial element of a character array containing a multibyte character sequence beginning in the initial shift state. Characters from the array are converted as if by repeated calls to the mbrtowc function, with the conversion state described by an mbstate_t object initialized to zero before the first multibyte character is converted, and written up to (but not including) the terminating null wide character. If the precision is specified, no more than that many wide characters are written. If the precision is not specified or is greater than the size of the converted array, the converted array shall contain a null wide character.
If an l length modifier is present, the argument shall be a pointer to the initial element of an array of wchar_t type. Wide characters from the array are written up to (but not including) a terminating null wide character. If the precision is specified, no more than that many wide characters are written. If the precision is not specified or is greater than the size of the array, the array shall contain a null wide character.
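For illustration, here is a rough sketch of that wide-stream variant; the empty setlocale() argument and writing to stdout as a wide-oriented stream are assumptions, and byte and wide output must not be mixed on the same stream:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    setlocale(LC_ALL, "");               /* assumed: the environment selects a usable locale */

    wchar_t text[] = L"TEST";
    /* With fwprintf, %ls writes the wide characters directly to a
       wide-oriented stream; the precision counts wide characters, not bytes. */
    fwprintf(stdout, L"%.2ls\n", text);  /* writes the two wide characters 'T' and 'E' */
    return 0;
}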
wchar_t is not meant to be used for UTF-16, only for implementation-defined fixed-width encodings depending on the current locale. There's simply no sane way to support a variable-length encoding with the wide character API. Likewise, the multi-byte representation used by functions like printf or wcrtomb is implementation-defined. If you want to write portable code using Unicode, you can't rely on the wide character API. Use a library or roll your own code.
To answer your question: fprintf with the l modifier accepts a wide character string in the implementation-defined encoding specified by the current locale. If wchar_t is 16 bits, this encoding might be a bastardization of UTF-16, but as I mentioned above, there's no way to properly support UTF-16 surrogates. This wchar_t string is then converted to a multi-byte char string in an implementation-defined encoding. This might or might not be UTF-8. The specified precision limits the number of chars in the output string with the added restriction that no partial multi-byte characters are written.
Here's an example. Let's assume that the wide character encoding is UTF-32 with 32-bit wchar_t and that the multi-byte encoding is UTF-8 (like on Linux with an appropriate locale). The following code
wchar_t w[] = { 0x1F600, 0 }; // U+1F600 GRINNING FACE
printf("%.3ls", w);
will print nothing at all since the resulting UTF-8 sequence has four bytes. Only if you specify a precision of at least four
printf("%.4ls", w);
will the character be printed.
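As a complete program this might look as follows; the UTF-8 locale name and the assumption that wchar_t holds UTF-32 code points are taken from the example above and may not hold on every implementation:

#include <locale.h>
#include <stdio.h>
#include <wchar.h>

int main(void)
{
    /* Assumes a UTF-8 locale is available and that wchar_t is 32 bits wide. */
    setlocale(LC_ALL, "en_US.UTF-8");

    wchar_t w[] = { 0x1F600, 0 };   /* U+1F600 GRINNING FACE */
    printf("[%.3ls]\n", w);         /* prints "[]": the 4-byte UTF-8 sequence does not fit in 3 bytes */
    printf("[%.4ls]\n", w);         /* prints the bracketed emoji: the whole character fits */
    return 0;
}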
EDIT: To answer your second question, no, printf should never write a null character. The sentence only means that in certain cases, a null character is required to specify the end of the string and avoid buffer over-reads.
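In other words, the precision lets you print from a char array that is not null terminated at all, as in this small sketch:

#include <stdio.h>

int main(void)
{
    char buf[4] = { 'T', 'E', 'S', 'T' };   /* no terminating null character */

    /* The precision .4 stops printf after 4 bytes, so it never reads past
       the end of the array looking for a terminator, and it never writes one. */
    printf("%.4s\n", buf);                  /* prints "TEST" */
    return 0;
}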