Why does MSVC's std::print corrupt long unicode strings when printing with utf-8?

If I compile the following code in Visual Studio with the /utf-8 flag enabled:

#include <print>

int main() {
    std::println("{}", "▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊");
}

I get the following in my console:

▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊���▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊��▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊▊


I have no idea what's happening here. Maybe internal buffering is splitting the code units mid-codepoint and causing the decoder to become desynchronized?

EDIT: Just for Mr. Kanavos.


There is in fact a problem here.

asked Oct 29 '25 by Chris_F


1 Answer

This is an open issue in Windows Terminal.

It occurs when the encoding for a single codepoint is split across the end of one output operation and the beginning of the next, which can happen because of buffering.
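As a rough illustration (my own minimal sketch, not code from the bug report), you can reproduce the same symptom on an affected console host by forcing a flush between the bytes of a single character:

#include <cstdio>

int main() {
    // U+258A "▊" is E2 96 8A in UTF-8; split it across two write operations.
    std::fwrite("\xE2\x96", 1, 2, stdout); // lead byte plus first continuation byte
    std::fflush(stdout);                   // force a separate write to the console
    std::fwrite("\x8A\n", 1, 2, stdout);   // trailing continuation byte
    std::fflush(stdout);
}

On an affected host, the block character comes out as one or more replacement characters instead of "▊".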

The decoder doesn't maintain the state necessary to finish decoding the incomplete character from the previous write. There are some additional wrinkles, but that's the gist of the problem. The folks responding to the bug report seemed to understand those wrinkles, so I didn't look into them further.

Unicode replacement characters are emitted for the continuation bytes at the beginning of the second write operation until the decoder resyncs. Offhand, I don't remember if there's also a replacement character for the stranded head at the end of the previous write.

Clarification: The issue is reported in the Windows Terminal repo and known to that team, but the root problem does not lie in the Windows Terminal code.

I re-read several of the related bug reports and this is my understanding:

The console layer forwards output to Terminal through one of a few APIs; there is (at least) a byte-oriented one and a wide-character-oriented one. The loss of decoding state across writes happens in the console layer, which uses MultiByteToWideChar (or an equivalent) to convert the UTF-8 stream to wide characters before passing it on to Terminal.
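To see why the statelessness matters, here's a hypothetical sketch (my own, not the actual console code) of what chunk-by-chunk conversion with MultiByteToWideChar does to bytes that straddle a boundary:

#include <windows.h>
#include <cstdio>

int main() {
    // Two "▊" (E2 96 8A) with the second one split across the chunk boundary.
    const char chunk1[] = "\xE2\x96\x8A\xE2\x96"; // ends mid-codepoint
    const char chunk2[] = "\x8A\xE2\x96\x8A";     // begins with a stray continuation byte

    wchar_t out[16];
    // Each call decodes its chunk in isolation; there is no way to carry the
    // dangling lead bytes of chunk1 into the call that handles chunk2.
    int n1 = MultiByteToWideChar(CP_UTF8, 0, chunk1, sizeof chunk1 - 1, out, 16);
    for (int i = 0; i < n1; ++i) std::printf("U+%04X ", static_cast<unsigned>(out[i]));
    std::printf("\n");
    int n2 = MultiByteToWideChar(CP_UTF8, 0, chunk2, sizeof chunk2 - 1, out, 16);
    for (int i = 0; i < n2; ++i) std::printf("U+%04X ", static_cast<unsigned>(out[i]));
    std::printf("\n");
    // Typically prints U+258A followed by U+FFFD, then U+FFFD U+258A:
    // the straddling codepoint is lost and replacement characters appear instead.
}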

If I understand correctly, UTF-8 passed directly to Terminal's byte-oriented interface should be decoded properly, even if the encoding of a single codepoint straddles two calls. I'm under the impression that simply updating the console layer to use the byte-oriented interface would break backward compatibility for some MBCS or DBCS code pages. A backward compatible fix would be more work, and thus I assume it hasn't been prioritized.

Unlike Terminal, the console layer source code is not public.

An application could work around the problem by doing its own buffering to ensure that a codepoint is never split across output operations. That's obviously not as convenient as relying on your language's buffered I/O.
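A rough sketch of that workaround (the helper names here are my own, not a library API): accumulate output in a carry buffer and pass only complete UTF-8 sequences to each write, holding back an incomplete trailing codepoint for the next call.

#include <cstddef>
#include <cstdio>
#include <string>
#include <string_view>

// Length of the longest prefix of `s` that does not end in the middle of a
// multi-byte UTF-8 sequence.
std::size_t utf8_complete_prefix(std::string_view s) {
    std::size_t i = s.size();
    std::size_t back = 0;
    // Walk back over trailing continuation bytes (10xxxxxx), at most three.
    while (i > 0 && back < 3 && (static_cast<unsigned char>(s[i - 1]) & 0xC0) == 0x80) {
        --i;
        ++back;
    }
    if (i == 0) return s.size(); // nothing recognizable; pass it through
    unsigned char lead = static_cast<unsigned char>(s[i - 1]);
    std::size_t need = (lead >= 0xF0) ? 4 : (lead >= 0xE0) ? 3 : (lead >= 0xC0) ? 2 : 1;
    return (back + 1 >= need) ? s.size() : i - 1; // cut before an incomplete lead byte
}

// Appends `s` to the carry buffer and writes only complete codepoints,
// keeping any incomplete tail around for the next call.
void write_utf8_chunk(std::string& carry, std::string_view s) {
    carry.append(s);
    std::size_t n = utf8_complete_prefix(carry);
    std::fwrite(carry.data(), 1, n, stdout);
    std::fflush(stdout);
    carry.erase(0, n);
}

At the end of output, anything still sitting in the carry buffer is malformed UTF-8 anyway and can be written as-is or dropped.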

answered Nov 01 '25 by Adrian McCarthy


