From this answer I learned that in C++17 we can open a std::fstream using a UTF-8 path via std::filesystem::u8path. But in C++20 this function is deprecated, and we are supposed to pass a const char8_t* to the std::filesystem::path constructor instead.
Here comes the problem: although we can legally convert (via reinterpret_cast) any pointer to const char*, we can't go the other way, from const char* to e.g. const char8_t* (it would break the strict aliasing rules). So if we have some external API returning a char-based UTF-8 representation of the filename (e.g. from a library written in C), we can't safely convert the pointer to a char8_t-based one.
So, how are we supposed to convert such a char-based view of a UTF-8 string to a char8_t-based view of it?
Disclaimer: I'm the author of the P0482 proposal that introduced char8_t and deprecated u8path.
Your observations are correct; it is not permissible to use reinterpret_cast to produce a char8_t pointer to a sequence of char objects. This is discussed further at https://stackoverflow.com/a/57453713/11634221.
Though std::filesystem::u8path has been deprecated in C++20, there are no plans for its imminent removal; you can continue to use it. Further, P1423 corrects an unintended consequence of the changes in P0482 and permits it to be called with ranges of both char and char8_t in C++20. As far as I'm aware, no implementors have annotated std::filesystem::u8path as deprecated (I don't know if any plan to do so).
There is no (well-formed) way to produce a char8_t pointer based view of a sequence of char. It is possible to write a range/iterator adapter that, internally, converts the individual char values to char8_t on iterator dereference. Such an adapter could satisfy the C++17 and C++20 random access iterator requirements for a non-mutable iterator (it can't satisfy the requirements for a mutable iterator because the dereference operation wouldn't be able to provide an lvalue, nor could it satisfy the requirements for a contiguous iterator). Such an adapter would suffice for calls to the std::filesystem::path constructors that accept ranges. Hmm, this might be a useful enough adapter to add to https://github.com/tahonermann/char8_t-remediation.
An alternative to a view over the underlying char data is, of course, to copy it, but I can appreciate why doing so might be considered undesirable (we already tend to do a lot of copying when working with std::filesystem::path).
From this character types reference about char8_t:
It has the same size, signedness, and alignment as unsigned char (and, therefore, the same size and alignment as char and signed char), but is a distinct type.
Because it's a distinct type, you cannot convert from const char* to const char8_t* without breaking strict aliasing. But for all practical purposes, since char8_t is basically an unsigned char, you can use reinterpret_cast to convert the pointer. It's wrong as far as the standard is concerned, but it will work.
For proper correctness either use char8_t to begin with, or copy the original characters into a char8_t buffer (or std::u8string).