I have questions related to Unicode and printing escaped hexadecimal values in a const char*.

Say you want to write the Unicode string "abc𤭢def₤ghi", where the codepoint for 𤭢 is 0x24B62 and for ₤ is 0x00A3. You would compose the string as "abc\x24B62def\x00A3ghi". But \x takes up all the hex digits that follow it. So if you want to print "abc𤭢62", the string will be "abc\x24B6262", and won't the entire run be taken as one escaped value (\x24B6262)? How do I solve this? How do I print "abc𤭢62" and not "abc" followed by the single character \x24B6262?

Also, given

const char* tmp = "abc\x0fdef";

when I print it using printf("\n string = %s", tmp); it prints "abcdef". Where is the 0f here? I know the decimal value of \x0f, i.e. 15, will be stored in the string, so when we try to print, shouldn't 15 be printed? I expected "abc15def", but it prints only "abcdef".

I think you may be unfamiliar with the concept of encodings, from reading your post.
For instance, you say "the codepoint for ₤ is 0x00A3". That is true - Unicode codepoint U+00A3 is the pound sign. But 0x00A3 is not how you represent the pound sign in, for example, UTF-8 (a particularly common encoding of Unicode). Take a look at a UTF-8 conversion table to see what I mean. As you can see, the UTF-8 encoding of U+00A3 is the two bytes 0xc2, 0xa3 (in that order).
There are several things that happen between your call to printf() and when something appears on your screen.

First, your program runs the code printf("abc\x0fdef"), which means that the following bytes, in order, are written to your program's stdout:

0x61, 0x62, 0x63, 0x0f, 0x64, 0x65, 0x66
Note: I'm assuming your source code is ASCII (or UTF-8), which is very common. Technically, the interpretation of your source code's character set is implementation-defined, I believe.
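If you want to verify that those seven bytes, and nothing else, are what the string actually holds, you can dump each byte's numeric value instead of handing the bytes to the terminal. A minimal sketch (my own illustration, not part of your code):

#include <stdio.h>

int main(void) {
    const char *tmp = "abc\x0fdef";
    /* Print each byte numerically so the non-printing 0x0f is visible. */
    for (const char *p = tmp; *p != '\0'; ++p)
        printf("0x%02x ", (unsigned char)*p);
    printf("\n"); /* output: 0x61 0x62 0x63 0x0f 0x64 0x65 0x66 */
    return 0;
}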
Now, in order to see output, you will typically be running this program inside some kind of shell, and it has to eventually transform those bytes into visual characters. It does this by using an encoding. Again, something ASCII-compatible is common, such as UTF-8. On Windows, CP1252 is common.
And if that is the case, you get the following mapping:
0x61 - a
0x62 - b
0x63 - c
0x0f - the 'shift in' ASCII control code
0x64 - d
0x65 - e
0x66 - f
This prints out as "abcdef" because the 'shift in' control code is a non-printing character.
Note: The above can change depending on what exact character sets are involved, but ASCII or UTF-8 is very likely what you're dealing with unless you have an exotic setup.
If you have a UTF-8 compatible terminal, the following should print out "abc₤def", just as an example to get you started:
printf("abc\xc2\xa3def");
Make sense?
Update: To answer the question from your comment: you need to distinguish between a codepoint and the byte values for an encoding of that codepoint.
The Unicode standard defines 'codepoints', which are numerical values for characters. These are commonly written as U+XYZ where XYZ is a hexadecimal value. For instance, the character U+219e is LEFTWARDS TWO HEADED ARROW. This might also be written 0x219e. You would know from context that the writer is talking about a codepoint.
When you need to encode that codepoint (to print, or save to file, etc), you use an encoding, such as UTF-8. Note, if you used, for example, the UTF-32 encoding, every codepoint corresponds exactly to the encoded value. So in UTF-32, the codepoint U+219e would indeed be encoded simply as 0x219e. But other encodings will do things differently. UTF-8 will encode U+219e as the three bytes 0xE2 0x86 0x9E.
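To make the codepoint-to-bytes step concrete, here is a minimal sketch of a UTF-8 encoder (my own illustration, not from any library; it assumes it is given a valid scalar value below 0x110000):

#include <stdio.h>

/* Encode one Unicode codepoint as 1-4 UTF-8 bytes; returns the count. */
static int utf8_encode(unsigned long cp, unsigned char out[4]) {
    if (cp < 0x80) {            /* 7 bits:  0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {           /* 11 bits: 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {         /* 16 bits: 1110xxxx 10xxxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    /* 21 bits: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx */
    out[0] = (unsigned char)(0xF0 | (cp >> 18));
    out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
    out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
    out[3] = (unsigned char)(0x80 | (cp & 0x3F));
    return 4;
}

int main(void) {
    unsigned char buf[4];
    int n = utf8_encode(0x219E, buf);  /* LEFTWARDS TWO HEADED ARROW */
    for (int i = 0; i < n; ++i)
        printf("0x%02X ", buf[i]);     /* output: 0xE2 0x86 0x9E */
    printf("\n");
    return 0;
}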
Lastly, the \x notation is simply how you write arbitrary byte values inside a C/C++ quoted string. If I write, in C source code, "\xff", then that string in memory will be the two bytes 0xff 0x00 (since it automatically gets a null terminator).
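That detail of \x also answers your first question: \x keeps consuming hex digits for as long as they appear, so "abc\x24B6262" is read as one oversized escape rather than "abc\x24B62" followed by "62". The standard workaround is to split the string literal - adjacent literals are concatenated at compile time, and the closing quote terminates the escape. A sketch, using the UTF-8 bytes for U+24B62 and assuming a UTF-8 terminal:

printf("abc\xF0\xA4\xAD\xA2" "62"); /* prints abc𤭢62 */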