 

How to print escaped hexadecimal in a string in C++?

Tags:

hex

unicode

I have some questions about Unicode and printing escaped hexadecimal values in a const char*.

  1. From what I have understood, UTF-8 includes 2-, 3-, and 4-byte characters, ranging from the pound symbol to Kanji characters. Within string literals these are written as hexadecimal values using \u as the escape sequence. I have also understood that when using a hexadecimal escape in a string, every following character that can be part of the escape is consumed by it. For example, in "abc\x0f0dab" the whole run 0f0dab is treated as hex digits of the \x escape, even though I want only 0f0d to be considered.

Now suppose I want to write the Unicode string "abc𤭢def₤ghi", where the code point of 𤭢 is 0x24B62 and that of ₤ is 0x00A3. So I would have to compose the string as "abc0x24B62def0x00A3ghi". But the 0x will consume every following character that can be part of the hex value. So if I want to print "abc𤭢62", the string becomes "abc0x24B6262"; won't the entire run be taken as one 4-byte Unicode value (0x24B6262)? How do I solve this? How do I print "abc𤭢62" and not abc(0x24B6262)?

  2. I have a string const char* tmp = "abc\x0fdef";. When I print it using printf("\n string = %s", tmp); it prints abcdef. Where is the 0f here? I know the value of \x0f, i.e. decimal 15, is stored in the string, so when we print it, shouldn't 15 be printed? I mean, shouldn't it be "abc15def"? But it prints only "abcdef".
ebdo asked Sep 05 '25

1 Answer

From reading your post, I think you may be unfamiliar with the concept of encodings.

For instance, you say "unicode of ... ₤ is 0x00A3". That is true: the Unicode codepoint U+00A3 is the pound sign. But 0x00A3 is not how you represent the pound sign in, for example, UTF-8 (a particularly common encoding of Unicode). Look up U+00A3 in a UTF-8 conversion table to see what I mean: the UTF-8 encoding of U+00A3 is the two bytes 0xc2, 0xa3 (in that order).
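If you want to see those bytes for yourself, here's a minimal sketch that dumps each byte of a string in hex, so you can check encodings on your own machine:

#include <cstdio>

int main() {
    const char* pound = "\xc2\xa3";  // the UTF-8 encoding of U+00A3
    for (const char* p = pound; *p != '\0'; ++p)
        printf("0x%02x ", (unsigned char)*p);  // cast so high bytes don't sign-extend
    printf("\n");  // prints: 0xc2 0xa3
    return 0;
}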

There are several things that happen between your call to printf() and when something appears on your screen.

First, your program runs the code printf("abc\x0f" "def"). Note that I have split the literal in two: in "abc\x0fdef", the d, e, and f are themselves valid hex digits and would be consumed by the \x escape (this is exactly the greediness you describe in your first question). Writing two adjacent string literals, which the compiler concatenates, is how you end the escape where you want. With that fixed, the following bytes, in order, are written to stdout by your program:

0x61, 0x62, 0x63, 0x0f, 0x64, 0x65, 0x66

Note: I'm assuming your source code is ASCII (or UTF-8), which is very common. Technically, the interpretation of your source code's character set is implementation-defined, I believe.
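Just to drive home that escape notation is nothing more than a way of writing byte values, here's a sketch that writes those same seven bytes explicitly and produces identical output:

#include <cstdio>

int main() {
    // The same seven bytes as "abc\x0f" "def", spelled out:
    const unsigned char bytes[] = {0x61, 0x62, 0x63, 0x0f, 0x64, 0x65, 0x66};
    fwrite(bytes, 1, sizeof bytes, stdout);
    return 0;
}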

Now, in order to see output, you will typically be running this program inside some kind of shell, and it has to eventually transform those bytes into visual characters. It does this by using an encoding. Again, something ASCII-compatible is common, such as UTF-8. On Windows, CP1252 is common.

And if that is the case, you get the following mapping:

0x61 - a
0x62 - b 
0x63 - c
0x0f - the 'shift in' ASCII control code
0x64 - d
0x65 - e
0x66 - f

This prints out as "abcdef" because the 'shift in' control code is a non-printing character.

Note: The above can change depending on what exact character sets are involved, but ASCII or UTF-8 is very likely what you're dealing with unless you have an exotic setup.
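If you want to confirm that 0x0f really is a non-printing control code, the standard <cctype> classification functions will tell you; a quick sketch:

#include <cstdio>
#include <cctype>

int main() {
    printf("iscntrl(0x0f) = %d\n", iscntrl(0x0f) != 0);  // 1: it's a control code
    printf("isprint(0x0f) = %d\n", isprint(0x0f) != 0);  // 0: not printable
    printf("isprint('a')  = %d\n", isprint('a') != 0);   // 1: printable
    return 0;
}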

If you have a UTF-8 compatible terminal, the following should print out "abc₤def", just as an example to get you started (the literal is split again, so that the d is not swallowed by the \xa3 escape):

printf("abc\xc2\xa3" "def");

Make sense?


Update: To answer the question from your comment: you need to distinguish between a codepoint and the byte values for an encoding of that codepoint.

The Unicode standard defines 'codepoints', which are numerical values for characters. These are commonly written as U+XYZ, where XYZ is a hexadecimal value. For instance, the codepoint U+219e is LEFTWARDS TWO HEADED ARROW. This might also be written 0x219e. You would know from context that the writer is talking about a codepoint.

When you need to encode that codepoint (to print, or save to file, etc), you use an encoding, such as UTF-8. Note, if you used, for example, the UTF-32 encoding, every codepoint corresponds exactly to the encoded value. So in UTF-32, the codepoint U+219e would indeed be encoded simply as 0x219e. But other encodings will do things differently. UTF-8 will encode U+219e as the three bytes 0xE2 0x86 0x9E.
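To make the UTF-8 rules concrete, here's a minimal sketch of an encoder (simplified: no validation of surrogates or out-of-range input) that reproduces those three bytes for U+219e:

#include <cstdio>

// Encode one Unicode codepoint as UTF-8; returns the number of bytes written.
static int utf8_encode(unsigned int cp, unsigned char out[4]) {
    if (cp < 0x80) {            // 1 byte:  0xxxxxxx
        out[0] = (unsigned char)cp;
        return 1;
    } else if (cp < 0x800) {    // 2 bytes: 110xxxxx 10xxxxxx
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    } else if (cp < 0x10000) {  // 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    } else {                    // 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
}

int main() {
    unsigned char buf[4];
    int n = utf8_encode(0x219E, buf);  // LEFTWARDS TWO HEADED ARROW
    for (int i = 0; i < n; ++i)
        printf("0x%02X ", buf[i]);     // prints: 0xE2 0x86 0x9E
    printf("\n");
    return 0;
}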

Lastly, the \x notation is simply how you write arbitrary byte values inside a C/C++ quoted string. If I write, in C source code, "\xff", then that string in memory will be the two bytes 0xff 0x00 (since it automatically gets a null terminator).
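Tying this back to your first question: the \u and \U escapes take exactly four and eight hex digits respectively, so unlike \x they can never swallow the characters that follow them. A sketch, assuming your compiler's execution character set is UTF-8 (typical for GCC and Clang, but not guaranteed everywhere):

#include <cstdio>

int main() {
    // \U takes exactly eight hex digits, so the trailing "62" is unambiguous:
    printf("%s\n", "abc\U00024B62def\u00A3ghi");  // abc𤭢def₤ghi
    printf("%s\n", "abc\U00024B6262");            // abc𤭢62

    // With \x, split the literal to end the escape; adjacent string
    // literals are concatenated by the compiler:
    printf("%s\n", "abc\xF0\xA4\xAD\xA2" "62");   // the UTF-8 bytes of abc𤭢62
    return 0;
}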

jwd answered Sep 07 '25