
Unicode strings in process memory

What is the preferred in-memory format for Unicode strings while they are being processed, and why?

I am implementing a programming language by producing an executable file image for it. Obviously a working programming language implementation requires a protocol for processing strings.

I've thought about using dynamic arrays as the basis for strings because they are very simple to implement and very efficient for short strings. I just have no idea about the best possible format for characters when using strings in this manner.

Cheery asked Nov 28 '25 22:11


1 Answer

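UTF-16 is the most widely used in-memory format: the Windows API, Java, .NET and JavaScript all represent strings as sequences of 16-bit code units.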

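The advantage of UTF-16 over UTF-8 is that, despite being less compact for mostly-ASCII text, every character has a constant size of 2 bytes (16 bits), as long as you don't use surrogates. Characters outside the Basic Multilingual Plane need a surrogate pair, i.e. two 16-bit units. (When you stick to 2-byte characters only, the encoding is called UCS-2.)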
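As a rough illustration of the constant-size claim and where it stops holding, the sketch below counts 16-bit code units using Python's built-in codecs. The helper name `utf16_units` is made up for this example and is not part of any library.

```python
def utf16_units(text: str) -> int:
    """Return how many 16-bit code units `text` occupies in UTF-16."""
    # "utf-16-le" writes no byte order mark, so bytes / 2 == code units.
    return len(text.encode("utf-16-le")) // 2


# Characters inside the Basic Multilingual Plane take one unit each.
bmp_samples = ["A", "\u00e9", "\u65e5"]   # U+0041, U+00E9, U+65E5
bmp_units = [utf16_units(ch) for ch in bmp_samples]

# A character outside the BMP is stored as a surrogate pair (two units).
astral_units = utf16_units("\U0001F600")  # U+1F600

print(bmp_units, astral_units)
```

So indexing by code unit only equals indexing by character if the implementation either rejects characters above U+FFFF (UCS-2) or accepts that an index may land inside a surrogate pair.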

In UTF-8 only the 128 ASCII characters are encoded in 1 byte; all others take 2 to 4 bytes. This makes character processing less direct and more error-prone.
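To make "less direct" concrete, here is a minimal sketch of what finding the n-th character costs in UTF-8: the buffer has to be scanned from the start, skipping continuation bytes. Both function names are hypothetical, and the scan assumes the input is already valid UTF-8.

```python
def utf8_len(ch: str) -> int:
    """Number of bytes a single character needs in UTF-8."""
    return len(ch.encode("utf-8"))


def nth_char_offset(data: bytes, n: int) -> int:
    """Byte offset of the n-th code point (0-based) in UTF-8 `data`.

    Assumes `data` is well-formed UTF-8; no validation is done.
    """
    seen = 0
    for offset, byte in enumerate(data):
        # Continuation bytes have the bit pattern 10xxxxxx; only lead
        # bytes (everything else) start a new code point.
        if byte & 0xC0 != 0x80:
            if seen == n:
                return offset
            seen += 1
    raise IndexError("code point index out of range")


# One character each of 1, 2, 3 and 4 bytes, followed by ASCII "z".
sample = "a\u00e9\u65e5\U0001F600z".encode("utf-8")
sizes = [utf8_len(c) for c in "a\u00e9\u65e5\U0001F600"]
print(sizes, nth_char_offset(sample, 4))
```

With fixed-size 2-byte units the same lookup is a single multiplication, which is the trade-off being described here.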

Whichever encoding you choose, using Unicode is of course preferred, since it lets you handle international characters.

thinkbeforecoding answered Dec 02 '25 05:12


