I'm trying to understand how System.Text.Rune differs from the built-in char type, particularly in the context of handling Unicode characters beyond the Basic Multilingual Plane (BMP).
How does using Rune affect the processing and manipulation of non-ASCII or surrogate pair characters?
What are the advantages of using Rune over char for Unicode-aware string operations?
Could someone provide code examples that illustrate the key differences and benefits of using Rune?
I've been experimenting with some sample code to observe the practical differences, but I'd appreciate a clear explanation or demonstration from someone more experienced in Unicode handling in C#.
I tried the following code to see how Rune and char differ when processing Unicode characters:
using System;
using System.Linq; // needed for Count() on the rune enumerator
using System.Text;

class Program
{
    private static void Main()
    {
        const string input = "Hello 𝓦orld!"; // "𝓦" is a surrogate pair (outside BMP)

        // Counting characters using 'char'
        // Each UTF-16 code unit is counted as one character
        Console.WriteLine("Using char:");
        Console.WriteLine($"Character count: {input.Length}"); // Counts a surrogate pair as two

        // Counting characters using 'Rune'
        // Each full Unicode code point is counted as one
        Console.WriteLine("\nUsing Rune:");
        var runeCount = CountRunes(input);
        Console.WriteLine($"Character count: {runeCount}");
    }

    /// <summary>
    /// Counts the number of Unicode scalar values (code points) in a string.
    /// Uses string.EnumerateRunes, which handles surrogate pairs correctly.
    /// </summary>
    private static int CountRunes(string input)
    {
        return input.EnumerateRunes().Count();
    }
}
Here is Microsoft's official introduction article, which describes the terms I am using below:
Character encoding in .NET
Comments on your example:
Your string gives 13 characters for the char-based counter and 12 characters for the Rune counter. If “character” here means what the user sees, then 12 is the expected number.
.NET strings use UTF-16 encoding, which means that a string is made up of 16-bit code units (char values). Your “𝓦” is a supplementary code point: its hexadecimal value is 0x1D4E6, which is above the 16-bit range (up to 0xFFFF), so it is stored as two 16-bit chars in the string, called a “surrogate pair”. Rune handles this pair as one character, while the char-based counter handles it as two separate characters.
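To make the surrogate pair arithmetic concrete, here is a minimal sketch that splits 0x1D4E6 into its two UTF-16 code units by hand. The class and method names (SurrogateDemo, ToSurrogatePair) are made up for illustration; only char.IsSurrogatePair and char.ConvertToUtf32 are framework calls.

```csharp
using System;

static class SurrogateDemo
{
    // Hypothetical helper: splits a supplementary code point (0x10000..0x10FFFF)
    // into its UTF-16 high and low surrogate.
    public static (char High, char Low) ToSurrogatePair(int codePoint)
    {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF)
            throw new ArgumentOutOfRangeException(nameof(codePoint));

        int offset = codePoint - 0x10000;               // 20-bit value
        char high = (char)(0xD800 + (offset >> 10));    // top 10 bits
        char low = (char)(0xDC00 + (offset & 0x3FF));   // bottom 10 bits
        return (high, low);
    }

    static void Main()
    {
        var (high, low) = ToSurrogatePair(0x1D4E6);
        Console.WriteLine($"0x{(int)high:X4} + 0x{(int)low:X4}");   // 0xD835 + 0xDCE6
        Console.WriteLine(char.IsSurrogatePair(high, low));         // True
        Console.WriteLine($"0x{char.ConvertToUtf32(high, low):X}"); // 0x1D4E6
    }
}
```

In real code you would normally let Rune.EncodeToUtf16 or char.ConvertFromUtf32 do this for you; the manual version is only there to show where the two code units come from.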
As long as the string consists only of characters inside the 16-bit range (up to 0xFFFF), char and Rune give the same count. As soon as you go above the 16-bit range, there will be surrogate pairs in the string, and Rune handles them correctly. What Rune treats as one character is called a Unicode scalar value. Rune stores it as a 32-bit integer, and it can be any Unicode code point except those reserved for surrogates (0xD800 to 0xDFFF).
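The practical benefit goes beyond counting. A rough sketch, assuming the same input string as in the question: classification methods on char see only half of the pair, while the Rune equivalents see the whole scalar value. The class and helper names (RuneVsCharDemo, CharIsLetterAt, RuneIsLetterAt) are invented for this example.

```csharp
using System;
using System.Text;

static class RuneVsCharDemo
{
    // Hypothetical helper: classifies the single UTF-16 code unit at 'index'.
    public static bool CharIsLetterAt(string s, int index) => char.IsLetter(s[index]);

    // Hypothetical helper: classifies the scalar value that starts at 'index'.
    public static bool RuneIsLetterAt(string s, int index) =>
        Rune.IsLetter(Rune.GetRuneAt(s, index));

    static void Main()
    {
        string input = "Hello \U0001D4E6orld!"; // \U0001D4E6 is the script W from the question

        Console.WriteLine(CharIsLetterAt(input, 6)); // False: a lone high surrogate is not a letter
        Console.WriteLine(RuneIsLetterAt(input, 6)); // True: the full scalar value is a letter

        // Surrogate code points are not scalar values, so no Rune can hold one.
        Console.WriteLine(Rune.IsValid(0xD835));                // False
        Console.WriteLine(Rune.TryCreate('\uD835', out Rune _)); // False
    }
}
```

Note that Rune.GetRuneAt throws if the index points into the middle of a surrogate pair, so it suits indexes you already know start a character.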
It's worth noting that a “character” (from a user's perspective) can also be a combination of several Unicode scalar values, called a “grapheme cluster” or, in .NET, a “text element”. These combinations of scalar values are NOT handled by Rune as one character! The StringInfo class in the System.Globalization namespace handles this type of character.
In the introduction article, under the heading “Grapheme clusters”, there is a good comparison with examples of char, Rune, and combinations of scalar values.
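As a small illustration of the grapheme cluster point, the sketch below counts the same string three ways. It uses 'e' followed by a combining acute accent, which the user sees as one character. The class and method names (GraphemeDemo, CountUnits) are hypothetical; StringInfo.LengthInTextElements is the framework property doing the real work.

```csharp
using System;
using System.Globalization;
using System.Linq;

static class GraphemeDemo
{
    // Hypothetical helper: returns the length of 's' in UTF-16 code units,
    // Unicode scalar values, and text elements (grapheme clusters).
    public static (int Chars, int Runes, int TextElements) CountUnits(string s) =>
        (s.Length, s.EnumerateRunes().Count(), new StringInfo(s).LengthInTextElements);

    static void Main()
    {
        string input = "e\u0301"; // 'e' + U+0301 COMBINING ACUTE ACCENT

        var (chars, runes, elements) = CountUnits(input);
        Console.WriteLine($"char: {chars}, Rune: {runes}, text elements: {elements}");
        // char: 2, Rune: 2, text elements: 1
    }
}
```

More complex clusters, such as emoji sequences, may be segmented differently depending on the .NET version, so it is worth testing with the runtime you target.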
The following code is a simplified version, as suggested in Kay Zed’s comment, with additional code that shows each character (from a user's perspective) and its code point(s) and code units:
// Requires: using System; using System.Diagnostics; using System.Linq; using System.Text;
public static void TestCharRune()
{
    string input = "Hello 𝓦orld!";

    // Using char data type: Length counts UTF-16 code units
    int charCount = input.Length;
    Debug.WriteLine("Using char:");
    Debug.WriteLine("Character count: " + charCount);

    // Using Rune data type: counts Unicode scalar values
    int runeCount = input.EnumerateRunes().Count();
    Debug.WriteLine("\nUsing Rune:");
    Debug.WriteLine("Character count: " + runeCount);
    Debug.WriteLine("");

    int ix = 1;
    foreach (var x in input.EnumerateRunes())
    {
        // A rune encodes to at most two UTF-16 code units; chars[1] stays 0 for BMP runes
        char[] chars = new char[2];
        x.EncodeToUtf16(chars);
        string lowUnit = chars[1] != 0 ? $" + 0x{(int)chars[1]:X4}" : "";
        Debug.WriteLine($"\nIx: {ix++}: {x} Code point: 0x{x.Value:X4} (UTF-16: length: {x.Utf16SequenceLength} and code unit(s): 0x{(int)chars[0]:X4}{lowUnit})");
    }
}
Output:
Using char:
Character count: 13
Using Rune:
Character count: 12
Ix: 1: H Code point: 0x0048 (UTF-16: length: 1 and code unit(s): 0x0048)
Ix: 2: e Code point: 0x0065 (UTF-16: length: 1 and code unit(s): 0x0065)
Ix: 3: l Code point: 0x006C (UTF-16: length: 1 and code unit(s): 0x006C)
Ix: 4: l Code point: 0x006C (UTF-16: length: 1 and code unit(s): 0x006C)
Ix: 5: o Code point: 0x006F (UTF-16: length: 1 and code unit(s): 0x006F)
Ix: 6: Code point: 0x0020 (UTF-16: length: 1 and code unit(s): 0x0020)
Ix: 7: 𝓦 Code point: 0x1D4E6 (UTF-16: length: 2 and code unit(s): 0xD835 + 0xDCE6)
Ix: 8: o Code point: 0x006F (UTF-16: length: 1 and code unit(s): 0x006F)
Ix: 9: r Code point: 0x0072 (UTF-16: length: 1 and code unit(s): 0x0072)
Ix: 10: l Code point: 0x006C (UTF-16: length: 1 and code unit(s): 0x006C)
Ix: 11: d Code point: 0x0064 (UTF-16: length: 1 and code unit(s): 0x0064)
Ix: 12: ! Code point: 0x0021 (UTF-16: length: 1 and code unit(s): 0x0021)