I have the following string:
友𠂇又
The corresponding UTF-16 representation (little-endian) is
CB 53 40 D8 87 DC C8 53
\___/ \_________/ \___/
  友       𠂇       又
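The byte layout above can be checked with a short program. This is only a sketch: the class name Utf16Demo and the helper ToHex are made up for illustration, and the string is written with escape sequences so the result does not depend on how the source file is encoded.

```csharp
using System;
using System.Text;

public static class Utf16Demo
{
    // Hypothetical helper: UTF-16 little-endian bytes of s as hyphen-separated hex.
    public static string ToHex(string s) =>
        BitConverter.ToString(Encoding.Unicode.GetBytes(s));

    public static void Main()
    {
        string s = "\u53CB\U00020087\u53C8"; // 友𠂇又

        Console.WriteLine(ToHex(s)); // CB-53-40-D8-87-DC-C8-53

        // U+20087 lies outside the Basic Multilingual Plane,
        // so UTF-16 encodes it as a surrogate pair (two Char values).
        Console.WriteLine(char.IsHighSurrogate(s[1]));              // True
        Console.WriteLine(char.IsLowSurrogate(s[2]));               // True
        Console.WriteLine(char.ConvertToUtf32(s, 1).ToString("X")); // 20087
    }
}
```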
"友𠂇又".Length returns 4, because the CLR stores the string as four 2-byte Char values (UTF-16 code units), and 𠂇 takes two of them.
How do I measure the length of my string in Unicode characters? How do I split it into { "友", "𠂇", "又" }?
As documented:
The Length property returns the number of Char objects in this instance, not the number of Unicode characters. The reason is that a Unicode character might be represented by more than one Char. Use the System.Globalization.StringInfo class to work with each Unicode character instead of each Char.
Getting length:
new System.Globalization.StringInfo("友𠂇又").LengthInTextElements
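As a minimal sketch of the difference between the two counts (the class name LengthDemo and the helper TextElementCount are hypothetical, chosen only for this example):

```csharp
using System;
using System.Globalization;

public static class LengthDemo
{
    // Hypothetical helper: number of text elements in s.
    public static int TextElementCount(string s) =>
        new StringInfo(s).LengthInTextElements;

    public static void Main()
    {
        string s = "\u53CB\U00020087\u53C8"; // 友𠂇又

        Console.WriteLine(s.Length);            // 4 (UTF-16 code units)
        Console.WriteLine(TextElementCount(s)); // 3 (text elements)

        // A base letter followed by a combining mark is also one text element.
        string e = "e\u0301"; // e + COMBINING ACUTE ACCENT
        Console.WriteLine(e.Length);            // 2
        Console.WriteLine(TextElementCount(e)); // 1
    }
}
```

Note that a text element is not the same thing as a code point: as the second case suggests, a combining sequence made of several code points is still counted once.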
Getting each Unicode character is documented here, but it's much more convenient to make an extension method:
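// Requires "using System.Collections.Generic;". Extension methods must live in a static class (the name is arbitrary).
public static class StringExtensions
{
    // Yields each text element of s as its own string.
    public static IEnumerable<string> TextElements(this string s)
    {
        var en = System.Globalization.StringInfo.GetTextElementEnumerator(s);
        while (en.MoveNext())
        {
            yield return en.GetTextElement();
        }
    }
}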
and use it in a foreach or in a LINQ statement:
foreach (string segment in "友𠂇又".TextElements())
{
    Console.WriteLine(segment);
}
which can also be used to get the length (Count() requires using System.Linq):
Console.WriteLine("友𠂇又".TextElements().Count());
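To get the array asked for in the question, the same enumerator can fill a list that is then converted to string[]. The following is a self-contained sketch; SplitDemo and SplitTextElements are hypothetical names, not part of the framework.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

public static class SplitDemo
{
    // Hypothetical helper: splits s into one string per text element.
    public static string[] SplitTextElements(string s)
    {
        var parts = new List<string>();
        TextElementEnumerator en = StringInfo.GetTextElementEnumerator(s);
        while (en.MoveNext())
        {
            parts.Add(en.GetTextElement());
        }
        return parts.ToArray();
    }

    public static void Main()
    {
        string[] parts = SplitTextElements("\u53CB\U00020087\u53C8"); // 友𠂇又

        Console.WriteLine(parts.Length);    // 3
        Console.WriteLine(parts[1].Length); // 2 (the surrogate pair stays together)
        Console.WriteLine(string.Join(", ", parts));
    }
}
```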