Finding out Unicode character name in .Net

Tags:

.net

unicode

Is there a way in .NET to find out what Unicode name a certain character has?

If not, is there a library that can do this?

svick asked Sep 07 '25 15:09


2 Answers

It's easier than ever now, as there's a NuGet package named Unicode Information.

With this, you can just call:

UnicodeInfo.GetName(character)

Rik Hemsley answered Sep 09 '25 22:09


Here's a solution you can copy, paste, and compile right away.

First, download UnicodeData.txt from the Unicode Character Database (UCD) here: http://www.unicode.org/Public/UNIDATA/UnicodeData.txt

Next, add this code to your project to read the UCD and create a Dictionary for looking up the name of a .NET char value:

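using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;
using System.Text;

// Read every record, then map each BMP code unit to its character name.
string[] unicodedata = File.ReadAllLines( "UnicodeData.txt", Encoding.UTF8 );
Dictionary<char,string> charname_map = new Dictionary<char,string>( 65536 );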
for (int i = 0; i < unicodedata.Length; i++)
{
    string[] fields = unicodedata[i].Split( ';' );
    int char_code = int.Parse( fields[0], NumberStyles.HexNumber );
    string char_name = fields[1];
    if (char_code >= 0 && char_code <= 0xFFFF) //UTF-16 BMP code points only
    {
        bool is_range = char_name.EndsWith( ", First>" );
        if (is_range) //add all characters within a specified range
        {
            char_name = char_name.Replace( ", First", String.Empty ); //remove range indicator from name
            fields = unicodedata[++i].Split( ';' );
            int end_char_code = int.Parse( fields[0], NumberStyles.HexNumber );
            if (!fields[1].EndsWith( ", Last>" ))
                throw new Exception( "Expected end-of-range indicator." );
            for (int code_in_range = char_code; code_in_range <= end_char_code; code_in_range++)
                charname_map.Add( (char)code_in_range, char_name );
        }
        else
            charname_map.Add( (char)char_code, char_name );
    }
}

The UnicodeData.txt file is UTF-8 encoded and consists of one line of information per Unicode code point, except that large ranges (such as the CJK ideographs) are given as a pair of lines whose names end in ", First>" and ", Last>". Each line contains a semicolon-separated list of fields, where the first field is the code point in hexadecimal (with no prefix) and the second field is the character name. Information about the file format and the other fields each line contains can be found here: http://www.unicode.org/reports/tr44/#Format_Conventions
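To make the record layout concrete, here is a minimal sketch that splits a single line into its code point and name. The class and method names (UcdLineDemo, ParseLine) are made up for illustration, and the sample record is written out by hand in the UnicodeData.txt layout rather than read from the file.

```csharp
using System;
using System.Globalization;

// Hypothetical helper for illustration; not part of the answer's code.
public static class UcdLineDemo
{
    // Splits one UnicodeData.txt record into (code point, character name).
    public static (int CodePoint, string Name) ParseLine(string line)
    {
        string[] fields = line.Split(';');
        // Field 0 is the code point in bare hexadecimal, e.g. "00C2".
        int codePoint = int.Parse(fields[0], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
        // Field 1 is the character name (or a "<..., First>" range marker).
        return (codePoint, fields[1]);
    }
}
```

Calling UcdLineDemo.ParseLine on the record "00C2;LATIN CAPITAL LETTER A WITH CIRCUMFLEX;Lu;0;L;0041 0302;;;;N;LATIN CAPITAL LETTER A CIRCUMFLEX;;;00E2;" should yield the code point 0xC2 and the name LATIN CAPITAL LETTER A WITH CIRCUMFLEX; the remaining fields are ignored here.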

Once you use the above code to build a mapping of characters to character names, you just retrieve them from the map with something like this:

char c = 'Â';
string character_name;
if (!charname_map.TryGetValue( c, out character_name ))
    character_name = "<Character Name Missing>"; //character not found in map
//character_name should now contain "LATIN CAPITAL LETTER A WITH CIRCUMFLEX";

I suggest embedding the UnicodeData.txt file in your application resources and wrapping this code in a class that loads and parses the file once in a static initializer. To make calling code more readable, that class could expose an extension method on char, such as GetUnicodeName.

I've purposely restricted the values to the range 0 through 0xFFFF, because that's all a .NET UTF-16 char can hold. A .NET char doesn't actually represent a true "character" (also called a code point), but rather a UTF-16 code unit, since some "characters" require two code units. Such a pair of code units is called a surrogate pair, made up of a high and a low surrogate. Values above 0xFFFF (the largest value a 16-bit char can store) lie outside the Basic Multilingual Plane (BMP) and, in UTF-16, require two chars to encode. With this implementation, the individual code units of a surrogate pair end up with names like "<Non Private Use High Surrogate>", "<Private Use High Surrogate>", and "<Low Surrogate>".
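The wrapper class suggested above might look roughly like the sketch below. Everything named here (UnicodeNames, Parse, the file path) is a placeholder of mine, not something from the answer. It also uses Lazy<T> instead of a static constructor, so parsing is deferred until the first lookup, and it reads the file from disk rather than from an embedded resource to keep the example short.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;
using System.IO;

// Hypothetical wrapper; class, member, and file names are illustrative.
public static class UnicodeNames
{
    // Assumption: UnicodeData.txt sits in the working directory. An embedded
    // resource stream could be substituted here.
    private static readonly Lazy<Dictionary<char, string>> Map =
        new Lazy<Dictionary<char, string>>(() => Parse(File.ReadLines("UnicodeData.txt")));

    // Extension method so callers can write c.GetUnicodeName().
    public static string GetUnicodeName(this char c)
    {
        return Map.Value.TryGetValue(c, out string name) ? name : "<Character Name Missing>";
    }

    // Builds the BMP-only map from UnicodeData.txt records.
    public static Dictionary<char, string> Parse(IEnumerable<string> lines)
    {
        var map = new Dictionary<char, string>();
        int rangeStart = -1; // set while a "<..., First>" record awaits its "Last"
        foreach (string line in lines)
        {
            string[] fields = line.Split(';');
            int code = int.Parse(fields[0], NumberStyles.HexNumber, CultureInfo.InvariantCulture);
            string name = fields[1];
            if (code > 0xFFFF) continue; // a .NET char cannot hold these values

            if (name.EndsWith(", First>", StringComparison.Ordinal))
            {
                rangeStart = code; // remember where the range begins
            }
            else if (name.EndsWith(", Last>", StringComparison.Ordinal))
            {
                if (rangeStart < 0) throw new InvalidDataException("Range end without a start.");
                string rangeName = name.Replace(", Last", string.Empty);
                for (int c = rangeStart; c <= code; c++) map[(char)c] = rangeName;
                rangeStart = -1;
            }
            else
            {
                map[(char)code] = name;
            }
        }
        return map;
    }
}
```

With the data file in place, a call such as 'Â'.GetUnicodeName() would go through the cached map. Parse is public only so the parsing logic can be exercised with a handful of in-memory lines.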

Triynko answered Sep 10 '25 00:09