Don't unescape unicode characters in System.Text.Json

I have the following JSON string that I read with the System.Text.Json library and I want to deserialize:

{
  "name": "\ud092\u00d0"
}

When I use the JsonSerializer.Deserialize method, it of course decodes the \uXXXX escapes to the corresponding characters.

What I want is to disable this and get the string as it is in the original JSON string: \ud092\u00d0. How can I accomplish that?

Nikolay Kostov asked Oct 19 '25 04:10


1 Answer

You could create a custom JsonConverter<string> that uses the ValueSpan or ValueSequence properties of Utf8JsonReader to return the raw, "escaped" JSON string value when deserializing. That being said, the JSON standard offers two different escaping formats:

  • Unicode value escaping, which looks like \uXXXX, where XXXX is a 4-digit hexadecimal number corresponding to a UTF-16 code unit.

  • Standard value escaping, which provides eight compact escapes \", \\, \/, \b, \f, \n, \r and \t for ", \, /, backspace, formfeed, linefeed, carriage return and horizontal tab, respectively.

Your question indicates you don't want to unescape Unicode escapes to the appropriate characters -- but you don't indicate what you want to do about standard escapes.
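To make the difference between the two views of a string token concrete, here is a minimal sketch (the EscapeDemo class and ReadBoth method are hypothetical names used only for illustration). It reads a single JSON string token and returns both the unescaped value from GetString() and the raw bytes from ValueSpan, assuming the input fits in one contiguous buffer so that HasValueSequence is false:

```csharp
using System;
using System.Text;
using System.Text.Json;

// Hypothetical helper, for illustration only.
public static class EscapeDemo
{
    // Returns the unescaped value and the raw (still escaped) text of the
    // first token in the given JSON, which is assumed to be a string.
    public static (string? Unescaped, string Raw) ReadBoth(string json)
    {
        var reader = new Utf8JsonReader(Encoding.UTF8.GetBytes(json));
        reader.Read();

        // GetString() resolves both Unicode and standard escapes.
        string? unescaped = reader.GetString();

        // ValueSpan exposes the bytes between the quotes exactly as written.
        string raw = Encoding.UTF8.GetString(reader.ValueSpan);

        return (unescaped, raw);
    }
}
```

For the JSON text "\u0041\n", the unescaped value is the letter A followed by a linefeed, while the raw value keeps all eight characters \u0041\n as they appear in the JSON.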

If you want to retain all escapes when deserializing, you can do that quite simply with the following converter:

using System;
using System.Text;
using System.Text.Json;
using System.Text.Json.Serialization;

public class NoUnescapingStringConverter : JsonConverter<string>
{
    // TODO: decide whether to throw an exception on invalid UTF-8 byte sequences.
    static readonly UTF8Encoding encoding = new(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    public override string? Read(ref Utf8JsonReader reader, Type typeToConvert, JsonSerializerOptions options)
    {
        if (reader.TokenType != JsonTokenType.String)
            throw new JsonException();
        return reader.HasValueSequence 
            ? encoding.GetString(reader.ValueSequence)
            : encoding.GetString(reader.ValueSpan);
    }
    
    public override void Write(Utf8JsonWriter writer, string value, JsonSerializerOptions options) => 
        // TODO: decide how to write your string.
        writer.WriteRawValue("\""+value+"\"");
}

Then either apply to the name property of your model as follows:

public class Model
{
    [JsonConverter(typeof(NoUnescapingStringConverter))]
    public string name { get; set; } = "";
}

Or if you want to retain escapes for all string values throughout your data model, you can add it to JsonSerializerOptions.Converters as follows:

var options = new JsonSerializerOptions
{
    Converters = { new NoUnescapingStringConverter() },
};
var result = JsonSerializer.Deserialize<Model>(json, options);

Demo .NET 8 fiddle #1 here.

If you want to retain only Unicode escapes but unescape the standard escapes when deserializing, you will need to do that manually, making the converter somewhat more complicated:

using System;
using System.Text;
using System.Text.Json;
using System.Text.Json.Serialization;

public class NoUnescapingStringConverter : JsonConverter<string>
{
    // TODO: decide whether to throw an exception on invalid UTF-8 byte sequences.
    static readonly UTF8Encoding encoding = new(encoderShouldEmitUTF8Identifier: false, throwOnInvalidBytes: true);

    public override string? Read(ref Utf8JsonReader reader, Type typeToConvert, JsonSerializerOptions options)
    {
        if (reader.TokenType != JsonTokenType.String)
            throw new JsonException();
        if (!reader.ValueIsEscaped)
            return reader.GetString();
        var sb = new StringBuilder();
        var decoder = encoding.GetDecoder();
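        if (reader.HasValueSequence)
        {
            // A multi-byte UTF-8 character may straddle two segments, so decode each segment without flushing.
            foreach (var item in reader.ValueSequence)
                sb.Append(item.Span, decoder, false);
            // Always request a final flush: the 'completed' flag from Decoder.Convert does not
            // reveal whether the decoder is still buffering an incomplete trailing sequence.
            sb.Append(Array.Empty<byte>(), decoder, true);
        }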
        else
        {
            sb.Append(reader.ValueSpan, decoder, true);
        }
        
        return sb.UnescapeStandardJsonSequences().ToString();
    }
    
    public override void Write(Utf8JsonWriter writer, string value, JsonSerializerOptions options) => 
        // TODO: decide how to write your string.
        // writer.WriteRawValue("\""+value+"\"");
        writer.WriteStringValue(value);
}

public static class JsonExtensions
{
    public static StringBuilder UnescapeStandardJsonSequences(this StringBuilder sb)
    {
        int to = 0;
        for (int from = 0, length  = sb.Length; from < length; from++)
        {
            var ch = sb[from];
            if (ch == '\\' && from  < length - 1)
            {
                (ch, from) = sb[from + 1] switch
                {
                    '"' => ('"',  from+1),
                    '\\' => ('\\', from+1),
                    '/' => ('/',  from+1),
                    'b' => ('\b', from+1),
                    'f' => ('\f', from+1),
                    'n' => ('\n', from+1),
                    'r' => ('\r', from+1),
                    't' => ('\t', from+1),
                    _ => (ch, from),
                };
            }
            if (from != to)
                sb[to] = ch;
            to++;
        }
        sb.Length = to;
        return sb;
    }
    
    public static StringBuilder Append(this StringBuilder sb, ReadOnlySpan<byte> bytes, Decoder decoder, bool flush) =>
        sb.Append(bytes, decoder, flush, out var _);
        
    public static StringBuilder Append(this StringBuilder sb, ReadOnlySpan<byte> bytes, Decoder decoder, bool flush, out bool completed)
    {
        Span<char> chars = stackalloc char[256];
        int bytesUsed = -1;
        completed = true;

        while (bytes.Length > 0 && bytesUsed != 0)
        {
            decoder.Convert(bytes, chars, false, out bytesUsed, out var charsUsed, out completed);
            if (charsUsed > 0)
                sb.Append(chars.Slice(0, charsUsed));
            bytes = bytes.Slice(bytesUsed);
        }
        
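        if (flush)
        {
            // Flush whenever the caller asks for it. With flush == false in the loop above, Convert can
            // report completed == true while still buffering an incomplete trailing UTF-8 sequence.
            int charsUsed = decoder.GetChars(Array.Empty<byte>(), chars, true);
            if (charsUsed > 0)
                sb.Append(chars.Slice(0, charsUsed));
            completed = true;
        }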
        
        return sb;
    }
}

Here I tried to minimize intermediate allocations (e.g. for temporary strings or Regex matches) but it was still necessary to construct an intermediate StringBuilder.
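The reason a stateful Decoder is used for ValueSequence, rather than calling GetString() on each segment, is that a multi-byte UTF-8 character can be split across two segments. The following is a minimal sketch of that idea only (SplitDecodeDemo and DecodeSegments are hypothetical names, and each segment is assumed to decode to at most 64 chars):

```csharp
using System;
using System.Text;

// Hypothetical helper, for illustration only.
public static class SplitDecodeDemo
{
    // Decodes UTF-8 segments one at a time with a single stateful Decoder,
    // flushing only on the last segment.
    public static string DecodeSegments(params byte[][] segments)
    {
        var decoder = new UTF8Encoding(false, throwOnInvalidBytes: true).GetDecoder();
        var sb = new StringBuilder();
        Span<char> chars = stackalloc char[64]; // assumption: segments are small

        for (int i = 0; i < segments.Length; i++)
        {
            bool flush = i == segments.Length - 1;
            // The decoder keeps any incomplete trailing bytes until the next call.
            int n = decoder.GetChars(segments[i], chars, flush);
            sb.Append(chars.Slice(0, n));
        }
        return sb.ToString();
    }
}
```

For example, é is encoded as the two bytes 0xC3 0xA9. Passing them as two separate one-byte segments still yields é, whereas decoding each byte independently would not produce a valid character.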

Warning: with this approach, multiple different JSON strings may get mapped to the same .NET string. This could happen when the standard escape \\ gets unescaped to \ and the subsequent four characters happen to be hex digits, therefore making the result string appear to be an unescaped Unicode escape. For instance, the two JSON strings "\\u0020" and "\u0020" will both get deserialized to \u0020.
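The collision described in the warning can be reproduced with a much simpler stand-in for the converter. This sketch is not the converter above: it only reads the raw token via ValueSpan and unescapes \\ to \, which is enough to show the ambiguity (CollisionDemo and RawWithBackslashUnescaped are hypothetical names):

```csharp
using System;
using System.Text;
using System.Text.Json;

// Hypothetical helper, for illustration only.
public static class CollisionDemo
{
    // Simplified stand-in: keeps the raw token text but unescapes "\\" to "\".
    public static string RawWithBackslashUnescaped(string json)
    {
        var reader = new Utf8JsonReader(Encoding.UTF8.GetBytes(json));
        reader.Read();
        string raw = Encoding.UTF8.GetString(reader.ValueSpan);
        return raw.Replace(@"\\", @"\");
    }
}
```

Calling it with the JSON text "\\u0020" (an escaped backslash followed by u0020) and with "\u0020" (a Unicode escape for a space) returns the same six characters \u0020 in both cases, so the original form cannot be recovered from the result.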

Demo .NET 8 fiddle #2 here.

Notes:

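  • Your question was asked against .NET 6 but that has since reached end-of-life, so I wrote this answer against .NET 8. I believe the first converter should work in .NET 6 as well. The second converter uses Utf8JsonReader.ValueIsEscaped, which as far as I know was only added in .NET 7, so on .NET 6 you would need to remove that shortcut and always take the manual decoding path.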

  • You don't indicate whether you will need to re-serialize your JSON strings. If you do, you will need to decide how to implement Write(). If you are using the first version of NoUnescapingStringConverter which retains all escapes, you could write the string back in its raw form as follows:

    public override void Write(Utf8JsonWriter writer, string value, JsonSerializerOptions options) => 
        writer.WriteRawValue("\""+value+"\"");
    

    But if you are using the second version which retains only Unicode escapes, writing the raw value could result in malformed JSON if standard escapes were previously unescaped. You would need to re-escape the standard escapes but not the Unicode escapes, which is problematic because of the warning above: multiple different JSON strings can get mapped onto the same .NET string.

  • JSON string escaping is an encoding detail that ordinarily should be transparent to producers and consumers of JSON. Thus it is very unusual to need to capture raw, still-escaped values in your data model. You might want to re-examine your requirements to see if you have some XY Problem that might better be addressed differently.

    For instance, is your real problem that you need to deserialize some JSON that is subtly malformed, because the serializing system used the wrong values for the \uXXXX hex values? (This is, for instance, a known problem with JSON generated by Facebook's "backup your data" feature, which incorrectly uses UTF8 values for the XXXX numbers rather than UTF16 values.) If so, see .NET 6 System.Text.Json.JsonSerializer Deserialize UTF-8 escaped string.
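On the Write() question raised in the notes, the difference between the two writer calls can be seen in isolation. This is a minimal sketch (WriteDemo is a hypothetical name) that writes one string value with either Utf8JsonWriter.WriteRawValue or Utf8JsonWriter.WriteStringValue, using the default encoder, and returns the resulting JSON text:

```csharp
using System;
using System.IO;
using System.Text;
using System.Text.Json;

// Hypothetical helper, for illustration only.
public static class WriteDemo
{
    public static string Write(string value, bool raw)
    {
        using var stream = new MemoryStream();
        using (var writer = new Utf8JsonWriter(stream))
        {
            if (raw)
                // Emits the text as-is between quotes; it must already be valid JSON string content.
                writer.WriteRawValue("\"" + value + "\"");
            else
                // Escapes the value again, so every backslash becomes \\.
                writer.WriteStringValue(value);
        } // Disposing the writer flushes it to the stream.
        return Encoding.UTF8.GetString(stream.ToArray());
    }
}
```

For the retained value \ud092\u00d0, the raw path writes "\ud092\u00d0", which matches the original JSON, while WriteStringValue writes "\\ud092\\u00d0", which a later reader would decode as literal backslashes rather than as Unicode escapes.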

dbc answered Oct 21 '25 17:10