Stream to UTF8 String, without the byte[]

I have a stream whose next N bytes are a UTF8 encoded string. I want to create that string with the least overhead.

This works:

var bytes = new byte[n];
stream.Read(bytes, 0, n); // my actual code checks return value
var str = Encoding.UTF8.GetString(bytes);

In my benchmarking I see considerable time spent collecting garbage in the form of byte[] temporaries. If I can get rid of these, I can effectively halve my heap allocations.

The UTF8Encoding class doesn't have methods for working with streams.

I can use unsafe code, if that helps. I cannot reuse a byte[] buffer without ThreadLocal<byte[]> which seems to introduce more overhead than it alleviates. I do need to support UTF8 (ASCII won't cut it).

Is there an API or technique here that I'm missing?

Drew Noakes asked Dec 08 '25 10:12


1 Answer

You can't avoid allocating a byte[] when the encoding is UTF-8, because UTF-8 is variable-length: the length of the resulting string can be determined only after all of the bytes have been read and examined.
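To make the variable-length point concrete, here is a minimal sketch showing three inputs of the same byte length that decode to different character counts. The class and method names (Utf8LengthDemo, CharCount) are illustrative and not part of the original answer.

```csharp
using System;
using System.Text;

public static class Utf8LengthDemo
{
    // Returns how many UTF-16 chars the given UTF-8 bytes decode to.
    public static int CharCount(byte[] utf8)
    {
        return Encoding.UTF8.GetCharCount(utf8);
    }

    public static void Main()
    {
        // Escapes are used so the result does not depend on the source file encoding.
        byte[] ascii = Encoding.UTF8.GetBytes("abcd");            // 1 byte per char
        byte[] accented = Encoding.UTF8.GetBytes("\u00E9\u00E9"); // 2 bytes per char
        byte[] euro = Encoding.UTF8.GetBytes("\u20ACa");          // 3 bytes + 1 byte

        Console.WriteLine(ascii.Length + " bytes -> " + CharCount(ascii) + " chars");       // 4 bytes -> 4 chars
        Console.WriteLine(accented.Length + " bytes -> " + CharCount(accented) + " chars"); // 4 bytes -> 2 chars
        Console.WriteLine(euro.Length + " bytes -> " + CharCount(euro) + " chars");         // 4 bytes -> 2 chars
    }
}
```

Since n bytes can map to anywhere between roughly n/3 and n chars here, the string cannot be sized from n alone.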

Let's look at the UTF8Encoding.GetString method:

public override unsafe String GetString(byte[] bytes, int index, int count)
{
    // Avoid problems with empty input buffer
    if (bytes.Length == 0) return String.Empty;

    fixed (byte* pBytes = bytes)
        return String.CreateStringFromEncoding(
            pBytes + index, count, this);
}

It calls String.CreateStringFromEncoding, which first computes the length of the resulting string, then allocates the string and fills it with characters without any further allocation. UTF8Encoding.GetChars does not allocate anything either.

unsafe static internal String CreateStringFromEncoding(
    byte* bytes, int byteLength, Encoding encoding)
{
    // First pass: count the chars the bytes will decode to
    int stringLength = encoding.GetCharCount(bytes, byteLength, null);

    if (stringLength == 0)
        return String.Empty;

    // Second pass: allocate the string once and decode straight into it
    String s = FastAllocateString(stringLength);
    fixed (char* pTempChars = &s.m_firstChar)
    {
        encoding.GetChars(bytes, byteLength, pTempChars, stringLength, null);
    }

    return s;
}
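The same count-then-decode pattern can be sketched with public APIs only. This is a sketch under an assumption the original answer does not make: it requires a runtime that provides string.Create and the span-based Encoding.GetChars overload (.NET Core 2.1 or later). The TwoPassDecode class and its Decode method are hypothetical names used for illustration.

```csharp
using System;
using System.Text;

public static class TwoPassDecode
{
    // Decodes buffer[offset .. offset + count) as UTF-8 with a single string allocation.
    public static string Decode(byte[] buffer, int offset, int count)
    {
        // Pass 1: measure how many chars the bytes decode to.
        int charCount = Encoding.UTF8.GetCharCount(buffer, offset, count);
        if (charCount == 0)
            return string.Empty;

        // Pass 2: let the runtime allocate the string, then decode directly into it.
        // The state tuple carries the inputs so the lambda captures nothing.
        return string.Create(
            charCount,
            (Buffer: buffer, Offset: offset, Count: count),
            (Span<char> chars, (byte[] Buffer, int Offset, int Count) state) =>
            {
                Encoding.UTF8.GetChars(state.Buffer.AsSpan(state.Offset, state.Count), chars);
            });
    }

    public static void Main()
    {
        byte[] bytes = Encoding.UTF8.GetBytes("xxh\u00E9llo");
        // Skip the two leading 'x' bytes and decode the rest.
        Console.WriteLine(Decode(bytes, 2, bytes.Length - 2));
    }
}
```

Note that this still needs the bytes in memory before the string can be sized, which is the constraint described above.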

If you use a fixed-length encoding, you can allocate the string directly and call Encoding.GetChars on it. But you lose performance by calling Stream.ReadByte repeatedly, since there is no Stream.Read overload that accepts a byte* argument.

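// Must be a multiple of bytesPerCharacter so no character straddles two chunks
const int bufferSize = 256;

// Pre-size the string; it is filled in place below
string str = new string('\0', n / bytesPerCharacter);
byte* bytes = stackalloc byte[bufferSize];

fixed (char* pinnedChars = str)
{
    char* chars = pinnedChars;

    // i is the number of bytes still to read
    for (int i = n; i > 0; i -= bufferSize)
    {
        int byteCount = Math.Min(bufferSize, i);
        int charCount = byteCount / bytesPerCharacter;

        for (int j = 0; j < byteCount; ++j)
        {
            int b = stream.ReadByte();
            if (b < 0) throw new EndOfStreamException(); // ReadByte returns -1 at end of stream
            bytes[j] = (byte)b;
        }

        encoding.GetChars(bytes, byteCount, chars, charCount);

        chars += charCount;
    }
}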

So you are already creating the string in the most efficient way available. The only remaining improvement in this situation is to implement a ByteArrayCache class, similar to StringBuilderCache.

public static class ByteArrayCache
{
    // One cached array per thread, so no locking is needed
    [ThreadStatic]
    private static byte[] cachedInstance;

    // Larger arrays are never cached, which bounds the memory held per thread
    private const int maxArraySize = 1024;

    // Returns an array of at least 'size' bytes; it may be longer than requested
    public static byte[] Acquire(int size)
    {
        if (size <= maxArraySize)
        {
            byte[] instance = cachedInstance;

            if (instance != null && instance.Length >= size)
            {
                cachedInstance = null;
                return instance;
            }
        }

        return new byte[size];
    }

    // Keeps the array for reuse if it is small enough and larger than the cached one
    public static void Release(byte[] array)
    {
        if ((array != null && array.Length <= maxArraySize) &&
            (cachedInstance == null || cachedInstance.Length < array.Length))
        {
            cachedInstance = array;
        }
    }
}

Usage:

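var bytes = ByteArrayCache.Acquire(n);
stream.Read(bytes, 0, n); // check the return value, as in the question

// The cached array may be longer than n, so decode only the first n bytes
var str = Encoding.UTF8.GetString(bytes, 0, n);
ByteArrayCache.Release(bytes);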
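
Two details matter when the buffer is reused: Stream.Read may return fewer bytes than requested, and a reused array may be longer than n, so only the first n bytes should be decoded. The following is a minimal sketch of both points. The StreamStrings class and its ReadFully and ReadUtf8 helpers are hypothetical names, and a plain oversized array stands in for the cached one to keep the example self-contained.

```csharp
using System;
using System.IO;
using System.Text;

public static class StreamStrings
{
    // Reads exactly 'count' bytes into buffer[0 .. count), looping over short reads.
    public static void ReadFully(Stream stream, byte[] buffer, int count)
    {
        int total = 0;
        while (total < count)
        {
            int read = stream.Read(buffer, total, count - total);
            if (read == 0)
                throw new EndOfStreamException();
            total += read;
        }
    }

    // Decodes the next n bytes of the stream as UTF-8 using a caller-supplied buffer.
    public static string ReadUtf8(Stream stream, byte[] buffer, int n)
    {
        ReadFully(stream, buffer, n);
        // Decode only the n bytes just read, not the whole (possibly longer) buffer.
        return Encoding.UTF8.GetString(buffer, 0, n);
    }

    public static void Main()
    {
        var stream = new MemoryStream(Encoding.UTF8.GetBytes("h\u00E9llo world"));
        byte[] buffer = new byte[64]; // deliberately larger than needed
        int n = Encoding.UTF8.GetByteCount("h\u00E9llo");
        Console.WriteLine(ReadUtf8(stream, buffer, n));
    }
}
```

In real code the buffer would come from ByteArrayCache.Acquire(n) and be handed back with Release in a finally block, so that an exception during the read does not lose the cached array.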

Yoh Deadfall answered Dec 10 '25 00:12