StackExchange.Redis - Unexplainable time-out exception issue

We are experiencing issues in our integration within .NET Core 3.1 with the Azure Redis Cache. The exception thrown is

An unhandled exception has occurred while executing the request. StackExchange.Redis.RedisTimeoutException: Timeout awaiting response (outbound=1403KiB, inbound=5657KiB, 15000ms elapsed, timeout is 15000ms), command=EVAL, next: EVAL, inst: 0, qu: 0, qs: 709, aw: True, rs: ReadAsync, ws: Writing, in: 0, serverEndpoint: redis-scr-mns-dev.redis.cache.windows.net:6380, mc: 1/1/0, mgr: 10 of 10 available, clientName: xxxxxxxxxxxx, IOCP: (Busy=0,Free=1000,Min=4,Max=1000), WORKER: (Busy=7,Free=32760,Min=4,Max=32767), v: 2.1.58.34321 (Please take a look at this article for some common client-side issues that can cause timeouts: https://stackexchange.github.io/StackExchange.Redis/Timeouts)

Yes, I already read the article, and we are using the StackExchange.Redis NuGet package, latest version available. Steps we have already taken:

  • Set the minimum thread pool count, trying several values (ThreadPool.SetMinThreads(short.MaxValue, short.MaxValue);)
  • Increased the Redis timeout from the default 5 seconds to 15 seconds (going any higher will not solve it, I think, to be honest, as you will read a bit further :)) A sketch of both settings follows this list.
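
For reference, a minimal sketch of how those two mitigations can be applied with StackExchange.Redis; the values (short.MaxValue, 15000ms) are the ones from this post, not recommendations, and the host/password parameters are placeholders:

    using System.Threading;
    using StackExchange.Redis;

    static ConnectionMultiplexer ConnectWithMitigations(string host, string password)
    {
        // Raise the minimum thread pool size so bursts of completions
        // don't have to wait on thread pool growth.
        ThreadPool.SetMinThreads(short.MaxValue, short.MaxValue);

        var options = new ConfigurationOptions
        {
            EndPoints = { { host, 6380 } },
            Password = password,
            Ssl = true,
            SyncTimeout = 15000,   // raised from the 5000ms default
            AsyncTimeout = 15000
        };
        return ConnectionMultiplexer.Connect(options);
    }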

What is the setup, you ask?

  • A .NET Core 3.1 REST API running on the latest IIS, with a 3-worker-process setting, on a 4-core Windows server with 16GB of RAM (we don't see any extremes in the monitoring regarding CPU or memory)
  • Connected to an Azure Redis Cache, currently a Basic C5 with high network bandwidth and 23GB of memory (it was a lower tier before, so we tried scaling up)
  • Pushing requests to an Azure Service Bus at the end (no problems there)

A batch process is running and processing several tens of thousands of API calls (across several APIs), of which the one mentioned above is crashing against the Redis Cache with the time-out exception. The other APIs are running correctly and not timing out, but they currently connect to a different Redis cache (just to isolate this API's behavior). All APIs and batch programs use a custom NuGet package that contains the cache implementation, so we are sure it can't be an implementation issue in that one API; it is all shared code.

How do we use the cache? Well, via dependency injection we inject ISharedCacheStore, which is just our own interface on top of IDistributedCache to make sure only asynchronous calls are available, together with RedisCache, which is the implementation using Redis (ISharedCacheStore exists for future use of other caching mechanisms). We use Microsoft.Extensions.Caching.StackExchangeRedis, version 3.1.5, and the registration in startup is:

    // Bind cache settings from configuration and register the Redis-backed
    // distributed cache plus our own cache abstractions.
    services.Configure<CacheConfiguration>(options => configuration?.GetSection("CacheConfiguration").Bind(options))
        .AddStackExchangeRedisCache(s =>
        {
            s.Configuration = connectionString;
        })
        .AddTransient<IRedisCache, RedisCache>()
        .AddTransient<ISharedCacheStore, SharedCacheStore>();
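
For context, a minimal sketch of what such an async-only wrapper over IDistributedCache could look like; only the ISharedCacheStore and SharedCacheStore names come from the post, the members shown here are assumptions for illustration:

    using System.Threading;
    using System.Threading.Tasks;
    using Microsoft.Extensions.Caching.Distributed;

    // Async-only surface, so callers cannot reach the blocking
    // IDistributedCache members (Get/Set/Refresh/Remove).
    public interface ISharedCacheStore
    {
        Task<byte[]> GetAsync(string key, CancellationToken token = default);
        Task SetAsync(string key, byte[] value,
            DistributedCacheEntryOptions options, CancellationToken token = default);
    }

    public class SharedCacheStore : ISharedCacheStore
    {
        private readonly IDistributedCache _cache;

        public SharedCacheStore(IDistributedCache cache) => _cache = cache;

        public Task<byte[]> GetAsync(string key, CancellationToken token = default)
            => _cache.GetAsync(key, token);

        public Task SetAsync(string key, byte[] value,
            DistributedCacheEntryOptions options, CancellationToken token = default)
            => _cache.SetAsync(key, value, options, token);
    }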

We are out of ideas, to be honest. We don't see an issue with the Redis Cache instance in Azure, as it is not even near its peak when we get the time-outs. Server load hit about 80% on the lower pricing tier, and on the higher one it didn't even reach 10%.

According to Insights, we had about 4,000 cache hits per minute on the run we did, causing the roughly 10% server load.

UPDATE: It is worth mentioning that the batch and API are running in an on-premise environment today, not in the cloud; the move to the cloud is planned in the upcoming months. The same applies to the other APIs connecting to a Redis Cache that are NOT giving an issue.

Comparison

  • Another Azure Redis cache is getting 45K hits a minute without any issue whatsoever (from on-premise)
  • This one hits the time-out mark while not even reaching 10K hits per minute
asked Sep 06 '25 by Nico Degraef

1 Answer

There are a couple of possible things here:

  1. I don't know what that EVAL is doing; it could be that the Lua being executed is causing a blockage; the only way to know for sure would be to look at SLOWLOG (a client-side sketch follows after this list), but I don't know whether this is exposed on Azure Redis
  2. It could be that your payloads are saturating the available bandwidth - I don't know what you are transferring
  3. It could simply be a network/socket stall/break; they happen, especially with cloud - and the (relatively) high latency makes this especially painful
  4. We want to enable a new optional pooled (rather than multiplexed) model; this would in theory (the proof-of-concept worked great) avoid large backlogs, which means even if a socket fails: only that one call is impacted, rather than causing a cascade of failure; the limiting factor on this one is our time (and also, this needs to be balanced with any licensing implications from the redis provider; is there an upper bound on concurrent connections, for example)
  5. It could simply be a bug in the library code; if so, we're not seeing it here, but we don't use the same setup as you; we do what we can, but it is very hard to diagnose problems that we don't see, that only arise in someone else's at-cost setup that we can't readily replicate; plus ultimately: this isn't our day job :(
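
For point 1, a hedged sketch of reading the slow log from the client: StackExchange.Redis exposes SLOWLOG via IServer.SlowlogGet, though whether Azure Cache for Redis permits the SLOWLOG command may depend on the tier; the connection string below is a placeholder built from the endpoint in the question:

    using System;
    using StackExchange.Redis;

    static void DumpSlowlog()
    {
        var muxer = ConnectionMultiplexer.Connect(
            "redis-scr-mns-dev.redis.cache.windows.net:6380,password=...,ssl=True");
        var server = muxer.GetServer("redis-scr-mns-dev.redis.cache.windows.net", 6380);

        // Each entry records when the command ran, how long it took,
        // and the command with its arguments.
        foreach (var entry in server.SlowlogGet(10)) // last 10 slow commands
        {
            Console.WriteLine($"{entry.Time:u} {entry.Duration.TotalMilliseconds}ms " +
                              string.Join(" ", entry.Arguments));
        }
    }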

I don't think there's a simple "add this line and everything becomes great" answer here. These are non-trivial at-scale remote scenarios, that take a lot of investigation. And simply: the Azure folks don't pay for our time.

answered Sep 07 '25 by Marc Gravell