We created a logic app that pings a basic diagnostics endpoint every 30 seconds and logs the result. What we find is that the endpoint usually takes around 300-400ms to run, and then we see a sudden spike where it can take up to 50 seconds to return!
When we analyse the logs, we find that ThreadPool.PendingWorkItemCount returns around 100 items during these spikes, whereas during "normal" operation PendingWorkItemCount is always zero.
So it appears we're experiencing some form of thread pool exhaustion.
Is there any way to trace where this queued work is coming from? For example, if there's some kind of background process or expired cache that gets updated periodically, how can we trace it?
The ThreadPool class provides very few public methods/properties that allow us to examine this in detail.
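That said, the runtime does expose a little more through its "System.Runtime" event counters. A minimal in-process sketch, assuming the counter names that dotnet-counters reports (threadpool-queue-length, threadpool-thread-count); the class name is ours:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Tracing;
using System.Linq;

// Listens to the runtime's "System.Runtime" event counters in-process and logs the
// thread pool queue length and thread count every few seconds, so spikes can be
// lined up against the request logs.
public sealed class ThreadPoolCounterListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "System.Runtime")
        {
            // Ask the source to publish its counters every 5 seconds.
            EnableEvents(source, EventLevel.Informational, EventKeywords.All,
                new Dictionary<string, string?> { ["EventCounterIntervalSec"] = "5" });
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs e)
    {
        if (e.EventName != "EventCounters" || e.Payload is null) return;

        foreach (var counter in e.Payload.OfType<IDictionary<string, object>>())
        {
            var name = counter.TryGetValue("Name", out var n) ? n?.ToString() : null;
            if (name is "threadpool-queue-length" or "threadpool-thread-count")
            {
                counter.TryGetValue("Mean", out var mean);
                counter.TryGetValue("Increment", out var increment);
                Console.WriteLine($"{DateTime.UtcNow:O} {name}: {mean ?? increment}");
            }
        }
    }
}
```

Keeping a single instance alive for the app's lifetime (constructed once in Program.cs and never disposed) is enough; the logged timestamps can then be correlated with the 30-50 second windows.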
Example diagnostics:
```json
{
    "start": "2023-05-04T12:17:03.0518943Z",
    "end": "2023-05-04T12:17:06.6382781Z",
    "threadCount": 8,
    "pendingWorkItemCount": 32,
    "workerThreads": 32762,
    "completionPortThreads": 1000,
    "maxWorkerThreads": 32767,
    "maxCompletionPortThreads": 1000
}
```
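For context, the endpoint itself is nothing special; a minimal sketch of the kind of controller that produces figures like the JSON above (route and class names here are illustrative, and the real endpoint does its health-check work between start and end):

```csharp
using System;
using System.Threading;
using Microsoft.AspNetCore.Mvc;

// Illustrative diagnostics controller returning a ThreadPool snapshot in the same
// shape as the JSON above.
[Route("diagnostics/threadpool")]
public class ThreadPoolDiagnosticsController : Controller
{
    [HttpGet]
    public IActionResult Get()
    {
        var start = DateTime.UtcNow;

        // ... health-check work would run here ...

        ThreadPool.GetAvailableThreads(out var workerThreads, out var completionPortThreads);
        ThreadPool.GetMaxThreads(out var maxWorkerThreads, out var maxCompletionPortThreads);

        return Json(new
        {
            start,
            end = DateTime.UtcNow,
            threadCount = ThreadPool.ThreadCount,                   // threads currently in the pool
            pendingWorkItemCount = ThreadPool.PendingWorkItemCount, // queued but not yet started
            workerThreads,                                          // available worker threads
            completionPortThreads,                                  // available IOCP threads
            maxWorkerThreads,
            maxCompletionPortThreads
        });
    }
}
```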
We're experiencing a strange issue with one of our Azure App Services. At various unpredictable points in the day the app will suddenly appear to hang for around 30-50 seconds, where no requests get serviced. It's as if we're waiting on a cold start.
It's an ASP.NET MVC monolith running on .NET 7 (C#). It has a DI service layer, but it isn't API-based: everything is contained within one application. It uses Azure Redis extensively, has an Azure SQL backend, and also makes extensive use of Azure Storage (Tables, Blobs and Queues).
The app uses the async-await pattern throughout. There should be virtually no synchronous calls or anything that obviously blocks a thread. We cannot find anything that 'locks' any resource for any period of time.
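For anyone auditing a similar codebase: the classic way an "all async" app still exhausts the pool is sync-over-async, where a pool thread blocks on a Task whose continuation is itself queued behind it. A hedged sketch of the pattern worth grepping for (IDistributedCache is used purely as an example):

```csharp
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;

public class CacheReader
{
    // Sync-over-async: blocks a thread pool thread until the task completes.
    // A burst of these under load is a common cause of PendingWorkItemCount piling up.
    public string? GetValueBlocking(IDistributedCache cache, string key)
        => cache.GetStringAsync(key).Result; // same problem with .Wait() or .GetAwaiter().GetResult()

    // Non-blocking equivalent: the thread goes back to the pool while waiting.
    public async Task<string?> GetValueAsync(IDistributedCache cache, string key)
        => await cache.GetStringAsync(key);
}
```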
It doesn't really need to call any third party APIs, and we don't tend to use external CDNs much. Everything we need is pretty much inside the architecture described.
- The MVC app is running on P2V2 (2 vCPU, 7 GB RAM) and scaled out to two instances (session affinity on).
- The Redis instance is P1 Premium (6 GB cache).
- Azure SQL is Standard S4 (200 DTUs), geo-replicated between UK South (R/W) and UK West (R/O). In our application we use both connection strings: read-only queries are directed to UK West and upsert/delete operations to UK South, thereby "load-balancing" the SQL server.
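The read/write split is nothing more elaborate than picking a connection string per operation; a minimal sketch, assuming Microsoft.Data.SqlClient and configuration key names of our own invention:

```csharp
using Microsoft.Data.SqlClient;
using Microsoft.Extensions.Configuration;

// Hypothetical helper that picks the geo-replica connection string per operation.
// "SqlReadWrite" points at UK South (the R/W primary) and "SqlReadOnly" at UK West
// (the R/O secondary); the key names are illustrative, not a real config contract.
public sealed class SqlConnectionFactory
{
    private readonly string _readWrite;
    private readonly string _readOnly;

    public SqlConnectionFactory(IConfiguration config)
    {
        _readWrite = config.GetConnectionString("SqlReadWrite")!;
        _readOnly  = config.GetConnectionString("SqlReadOnly")!;
    }

    // Upserts and deletes go to the primary; plain reads go to the read-only replica.
    public SqlConnection CreateConnection(bool readOnly) =>
        new SqlConnection(readOnly ? _readOnly : _readWrite);
}
```

With active geo-replication the secondary has its own server name, so two plain connection strings are enough for this split.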
During "normal" operation, the application is extremely quick, like in the low ms range. However, for no identifiable reason several times per day (perhaps 5 times) the application suddenly "hangs" on both instances for up to 50 seconds. During this time, the browser spins and nothing appears to be happening. Then all of a sudden the requests are serviced and it goes back to great performance. It's as if the app is "cold booting" but it's not - we were using it perfectly well seconds before.
During these periods we check as many diagnostic sources as we can, but have found nothing that points towards the sudden hang. For example:
- All the pieces of the architecture are well over-specced for our requirements at this stage.
- There are no other obvious pieces of architecture that we can put our finger on, such as firewalls, etc.
- The issue feels "internal" to MVC, .NET or the App Service itself. We cannot replicate it locally in development and we cannot predict when it will happen in production.
- We've considered GC collections, database connection pool recycling, etc., but cannot find any data to suggest these are the issue (a quick check for the GC angle is sketched just after this list).
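On the GC point: the runtime keeps a pause history that can be logged from the same diagnostics endpoint, which makes it easy to see whether a blocking collection lines up with a hang. A small sketch (GC.GetGCMemoryInfo and its PauseDurations/PauseTimePercentage members exist from .NET 5 onwards, so they are available on .NET 7):

```csharp
using System;
using System.Linq;

// Logs the most recent GC's pause durations plus the overall pause percentage.
// If the 30-50 second hangs were GC-driven, pauses of that order would show up here.
public static class GcPauseProbe
{
    public static string Snapshot()
    {
        GCMemoryInfo info = GC.GetGCMemoryInfo();

        string pauses = string.Join(", ",
            info.PauseDurations.ToArray().Select(p => $"{p.TotalMilliseconds:F1} ms"));

        return $"gen={info.Generation} pause%={info.PauseTimePercentage:F2} pauses=[{pauses}]";
    }
}
```

Including GcPauseProbe.Snapshot() in the diagnostics endpoint the logic app already pings puts GC data next to the thread pool data with no extra plumbing.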
Is it possible that Application Insights itself is causing the issue? Does it periodically dump or flush data/caches? It feels like something in the platform, hosting or framework is causing this.
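One hedged way to test the Application Insights theory, short of removing it entirely, is to turn its in-process background features off one at a time via ApplicationInsightsServiceOptions (property names are from the Microsoft.ApplicationInsights.AspNetCore package; whether any of them is actually implicated is exactly what we don't know):

```csharp
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Dial back Application Insights' background work one switch at a time
// and watch whether the hang frequency changes.
builder.Services.AddApplicationInsightsTelemetry(options =>
{
    options.EnableAdaptiveSampling = false;                    // in-process adaptive sampling
    options.EnableQuickPulseMetricStream = false;              // Live Metrics stream
    options.EnablePerformanceCounterCollectionModule = false;  // perf counter collection
});

var app = builder.Build();
app.Run();
```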
We're a bit stumped. It's frustrating because other than these momentary spikes throughout the day, the app is running really well and super quick.
I've raised an issue with Azure Support and await their feedback, but has anybody else had similar experiences with similar architectures? Do you have any suggestions we could look at, or any logs/diagnostics we could add to trace where this issue may be coming from?
The problem has now been solved. We changed three things:
1. SignalR: we hadn't activated the WebSockets option in the App Service configuration, and we were seeing a large number of requests until we did (see the client-side sketch after this list). See this issue for more information: High number of SignalR requests compared to rest of application.
2. There was an imbalance in SQL Server tiers on one of our replicas: whilst UK South and UK West were both S4, we had a third replica set at S1. We removed this third replica, as it was not needed.
3. We switched Application Insights off in the Azure blade.
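On the SignalR point: with WebSockets disabled on the App Service, clients fall back to other transports (typically long polling), which is where extra request volume can come from. A hedged client-side sketch, using Microsoft.AspNetCore.SignalR.Client with a hypothetical hub URL, that forces the WebSocket transport so a misconfiguration fails loudly instead of degrading silently:

```csharp
using Microsoft.AspNetCore.Http.Connections;
using Microsoft.AspNetCore.SignalR.Client;

// Force the WebSocket transport: if WebSockets is off on the App Service,
// this connection fails immediately rather than quietly falling back to long polling.
var connection = new HubConnectionBuilder()
    .WithUrl("https://example-app.azurewebsites.net/hubs/notifications", options =>
    {
        options.Transports = HttpTransportType.WebSockets;
        options.SkipNegotiation = true; // only valid when WebSockets is the sole transport
    })
    .WithAutomaticReconnect()
    .Build();

await connection.StartAsync();
```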
As soon as these three changes were made, the issue resolved immediately.
Unfortunately, due to commercial constraints, we couldn't afford any more time to investigate the issue and isolate which of these changes fixed the problem. Hopefully this gives somebody with a similar problem something to look into further.