We created a logic app that pings a basic diagnostics endpoint every 30 seconds and logs the result. What we find is that the endpoint usually takes around 300-400ms to run, and then we see a sudden spike where it can take up to 50 seconds to return!
When we analyse the logs, we find that ThreadPool.PendingWorkItemCount returns around 100 items during these spikes, whereas during "normal" operation PendingWorkItemCount is always zero.
So it appears we're experiencing some form of thread pool exhaustion.
Is there any way to trace where this queued work is coming from? For example, if there's some kind of background process or expired cache that gets updated periodically, how can we trace it?
The ThreadPool class provides very few public methods/properties that allow us to examine this in detail.
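That said, the runtime does expose a little more through its "System.Runtime" event counters. A minimal in-process sketch, assuming the counter names that dotnet-counters reports (threadpool-queue-length, threadpool-thread-count); the class name is ours:

```csharp
using System;
using System.Collections.Generic;
using System.Diagnostics.Tracing;
using System.Linq;

// Listens to the runtime's "System.Runtime" event counters in-process and logs the
// thread pool queue length and thread count every few seconds, so spikes can be
// lined up against the request logs.
public sealed class ThreadPoolCounterListener : EventListener
{
    protected override void OnEventSourceCreated(EventSource source)
    {
        if (source.Name == "System.Runtime")
        {
            // Ask the source to publish its counters every 5 seconds.
            EnableEvents(source, EventLevel.Informational, EventKeywords.All,
                new Dictionary<string, string?> { ["EventCounterIntervalSec"] = "5" });
        }
    }

    protected override void OnEventWritten(EventWrittenEventArgs e)
    {
        if (e.EventName != "EventCounters" || e.Payload is null) return;

        foreach (var counter in e.Payload.OfType<IDictionary<string, object>>())
        {
            var name = counter.TryGetValue("Name", out var n) ? n?.ToString() : null;
            if (name is "threadpool-queue-length" or "threadpool-thread-count")
            {
                counter.TryGetValue("Mean", out var mean);
                counter.TryGetValue("Increment", out var increment);
                Console.WriteLine($"{DateTime.UtcNow:O} {name}: {mean ?? increment}");
            }
        }
    }
}
```

Keeping a single instance alive for the app's lifetime (constructed once in Program.cs and never disposed) is enough; the logged timestamps can then be correlated with the 30-50 second windows.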
Example diagnostics:
```json
{
    "start": "2023-05-04T12:17:03.0518943Z",
    "end": "2023-05-04T12:17:06.6382781Z",
    "threadCount": 8,
    "pendingWorkItemCount": 32,
    "workerThreads": 32762,
    "completionPortThreads": 1000,
    "maxWorkerThreads": 32767,
    "maxCompletionPortThreads": 1000
}
```
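For context, the endpoint itself is nothing special; a minimal sketch of the kind of controller that produces figures like the JSON above (route and class names here are illustrative, and the real endpoint does its health-check work between start and end):

```csharp
using System;
using System.Threading;
using Microsoft.AspNetCore.Mvc;

// Illustrative diagnostics controller returning a ThreadPool snapshot in the same
// shape as the JSON above.
[Route("diagnostics/threadpool")]
public class ThreadPoolDiagnosticsController : Controller
{
    [HttpGet]
    public IActionResult Get()
    {
        var start = DateTime.UtcNow;

        // ... health-check work would run here ...

        ThreadPool.GetAvailableThreads(out var workerThreads, out var completionPortThreads);
        ThreadPool.GetMaxThreads(out var maxWorkerThreads, out var maxCompletionPortThreads);

        return Json(new
        {
            start,
            end = DateTime.UtcNow,
            threadCount = ThreadPool.ThreadCount,                   // threads currently in the pool
            pendingWorkItemCount = ThreadPool.PendingWorkItemCount, // queued but not yet started
            workerThreads,                                          // available worker threads
            completionPortThreads,                                  // available IOCP threads
            maxWorkerThreads,
            maxCompletionPortThreads
        });
    }
}
```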
We're experiencing a strange issue with one of our Azure App Services. At various unpredictable points in the day the app will suddenly appear to hang for around 30-50 seconds, where no requests get serviced. It's as if we're waiting on a cold start.
It's an ASP.NET MVC monolith running on .NET 7 (C#). It has a DI service layer, but it isn't API-based: everything is contained within one application. It uses Azure Redis extensively, has an Azure SQL backend, and also makes extensive use of Azure Storage (Tables, Blobs and Queues).
The app uses the async-await pattern throughout. There should be virtually no synchronous calls or anything that obviously blocks a thread. We cannot find anything that 'locks' any resource for any period of time.
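For anyone auditing a similar codebase: the classic way an "all async" app still exhausts the pool is sync-over-async, where a pool thread blocks on a Task whose continuation is itself queued behind it. A hedged sketch of the pattern worth grepping for (IDistributedCache is used purely as an example):

```csharp
using System.Threading.Tasks;
using Microsoft.Extensions.Caching.Distributed;

public class CacheReader
{
    // Sync-over-async: blocks a thread pool thread until the task completes.
    // A burst of these under load is a common cause of PendingWorkItemCount piling up.
    public string? GetValueBlocking(IDistributedCache cache, string key)
        => cache.GetStringAsync(key).Result; // same problem with .Wait() or .GetAwaiter().GetResult()

    // Non-blocking equivalent: the thread goes back to the pool while waiting.
    public async Task<string?> GetValueAsync(IDistributedCache cache, string key)
        => await cache.GetStringAsync(key);
}
```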
It doesn't really need to call any third party APIs, and we don't tend to use external CDNs much. Everything we need is pretty much inside the architecture described.
- The MVC app is running on P2V2 (2 vCPU, 7 GB RAM) and scaled out to two instances (session affinity on).
- The Redis instance is P1 Premium (6 GB cache).
- Azure SQL is Standard S4 (200 DTUs), geo-replicated between UK South (R/W) and UK West (R/O). In our application we use both connection strings: read-only queries are directed to UK West and upsert/delete operations to UK South, thereby "load-balancing" the SQL server.
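The read/write split is nothing more elaborate than picking a connection string per operation; a minimal sketch, assuming Microsoft.Data.SqlClient and configuration key names of our own invention:

```csharp
using Microsoft.Data.SqlClient;
using Microsoft.Extensions.Configuration;

// Hypothetical helper that picks the geo-replica connection string per operation.
// "SqlReadWrite" points at UK South (the R/W primary) and "SqlReadOnly" at UK West
// (the R/O secondary); the key names are illustrative, not a real config contract.
public sealed class SqlConnectionFactory
{
    private readonly string _readWrite;
    private readonly string _readOnly;

    public SqlConnectionFactory(IConfiguration config)
    {
        _readWrite = config.GetConnectionString("SqlReadWrite")!;
        _readOnly  = config.GetConnectionString("SqlReadOnly")!;
    }

    // Upserts and deletes go to the primary; plain reads go to the read-only replica.
    public SqlConnection CreateConnection(bool readOnly) =>
        new SqlConnection(readOnly ? _readOnly : _readWrite);
}
```

With active geo-replication the secondary has its own server name, so two plain connection strings are enough for this split.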
During "normal" operation, the application is extremely quick, like in the low ms range. However, for no identifiable reason several times per day (perhaps 5 times) the application suddenly "hangs" on both instances for up to 50 seconds. During this time, the browser spins and nothing appears to be happening. Then all of a sudden the requests are serviced and it goes back to great performance. It's as if the app is "cold booting" but it's not - we were using it perfectly well seconds before.
During these periods we check as many diagnostic sources as we can, but have found nothing that points towards the sudden hang. For example:
- All the pieces of the architecture are well over-specced for our requirements at this stage.
- There are no other obvious pieces of architecture that we can put our finger on, such as firewalls, etc.
- The issue feels "internal" to MVC, .NET or the App Service itself. We cannot replicate it locally in development and we cannot predict when it will happen in production.
- We've considered GC collections, database connection pool recycling, etc., but cannot find any data to suggest these are the issue (a quick check for the GC angle is sketched just after this list).
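On the GC point: the runtime keeps a pause history that can be logged from the same diagnostics endpoint, which makes it easy to see whether a blocking collection lines up with a hang. A small sketch (GC.GetGCMemoryInfo and its PauseDurations/PauseTimePercentage members exist from .NET 5 onwards, so they are available on .NET 7):

```csharp
using System;
using System.Linq;

// Logs the most recent GC's pause durations plus the overall pause percentage.
// If the 30-50 second hangs were GC-driven, pauses of that order would show up here.
public static class GcPauseProbe
{
    public static string Snapshot()
    {
        GCMemoryInfo info = GC.GetGCMemoryInfo();

        string pauses = string.Join(", ",
            info.PauseDurations.ToArray().Select(p => $"{p.TotalMilliseconds:F1} ms"));

        return $"gen={info.Generation} pause%={info.PauseTimePercentage:F2} pauses=[{pauses}]";
    }
}
```

Including GcPauseProbe.Snapshot() in the diagnostics endpoint the logic app already pings puts GC data next to the thread pool data with no extra plumbing.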
Is it possible that Application Insights itself is causing the issue? Does it periodically dump or flush data/caches? It feels like something in the platform, hosting or framework is causing this.
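One hedged way to test the Application Insights theory, short of removing it entirely, is to turn its in-process background features off one at a time via ApplicationInsightsServiceOptions (property names are from the Microsoft.ApplicationInsights.AspNetCore package; whether any of them is actually implicated is exactly what we don't know):

```csharp
using Microsoft.Extensions.DependencyInjection;

var builder = WebApplication.CreateBuilder(args);

// Dial back Application Insights' background work one switch at a time
// and watch whether the hang frequency changes.
builder.Services.AddApplicationInsightsTelemetry(options =>
{
    options.EnableAdaptiveSampling = false;                    // in-process adaptive sampling
    options.EnableQuickPulseMetricStream = false;              // Live Metrics stream
    options.EnablePerformanceCounterCollectionModule = false;  // perf counter collection
});

var app = builder.Build();
app.Run();
```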
We're a bit stumped. It's frustrating because other than these momentary spikes throughout the day, the app is running really well and super quick.
I've raised an issue with Azure Support and await their feedback, but has anybody else had similar experiences with similar architectures? Do you have any suggestions we could look at, or any logs/diagnostics we could add to trace where this issue may be coming from?
The problem has now been solved. We changed three things:
1. SignalR: we hadn't activated the WebSockets option in the App Service configuration, and we were seeing a large number of requests until we did (see the client-side sketch after this list). See this issue for more information: High number of SignalR requests compared to rest of application.
2. There was an imbalance in SQL Server tiers on one of our replicas: whilst UK South and UK West were both S4, we had a third replica set at S1. We removed this third replica, as it was not needed.
3. We switched Application Insights off in the Azure blade.
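On the SignalR point: with WebSockets disabled on the App Service, clients fall back to other transports (typically long polling), which is where extra request volume can come from. A hedged client-side sketch, using Microsoft.AspNetCore.SignalR.Client with a hypothetical hub URL, that forces the WebSocket transport so a misconfiguration fails loudly instead of degrading silently:

```csharp
using Microsoft.AspNetCore.Http.Connections;
using Microsoft.AspNetCore.SignalR.Client;

// Force the WebSocket transport: if WebSockets is off on the App Service,
// this connection fails immediately rather than quietly falling back to long polling.
var connection = new HubConnectionBuilder()
    .WithUrl("https://example-app.azurewebsites.net/hubs/notifications", options =>
    {
        options.Transports = HttpTransportType.WebSockets;
        options.SkipNegotiation = true; // only valid when WebSockets is the sole transport
    })
    .WithAutomaticReconnect()
    .Build();

await connection.StartAsync();
```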
As soon as these three changes were made, the issue resolved immediately.
Unfortunately, due to commercial constraints, we couldn't afford any more time to investigate the issue and isolate which of these changes fixed the problem. Hopefully this gives somebody with a similar problem something to look into further.