We measured some performnace tests and I noticed that the CPU is running a lot of time in kernel mode. I'd like to know why is that.
The application: it's classic Azure Cloud service web role where Owin is listening under the IIS and Owin itself serves just static files that are cached in memory (so there should be only a little performance penalty and everyting should be pretty fast). The content is copied via await stream.CopyToAsync(response.Body) to output stream.
The test itself looks like this in gatling:
val openLoginSet = exec(http("ROOT")
      .get("/")
      .headers(Headers105Test2.headers_0)
      .resources(
        http("MED: arrow-down-small.png").get(uriIconAssets + "/arrow-down-small.png").headers(Headers105Test2.headers_1),
        http("MED: arrow-up-small.png").get(uriIconAssets + "/arrow-up-small.png").headers(Headers105Test2.headers_1),
        http("MED: close-medium.png").get(uriIconAssets + "/close-medium.png").headers(Headers105Test2.headers_1),
        http("MED: decline-medium.png").get(uriIconAssets + "/decline-medium.png").headers(Headers105Test2.headers_1),
        http("MED: help-medium.png").get(uriIconAssets + "/help-medium.png").headers(Headers105Test2.headers_1),
        http("MED: submit-medium.png").get(uriIconAssets + "/submit-medium.png").headers(Headers105Test2.headers_1),
        http("MED: delete-medium.png").get(uriIconAssets + "/delete-medium.png").headers(Headers105Test2.headers_1),
        http("MED: en-us.js").get("/en-us.js").headers(Headers105Test2.headers_8),
        http("MED: cloud_logo_big.png").get("/assets/cloud_logo_big.png").headers(Headers105Test2.headers_1),
        http("MED: favicon.ico").get("/favicon.ico").headers(Headers105Test2.headers_0))
val httpProtocol = http
  .baseURL("https://myurl.com")
  .inferHtmlResources()
val openLoginSenario = scenario("OpenOnly").exec(repeat(400, "n") {
    exec(openLoginSet).pause(3,6)
})
setUp(openLoginSenario.inject(rampUsers(150) over (3 minutes)))
  .protocols(httpProtocol)
  .maxDuration(3 minutes)
(I shortened the test to run 3 minutes just to catch data to show here) There are 3 computers that run this gatling test, each up to 150 concurrent threads, so 450 threads in total.
What I see is that there is a lot running code in kernel and W3wp process doesn't take most of the CPU:
Captured CPU when the test just started (the cpu is rising when new threads are added):

Captured CPU when the tests are nearly before end:

The kernel mode looks pretty bad and I'm not sure what might cause it. There should be almost no locks involved. When reading what else might cause the high kernel mode, I found that DPCs might cause it. So I captured some DPC data as well, but I'm not sure what's normal and what not. Anyway, the graph with DPC max times is also included in the sshot.
The vmbus.sys takes most significant time from all DPCs. That means that the Azure instance is not any bare metal (not suprising) and that the instance shares it's computional power with others. As I understand it, vmbus.sys is responsible for communication between e.g. network card itself and the hosted HyperV instance. Might running in HyperV be the main cause for low performance?
I'd like to know where to look and how to find out what causes the kernel mode in my situation.
Some more data:
Part of DPC data when the test started (taken in 30 sec):
Total = 17887 for module vmbus.sys
Elapsed Time, >        0 usecs AND <=        1 usecs,    137, or   0.77%
Elapsed Time, >        1 usecs AND <=        2 usecs,   2148, or  12.01%
Elapsed Time, >        2 usecs AND <=        4 usecs,   3941, or  22.03%
Elapsed Time, >        4 usecs AND <=        8 usecs,   2291, or  12.81%
Elapsed Time, >        8 usecs AND <=       16 usecs,   5182, or  28.97%
Elapsed Time, >       16 usecs AND <=       32 usecs,   3305, or  18.48%
Elapsed Time, >       32 usecs AND <=       64 usecs,    786, or   4.39%
Elapsed Time, >       64 usecs AND <=      128 usecs,     85, or   0.48%
Elapsed Time, >      128 usecs AND <=      256 usecs,      6, or   0.03%
Elapsed Time, >      256 usecs AND <=      512 usecs,      1, or   0.01%
Elapsed Time, >      512 usecs AND <=     1024 usecs,      2, or   0.01%
Elapsed Time, >     1024 usecs AND <=     2048 usecs,      0, or   0.00%
Elapsed Time, >     2048 usecs AND <=     4096 usecs,      1, or   0.01%
Elapsed Time, >     4096 usecs AND <=     8192 usecs,      2, or   0.01%
Total,                                                 17887
Part of DPC data when the test was ending (taken in 30 sec):
Total = 141796 for module vmbus.sys
Elapsed Time, >        0 usecs AND <=        1 usecs,   7703, or   5.43%
Elapsed Time, >        1 usecs AND <=        2 usecs,  21075, or  14.86%
Elapsed Time, >        2 usecs AND <=        4 usecs,  17301, or  12.20%
Elapsed Time, >        4 usecs AND <=        8 usecs,  38988, or  27.50%
Elapsed Time, >        8 usecs AND <=       16 usecs,  32028, or  22.59%
Elapsed Time, >       16 usecs AND <=       32 usecs,  11861, or   8.36%
Elapsed Time, >       32 usecs AND <=       64 usecs,   7034, or   4.96%
Elapsed Time, >       64 usecs AND <=      128 usecs,   5038, or   3.55%
Elapsed Time, >      128 usecs AND <=      256 usecs,    606, or   0.43%
Elapsed Time, >      256 usecs AND <=      512 usecs,     53, or   0.04%
Elapsed Time, >      512 usecs AND <=     1024 usecs,     26, or   0.02%
Elapsed Time, >     1024 usecs AND <=     2048 usecs,     11, or   0.01%
Elapsed Time, >     2048 usecs AND <=     4096 usecs,     10, or   0.01%
Elapsed Time, >     4096 usecs AND <=     8192 usecs,     53, or   0.04%
Elapsed Time, >     8192 usecs AND <=    16384 usecs,      3, or   0.00%
Elapsed Time, >    16384 usecs AND <=    32768 usecs,      1, or   0.00%
Elapsed Time, >    32768 usecs AND <=    65536 usecs,      5, or   0.00%
Total,                                                141796
% DPC Time from start to end of the test

We also suspected that we reached the network limits - so the tests 'download' so much data that the network adapter's limits are reached. This might be true during end of the test (when there are maximal number of threads), but this doesn't explain why the there is so much kernel mode time even at the beginning of the test.
Just to show how much data is sent - the volume of sent data (cyan line) is 2 orders of magnitude lower than capacity of the network adapter.

This might not help you directly. but we had some performance issues after moving an application to cloud. Please find the discussion here:
Huge performance drop after moving to Azure
After a lot of investigation finally we found out our issue was with the Transient Fault Handling mechanism. It used to go and read the web.config each time causing huge CPU usage which was not an issue with same code in non-cloud environment. We handled it by using a Singleton pattern around it.
Hope it helps you to find out if there is any such issues with the application.
Cheers. :)
If you love us? You can donate to us via Paypal or buy me a coffee so we can maintain and grow! Thank you!
Donate Us With