I'm trying to debug a production issue with a Windows service that tends to fall over rapidly once a number of concurrent connections are active. Using a core dump and DebugDiag, I discovered that there was a pending GC operation which could not start until several threads with preemptive GC disabled completed their work.
Here is a sample thread dump from WinDbg showing the offending threads:
26   6e  1444 00..440   8009222 Disabled 00..200:00..f88 00..7a0     0 MTA (Threadpool Completion Port)
27   c1  1a0c 00..fe0   8009222 Disabled 00..e90:00..f88 00..7a0     0 MTA (Threadpool Completion Port)
28   b5  17bc 00..6f0   8009222 Disabled 00..268:00..f88 00..7a0     0 MTA (Threadpool Completion Port)
29   89  1f1c 00..ab0   8009222 Disabled 00..a30:00..f88 00..7a0     0 MTA (Threadpool Completion Port)
30   ac  2340 00..f70   8009220 Disabled 00..d00:00..d08 00..7a0     1 MTA (GC) (Threadpool Completion Port)
31   88  1b64 00..fd0   8009220 Enabled  00..b28:00..b48 00..7a0     0 MTA (Threadpool Completion Port)
Here you can see several threads that have preemptive GC disabled (threads 26, 27, 28 and 29) and one (thread 30) that is waiting on those threads before it can perform the GC.
My Google-fu led me to this blog post which describes what sounds like a similar problem, only in my case there was no XML involved. It gave me enough information to know where to dig though, and eventually I discovered that one of the common features of the threads with preemptive GC disabled was a stack trace that looked like this at the top:
ntdll!NtWaitForSingleObject+a
ntdll!RtlpWaitOnCriticalSection+e8
ntdll!RtlEnterCriticalSection+d1
ntdll!RtlpLookupDynamicFunctionEntry+58
ntdll!RtlLookupFunctionEntry+a3
clr!JIT_TailCall+db
...
DebugDiag also warned me about the critical section, and it just so happens that the threads with JIT_TailCall on the stack are also the only threads in RtlEnterCriticalSection.
So my question is: Is it in fact the tail. prefix that is causing this deadlock? And if so: What can I do about it?
I can disable tailcalls in my .fsproj files, but it looks like at least one of these is coming from FSharp.Core.dll, and some spelunking in the decompiler seems to confirm the existence of the tail. prefix there. So I don't know that changing the project config will remove all of the tail. prefixes.
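For reference, the project-level switch looks roughly like this. It is a minimal sketch, assuming an MSBuild-style .fsproj; the Release condition is only illustrative and not taken from my actual projects. It corresponds to passing --tailcalls- to the F# compiler, and it only affects assemblies compiled from that project, so a precompiled binary such as FSharp.Core.dll keeps whatever tail. prefixes it was built with.

```xml
<PropertyGroup Condition="'$(Configuration)' == 'Release'">
  <!-- Ask the F# compiler not to emit the tail. IL prefix for this project -->
  <Tailcalls>false</Tailcalls>
</PropertyGroup>
```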
Has anyone dealt with something like this before?
Update: Some more info which could be useful.
Here is the output of !locks for this dump:
!locks
CritSec +401680 at 0000000000401680
WaiterWoken        No
LockCount          0
RecursionCount     1
OwningThread       2340
EntryCount         0
ContentionCount    bf
*** Locked
Scanned 1657 critical sections
Thread 2340 is the thread which has started the GC (thread 30 in the partial list I included above).
!syncblk only shows items owned by the ZooKeeper client (which, while annoying, is not involved in any of the stacks that are keeping the GC from starting):
!syncblk
Index         SyncBlock MonitorHeld Recursion Owning Thread Info          SyncBlock Owner
11 0000000019721a38            1         1 0000000019766e20 638   7   0000000000fb2950 System.Collections.Generic.LinkedList`1[[ZooKeeperNet.Packet, ZooKeeperNet]]
    Waiting threads:
18 0000000019721c68            1         1 000000001ae71420 8ac  13   00000000012defc8 System.Collections.Generic.LinkedList`1[[ZooKeeperNet.Packet, ZooKeeperNet]]
    Waiting threads:
-----------------------------
Total           64
CCW             0
RCW             0
ComClassFactory 0
Free            5
I doubt tailcalls are the problem (otherwise, I suspect lots more F# users would have hit this issue). From the call stack, it looks like your code is waiting on a critical section, which seems far more likely to be the source of the issue... Any idea what synchronization primitives your code might be relying on?
It's perhaps a bit late and although the problem you are describing seems a bit different from the one I had, the call trace you gave suggests there might be some common ground.
You can find more details in my answer to my own question, but in short it comes down to the combination of Windows 7 and .NET 4.0-4.5 that makes the tail recursion in F# problematic, causing excessive locking. Updating .NET to 4.6 or upgrading to Windows 8 solves the problem.
Additionally, because you are having a problem with garbage collection, you might want to look at using server garbage collection. This is one of the things I did before finding the above issue, and it solved a big part of the performance issues we were experiencing. All that is needed is the following in your app.config:
<configuration>
  ...
  <runtime>
    ...
    <gcServer enabled="true"/>
    ...
  </runtime>
  ...
</configuration>