How I Shot My Own Self in the Foot with Semaphores….

…and the lessons I’ve learned. (Seemed like it needed a continuation, yeah?)

So, one of my many (now backlogged) personal projects is to write a Windows Service that copies files from a source directory to a target directory (or directories) and then encrypts the files via the LSA account (if so configured in the app config file). The idea was bourne out of two premises:

  1. As someone who doesn’t currently have a cloud provider, I still need disaster recovery and, if I’m in this position, surely there’s at least one other person in the world in the same boat. Maybe.
  2. This is, perhaps, the more crucial part. We take for granted that we have pretty fat pipes with which to fit all of the cats that we could ever want into the tubes and still have room left over. Other areas of the world still don’t have this luxury. What opportunities are available to them, now that we’re in a mobile-first, cloud-first world?

So, since we’re dealing with disks, I wanted to prevent it being an attack surface and/or prevent disk queuing or thrashing. If you’ve ever seen my project for the Windows Service to download the Bing Images, then you’re already familiar with my preferred route of semaphore use, which is the SemaphoreSlim.

So, I write my service, compile it (via Visual Studio), and install on my machine (via InstallUtil). I fire it up and drop a file into the source folder and, when the timer hits, it copies and encrypts the files to the destinations. O.k., so far, so good. I add a second file and nothing happens. Wait a few more minutes, still nothing.

O.k., I shot myself in the foot, somewhere, but nowhere I need to solve the crime, where I’m also the murderer; which isn’t, necessarily, the easiest task in the world.

So, I check task manager. The service is still running, so we didn’t hit a second-chance exception and the service terminated. O.k., I’m not a total idiot, at least (I hope).

I check the Wait Chain Traversal in Task Manager and it says the process is running fine, nothing is blocked.

O.k…. That makes even less sense. Time to debug it. I create a dump of the process in task manager and crack it open in WinDbg.

I’m using a tool called PDE, which you can get from the public OneDrive shared by Andrew Richards here. I’ll be using it to look for a thread with the most amount of frames. Why? Well, just because nothing is happening, it doesn’t mean that nothing is actually happening.

And, wouldn’t you know it, it looks like a dead-lock:

0:000> !pde.deep 50

6 TID:0718 kb kbn kbnL kn knL kpn kPn
# Child-SP RetAddr Call Site
00 00000040`6d9fd158 00007ff8`a198c7ee ntdll!NtWaitForMultipleObjects+0x14
01 00000040`6d9fd160 00007ff8`9720264e KERNELBASE!WaitForMultipleObjectsEx+0xfe
02 00000040`6d9fd460 00007ff8`972024a8 clr!WaitForMultipleObjectsEx_SO_TOLERANT+0x62
03 00000040`6d9fd4c0 00007ff8`972022a1 clr!Thread::DoAppropriateWaitWorker+0x1e4
04 00000040`6d9fd5c0 00007ff8`971ed336 clr!Thread::DoAppropriateWait+0x7d
05 00000040`6d9fd640 00007ff8`9717703f clr!CLREventBase::WaitEx+0xc4
06 00000040`6d9fd6d0 00007ff8`9717700c clr!Thread::Block+0x27
07 00000040`6d9fd700 00007ff8`97176db1 clr!SyncBlock::Wait+0x19d
08 00000040`6d9fd840 00007ff8`8a05bd08 clr!ObjectNative::WaitTimeout+0xe1
09 00000040`6d9fd9c0 00007ff8`8a05bac9 mscorlib_ni!System.Threading.SemaphoreSlim.WaitUntilCountOrTimeout(Int32, UInt32, System.Threading.CancellationToken)+0x68
0a 00000040`6d9fda10 00007ff8`8a05b872 mscorlib_ni!System.Threading.SemaphoreSlim.Wait(Int32, System.Threading.CancellationToken)+0x149
0b 00000040`6d9fdaa0 00007ff8`37a41a12 mscorlib_ni!System.Threading.SemaphoreSlim.Wait()+0x12
0c 00000040`6d9fdad0 00007ff8`8a071333 Felsokning_Services_ShadowService!Felsokning.Services.ShadowService.ShadowService+<>c__DisplayClass35_2.b__4(System.String)+0x3d2
0d 00000040`6d9fdc40 00007ff8`8a060680 mscorlib_ni!System.Threading.Tasks.Parallel+<>c__DisplayClass17_0`1[[System.__Canon, mscorlib]].b__1()+0x293
0e 00000040`6d9fdd00 00007ff8`8a12a540 mscorlib_ni!System.Threading.Tasks.Task.InnerInvokeWithArg(System.Threading.Tasks.Task)+0x20
0f 00000040`6d9fdd30 00007ff8`896a5ab3 mscorlib_ni!System.Threading.Tasks.Task+<>c__DisplayClass176_0.b__0(System.Object)+0x110
10 00000040`6d9fdda0 00007ff8`896a5944 mscorlib_ni!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+0x163
11 00000040`6d9fde70 00007ff8`896ef565 mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+0x14
12 00000040`6d9fdea0 00007ff8`896eec93 mscorlib_ni!System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)+0x215
13 00000040`6d9fdf50 00007ff8`896f3011 mscorlib_ni!System.Threading.Tasks.Task.ExecuteEntry(Boolean)+0x73
14 00000040`6d9fdf90 00007ff8`896f2eee mscorlib_ni!System.Threading.Tasks.ThreadPoolTaskScheduler.TryExecuteTaskInline(System.Threading.Tasks.Task, Boolean)+0x31
15 00000040`6d9fdfe0 00007ff8`8a05f8d7 mscorlib_ni!System.Threading.Tasks.TaskScheduler.TryRunInline(System.Threading.Tasks.Task, Boolean)+0x19e
16 00000040`6d9fe080 00007ff8`8a06c938 mscorlib_ni!System.Threading.Tasks.Task.InternalRunSynchronously(System.Threading.Tasks.TaskScheduler, Boolean)+0xa7
17 00000040`6d9fe110 00007ff8`8a06dbb7 mscorlib_ni!System.Threading.Tasks.Parallel.ForWorker[[System.__Canon, mscorlib]](Int32, Int32, System.Threading.Tasks.ParallelOptions, System.Action`1, System.Action`2, System.Func`4, System.Func`1, System.Action`1)+0x428
18 00000040`6d9fe1f0 00007ff8`8a06d46e mscorlib_ni!System.Threading.Tasks.Parallel.ForEachWorker[[System.__Canon, mscorlib],[System.__Canon, mscorlib]](System.Collections.Generic.IEnumerable`1, System.Threading.Tasks.ParallelOptions, System.Action`1, System.Action`2, System.Action`3, System.Func`4, System.Func`5, System.Func`1, System.Action`1)+0xd7
19 00000040`6d9fe2b0 00007ff8`37a4155a mscorlib_ni!System.Threading.Tasks.Parallel.ForEach[[System.__Canon, mscorlib]](System.Collections.Generic.IEnumerable`1, System.Action`1)+0x9e
1a 00000040`6d9fe350 00007ff8`8a071333 Felsokning_Services_ShadowService!Felsokning.Services.ShadowService.ShadowService.b__35_3(System.String)+0xea
1b 00000040`6d9fe3e0 00007ff8`8a060680 mscorlib_ni!System.Threading.Tasks.Parallel+<>c__DisplayClass17_0`1[[System.__Canon, mscorlib]].b__1()+0x293
1c 00000040`6d9fe4a0 00007ff8`8a12a540 mscorlib_ni!System.Threading.Tasks.Task.InnerInvokeWithArg(System.Threading.Tasks.Task)+0x20
1d 00000040`6d9fe4d0 00007ff8`896a5ab3 mscorlib_ni!System.Threading.Tasks.Task+<>c__DisplayClass176_0.b__0(System.Object)+0x110
1e 00000040`6d9fe540 00007ff8`896a5944 mscorlib_ni!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+0x163
1f 00000040`6d9fe610 00007ff8`896ef565 mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+0x14
20 00000040`6d9fe640 00007ff8`896eec93 mscorlib_ni!System.Threading.Tasks.Task.ExecuteWithThreadLocal(System.Threading.Tasks.Task ByRef)+0x215
21 00000040`6d9fe6f0 00007ff8`896f3011 mscorlib_ni!System.Threading.Tasks.Task.ExecuteEntry(Boolean)+0x73
22 00000040`6d9fe730 00007ff8`896f2eee mscorlib_ni!System.Threading.Tasks.ThreadPoolTaskScheduler.TryExecuteTaskInline(System.Threading.Tasks.Task, Boolean)+0x31
23 00000040`6d9fe780 00007ff8`8a05f8d7 mscorlib_ni!System.Threading.Tasks.TaskScheduler.TryRunInline(System.Threading.Tasks.Task, Boolean)+0x19e
24 00000040`6d9fe820 00007ff8`8a06c938 mscorlib_ni!System.Threading.Tasks.Task.InternalRunSynchronously(System.Threading.Tasks.TaskScheduler, Boolean)+0xa7
25 00000040`6d9fe8b0 00007ff8`8a06dc4f mscorlib_ni!System.Threading.Tasks.Parallel.ForWorker[[System.__Canon, mscorlib]](Int32, Int32, System.Threading.Tasks.ParallelOptions, System.Action`1, System.Action`2, System.Func`4, System.Func`1, System.Action`1)+0x428
26 00000040`6d9fe990 00007ff8`8a06d46e mscorlib_ni!System.Threading.Tasks.Parallel.ForEachWorker[[System.__Canon, mscorlib],[System.__Canon, mscorlib]](System.Collections.Generic.IEnumerable`1, System.Threading.Tasks.ParallelOptions, System.Action`1, System.Action`2, System.Action`3, System.Func`4, System.Func`5, System.Func`1, System.Action`1)+0x16f
27 00000040`6d9fea50 00007ff8`37a41376 mscorlib_ni!System.Threading.Tasks.Parallel.ForEach[[System.__Canon, mscorlib]](System.Collections.Generic.IEnumerable`1, System.Action`1)+0x9e
28 00000040`6d9feaf0 00007ff8`896a5ab3 Felsokning_Services_ShadowService!Felsokning.Services.ShadowService.ShadowService.StartShadowCopyWork(System.Object)+0x246
29 00000040`6d9fec00 00007ff8`896a5944 mscorlib_ni!System.Threading.ExecutionContext.RunInternal(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+0x163
2a 00000040`6d9fecd0 00007ff8`896db0ba mscorlib_ni!System.Threading.ExecutionContext.Run(System.Threading.ExecutionContext, System.Threading.ContextCallback, System.Object, Boolean)+0x14
2b 00000040`6d9fed00 00007ff8`896daeb7 mscorlib_ni!System.Threading.TimerQueueTimer.CallCallback()+0xba
2c 00000040`6d9fed60 00007ff8`89695b05 mscorlib_ni!System.Threading.TimerQueueTimer.Fire()+0x87
2d 00000040`6d9fedd0 00007ff8`97046da3 mscorlib_ni!System.Threading.TimerQueue.FireNextTimers()+0x75
2e 00000040`6d9fee50 00007ff8`97046c58 clr!CallDescrWorkerInternal+0x83
2f 00000040`6d9fee90 00007ff8`9704751d clr!CallDescrWorkerWithHandler+0x4e
30 00000040`6d9feed0 00007ff8`97049d14 clr!MethodDescCallSite::CallTargetWorker+0xf8
31 00000040`6d9fefd0 00007ff8`97047ce1 clr!AppDomainTimerCallback_Worker+0x34
32 00000040`6d9ff0c0 00007ff8`97047c50 clr!ManagedThreadBase_DispatchInner+0x39
33 00000040`6d9ff100 00007ff8`97047b8d clr!ManagedThreadBase_DispatchMiddle+0x6c
34 00000040`6d9ff200 00007ff8`97047d1f clr!ManagedThreadBase_DispatchOuter+0x75
35 00000040`6d9ff290 00007ff8`97049cb6 clr!ManagedThreadBase_FullTransitionWithAD+0x2f
36 00000040`6d9ff2f0 00007ff8`97049c16 clr!AppDomainTimerCallback+0x7c
37 00000040`6d9ff350 00007ff8`9704975f clr!ThreadpoolMgr::AsyncTimerCallbackCompletion+0x7e
38 00000040`6d9ff3a0 00007ff8`9704835c clr!UnManagedPerAppDomainTPCount::DispatchWorkItem+0x1bc
39 00000040`6d9ff440 00007ff8`97048105 clr!ThreadpoolMgr::ExecuteWorkRequest+0x64
3a 00000040`6d9ff470 00007ff8`971fefef clr!ThreadpoolMgr::WorkerThreadStart+0xf5
3b 00000040`6d9ff510 00007ff8`a43181f4 clr!Thread::intermediateThreadProc+0x86
3c 00000040`6d9ffbd0 00007ff8`a538a251 kernel32!BaseThreadInitThunk+0x14
3d 00000040`6d9ffc00 00000000`00000000 ntdll!RtlUserThreadStart+0x21

Threads: 1 of 13
Frames : 50
Command: knL

Right, so what’s the state of the SemaphoreSlim, itself?

0:006> !mdt 000001f45a4f3200
000001f45a4f3200 (System.Threading.SemaphoreSlim)
    m_currentCount:0x0 (System.Int32)
    m_maxCount:0x7fffffff (System.Int32)
    m_waitCount:0x6 (System.Int32)
    m_lockObj:000001f45a4f3240 (System.Object)
    m_waitHandle:NULL (System.Threading.ManualResetEvent)
    m_asyncHead:NULL (System.Threading.SemaphoreSlim+TaskNode)
    m_asyncTail:NULL (System.Threading.SemaphoreSlim+TaskNode)

The properties that we care about are the m_currentCount and the m_waitCount. We can look at the source code for SemaphoreSlim and see that these signify what the problem is:

m_currentCount – The semaphore count, initialized in the constructor to the initial value, every release call increments it and every wait call decrements it as long as its value is positive otherwise the wait will block. Its value must be between the maximum semaphore value and zero.

m_waitCount – The number of synchronously waiting threads, it is set to zero in the constructor and increments before blocking the threading and decrements it back after that. It is used as flag for the release call to know if there are waiting threads in the monitor or not.

So, here’s what happened: I leveraged Parallel.ForEach and for each disk action I required the caller to request wait on the semaphore, until it completes and releases the semaphore. The Semaphore’s initial value was stupid-low: 10. So, as soon as the second file was added, we created a condition in which multiple threads (because of parallelization) were requesting locks on the Semaphore and starving it out of being available for any further work.

In theory, this would be 1 lock for the source directory, 2 locks for the two target directories, 1 lock per file (so 2x2x2)… You can quickly see why this was a bad approach, even though it had the best intentions, at heart.

So, now, I’m refactoring the code to leverage threadpooling (or async tasks) and have definitely learned my lesson on resource starvation. 🙂

Thanks for coming to this TEDTalk and happy debugging!

Debugging One of the World’s Worst Error Messages

So, for a project for a friend with cancer, I was setting up something leveraging NextCloud (which another friend set up an account for me on their local instance for). It isn’t important who the friends were or what the project was but it’s slight backdrop into why I didn’t just use another program.

So, I installed the NextCloud desktop client on a Win10 (x64) machine and went to fire it up. Then came this gloriously (and mundanely) generic error message:

Markering_022

Very descriptive and actionable, yeah? (hint: LoadLibrary actually hints as to what’s going on, if you’ve run into assembly referencing before, but we’ll cover that later in this post.)

Well, the first thought is to go to the Event Viewer to see what happened. Only problem is: No events were logged. Anywhere. It never happened. It never even existed.

O.k., time to get medieval on this. Since the window never renders for the application and this seems to happen on initialisation, we’ll have to hook into the application to catch the event[s] in question.

Enter ADPlus. ADPlus has been around for hot-minute and isn’t, precisely, my go-to for debugging puposes; however, it’s definitely handy when you want to catch things happening on start-up.

So, we need to tell adplus to start and to monitor for an instance of the nextcloud application to loaded/referenced/scheduled by the kernel. (Littletid-bit here: The load happens before it’s reference is available in the process list, which is why we get the results you’ll see later-on.)

To tell adplus to “listen” for an instance of this application, we’ll do something like the following:

PS C:\Program Files (x86)\Windows Kits\10\Debuggers\x64> .\adplus.exe -Crash -pmn "nextcloud.exe" -o "C:\Dumps\NextCloud\"
Starting ADPlus
********************************************************
* *
* ADPLus Flash V 7.01.007 08/11/2011 *
* *
* For ADPlus documentation see ADPlus.doc *
* New command line options: *
* -pmn  - process monitor *
* waits for a process to start *
* -po  - optional process *
* won't fail if this process isn't running *
* -mss  *
* Sets Microsoft's symbol server *
* -r   *
* Runs -hang multiple times *
* *
* ADPlusManager - an additional tool to facilitate *
* the use of ADPlus in distributed environments like *
* computer clusters. *
* Learn about ADPlusManager in ADPlus.doc *
* *
********************************************************

Logs and memory dumps will be placed in C:\Dumps\NextCloud\\20190303_105927_Crash_Mode
Starting to monitor the following processes:
NEXTCLOUD.EXE
Press Enter to stop Monitoring...

Now, that we’ve set adplus to monitor, we fire-up nextcloud.exe. Adplus will attach and create a dump file for the first-chance exception (you can configure it for second-chance only).

So, now we have our dump file. We crack it open in Windbg and let’s look at the stack:

0:000> kncL
# Child-SP RetAddr Call Site
00 000000e2`704fdcc8 00007ff8`5c056db8 ntdll!NtTerminateProcess+0x14
01 000000e2`704fdcd0 00007ff8`5b35d3ba ntdll!RtlExitUserProcess+0xb8
02 000000e2`704fdd00 00007ff8`39a04e48 kernel32!ExitProcessImplementation+0xa
03 000000e2`704fdd30 00007ff8`39a0526e atig6pxx+0x4e48
04 000000e2`704fdef0 00007ff8`1289866c atig6pxx!DrvValidateVersion+0x16
05 000000e2`704fdf20 00007ff8`12898a7d opengl32!pgldrvLoadAndAllocDriverInfo+0x23c
06 000000e2`704fdfb0 00007ff8`128b2ca9 opengl32!LoadAvailableDrivers+0x39d
07 000000e2`704fe630 00007ff8`128b2055 opengl32!wglDescribePixelFormat+0x139
08 000000e2`704fe760 00007ff8`584d4c79 opengl32!wglChoosePixelFormat+0x85
09 000000e2`704fe810 00007ff8`1588eeb0 gdi32full!ChoosePixelFormat+0x39
0a 000000e2`704fe840 00007ff8`1588d540 qwindows+0x3eeb0
0b 000000e2`704fe990 00007ff8`1588eba6 qwindows+0x3d540
0c 000000e2`704fec50 00007ff8`1585c48b qwindows+0x3eba6
0d 000000e2`704fecf0 00007ff8`1585cdba qwindows+0xc48b
0e 000000e2`704fed50 00007ff8`1585ba5c qwindows+0xcdba
0f 000000e2`704fed80 00007ff8`11724d20 qwindows+0xba5c
10 000000e2`704fedf0 00007ff8`019b7770 Qt5Gui!QOpenGLContext::create+0x30
11 000000e2`704fee20 00007ff8`12d012da Qt5WebEngineCore!QtWebEngineCore::initialize+0x120
12 000000e2`704fee70 00007ff8`116ebd4d Qt5Core!QCoreApplicationPrivate::init+0x52a
13 000000e2`704fef30 00007ff8`131035cf Qt5Gui!QGuiApplicationPrivate::init+0x3d
14 000000e2`704ff190 00007ff8`130ff8bb Qt5Widgets!QApplicationPrivate::init+0xf
15 000000e2`704ff1c0 00007ff7`b1db1aa7 Qt5Widgets!QApplication::QApplication+0x5b
16 000000e2`704ff1f0 00007ff7`b1d039d4 nextcloud+0xc1aa7
17 000000e2`704ff300 00007ff7`b1cf1878 nextcloud+0x139d4
18 000000e2`704ff630 00007ff7`b1dc6524 nextcloud+0x1878
19 000000e2`704ff880 00007ff7`b1dc59c2 nextcloud+0xd6524
1a 000000e2`704ff910 00007ff8`5b3581f4 nextcloud+0xd59c2
1b 000000e2`704ff950 00007ff8`5c05a251 kernel32!BaseThreadInitThunk+0x14
1c 000000e2`704ff980 00000000`00000000 ntdll!RtlUserThreadStart+0x21

Normally, you might have to go frame by frame to understand what’s going on but since we don’t have the symbols for nextcloud or qwindow, we can kind of wing it and still figure out what’s going.

As the frames are read from bottom to top (in terms of processing order), the frames of interest to us will be 09 to 02 (which are in bold above). This shows us that GDI and/or OpenGL are looking for a driver that maps to a specific pixel format. We hit a frame in the ATI driver code that we can’t resolve to a method but then we see the call to terminate the process; so, we can probably discern that this is where the error screen is shown (the one with the LoadLibrary error).

So, if we going perusing around the interwebs, we should be able to confirm our suspicions and we’ll find this on MSDN, which has the following notation: “If the function fails, the return value is NULL. To get extended error information, call GetLastError.” Since this happens in native code and we’re unable to ascertain the method called in the ATI driver, it’s safe for us to assume that the ATI driver attempts to load a specified assembly and that assembly doesn’t exist. Also, since the exception is trapped and throw into the window, this is why the .excr record doesn’t exist on the stack. (Yay for native and scopes.)

Easiest solution: Install the nextcloud client on a different machine and do what I need to do or attempt to update the ATI video driver and hope that it resolves the issue. The former is a much more assured way to resolve it, though, so that’s what I did.

When developing programs, it’s important to consider that exceptions might occur in third-party dependencies. This mightn’t be the most productive error messages for users to figure out what’s going on. In fact, I whole-heartedly believe that if I hadn’t ever dealt with loading libraries or debugging, I’d have no clue as to why the program was broken.

This makes the error message one of the world’s crappiest error messages simply because it is inactionable by a user, if a user hasn’t the necessary tools and experience explained above.

So, then, what is the point of the error message?

I get the point, don’t get me wrong, in that lets someone know something went wrong. However, the general nature of the error message, itself, leads much to be desired. What library was it trying to load? Why did it fail? I’m sure I could eventually find a mapping for the NT Status errors that relate to this but why should I, as an end-user, have to resort to that level of depth just to figure out the problem?

The problem is shared between the application writers (nextcloud), GDI, OpenGL, and ATI. Each layer should expect an exception to be possible and to trap and raise it with the requisite information to make it actionable.

This is just… “Something happened. Good luck!”

Thanks for coming to this TEDTalk and I hope that it hasn’t bored you to death, despite it’s brevity.

Debugging: Incremental Mailbox Sync Move Requests Cause ‘StopAll’ to be called on the Assistants Infrastructure Which Causes CalendarNotificationAssistant to Stop Processing Text Message (SMS) Notifications

NOTE: This post – drafted, composed, written, and published by me – originally appeared on https://blogs.technet.microsoft.com/johnbai and is potentially (c) Microsoft.

Not too long ago, I debugged an issue with the Calendar Notification Assistant (CNA), wherein the Assistant would stop processing text message (SMS) notifications for events. The affected end-users noted that if they enabled text message (SMS) notifications in Outlook Web Access (OWA), the notifications would work for around 24 hours and then they would fall back into reproduction of the issue, in which the notifications stop altogether.

When I went looking in PowerShell, I could verify (via the Admin Audit Log) that the commands to set the configuration for SMS were ran and that it was enabled:

Search-AdminAuditLog -Cmdlets Set-TextMessagingAccount | Where{$_.ObjectModified -contains '<objectId>'} | FL


RunspaceId         : c7683058-b25c-4557-818d-546cf5f178c6
ObjectModified     : <objectId>
CmdletName         : Set-TextMessagingAccount
CmdletParameters   : {CountryRegionId, MobileOperatorId, NotificationPhoneNumber, Identity}
ModifiedProperties : {}
Caller             : <callerObjectId>
ExternalAccess     :
Succeeded          : True
Error              : None
RunDate            : 12/22/2015 1:15:17 PM
OriginatingServer  : <serverName>
Identity           : <identity>
IsValid            : True
ObjectState        : New

We could also see the messages being sent via the Message Tracking Log (MTL):

$MsgTrk | Sort TimeStamp | FT EventId,Source,Sender,Recipients,MessageSubject,TimeStamp -AutoSize

EventId Source      Sender                       Recipients                                  MessageSubject 	Timestamp
------- ------      ------                       ----------                                  -------------- 	---------
RECEIVE STOREDRIVER <senderEmailAddress> 	{[email protected]}                		12/17/2015 13:18:34
SEND    AGENT       <senderEmailAddress> 	{[email protected]}                		12/17/2015 13:18:34
RECEIVE STOREDRIVER <senderEmailAddress> 	{[email protected]}                		12/17/2015 13:19:46
SEND    AGENT       <senderEmailAddress> 	{[email protected]}                		12/17/2015 13:19:46
RECEIVE STOREDRIVER <senderEmailAddress> 	{[email protected]}                		12/17/2015 16:15:02
SEND    AGENT       <senderEmailAddress> 	{[email protected]}                		12/17/2015 16:15:02

Using this, I could tell the originator of the message: The Mailbox Assistants. This was done by the seeing that the client and the server were the same:

$MsgTrk[0] | FL


RunspaceId              : 13ad4581-25d9-4136-8ed1-a284b7e979db
Timestamp               : 12/21/2015 15:40:31
ClientIp                : <ipAddress>
ClientHostname          : <serverName>
ServerIp                : <ipAddress>
ServerHostname          : <serverName>
SourceContext           : 08D30340C2F43943
ConnectorId             :
Source                  : STOREDRIVER
EventId                 : RECEIVE
InternalMessageId       : 2429739
MessageId               : <messageId>
Recipients              : {[email protected]}
RecipientStatus         : {To}
TotalBytes              : 6386
RecipientCount          : 1
RelatedRecipientAddress :
Reference               :
MessageSubject          :
Sender                  : <emailAddress>
ReturnPath              : <emailAddress>
MessageInfo             : 04I:
MessageLatency          :
MessageLatencyType      : None
EventData               : {[MailboxDatabaseGuid, 347a971c-e8f5-45de-8e01-870ed4dc6901], [ItemEntryId, 00-00-00-00-1E-95-32-44-0A-EA-D6-47-A5-98-9B-F0-5F-89-B9-09-07-00-BE-9B-24-68-00-23-66-49-95-41-BB-56-C6-02-E3-D0-00-00-00-03-71-5C-00-00-75-FB-A6-9E-77-CF-BE-43-A8-2A-B9-7F-C1-34-9A-01-00-00-1C-C-91-D1-00-00]}

Once I knew this, I checked the Database Event Watermarks for any backlog:

<databaseName>\<serverName> CalendarNotificationAssistant             977234167 989106153    11871986

I now had enough data to assert that a dump of the process was necessary in order to see what’s going on.

At the time of the dump, despite the fact that the user was no longer receiving text message (SMS) notifications, the process was processing (for lack of a better turn-of-phrase at the moment) a two week old message in thread 21 in the stack:

0:021> !mdt 00000000036e1088
00000000036e1088 (Microsoft.Exchange.MailboxAssistants.Assistants.CalendarNotification.CalendarEvent)
    <CalendarItemIdentity>k__BackingField:00000000036e1208 (Microsoft.Exchange.Data.Storage.StoreObjectId)
    <CalendarItemOccurrenceIdentity>k__BackingField:00000000036e1048 (Microsoft.Exchange.Data.Storage.OccurrenceStoreObjectId)
    <CalendarItemType>k__BackingField:0x1 (Microsoft.Exchange.Data.Storage.CalendarItemType)
    <NormalizedSubject>k__BackingField:00000000036e0880 (System.String) Length=5, String="Lunch"
    <Location>k__BackingField:00000000036e0860 (System.String) 
    <ReminderIsSet>k__BackingField:true (System.Boolean)
    <ReminderTime>k__BackingField:(Microsoft.Exchange.ExchangeSystem.ExDateTime) VALTYPE (MT=000007ff002c5cb0, ADDR=00000000036e10d0)
    <ReminderMinutesBeforeStart>k__BackingField:0xf (System.Int32)
    <StartTime>k__BackingField:(Microsoft.Exchange.ExchangeSystem.ExDateTime) VALTYPE (MT=000007ff002c5cb0, ADDR=00000000036e10f0)
    <EndTime>k__BackingField:(Microsoft.Exchange.ExchangeSystem.ExDateTime) VALTYPE (MT=000007ff002c5cb0, ADDR=00000000036e1110)
    <FreeBusyStatus>k__BackingField:0x2 (Microsoft.Exchange.Data.Storage.BusyType)
    <ChangeHighlight>k__BackingField:0x0 (Microsoft.Exchange.Data.Storage.ChangeHighlightProperties)
    <AppointmentState>k__BackingField:0x0 (Microsoft.Exchange.Data.Storage.AppointmentStateFlags)
    <CreationRequestTime>k__BackingField:(Microsoft.Exchange.ExchangeSystem.ExDateTime) VALTYPE (MT=000007ff002c5cb0, ADDR=00000000036e1130)
    <ResponseType>k__BackingField:0x0 (Microsoft.Exchange.Data.Storage.ResponseType)
    <OldStartTime>k__BackingField:(System.Nullable`1) VALTYPE (MT=000007fef6207cc8, ADDR=00000000036e1150)
    <OldEndTime>k__BackingField:(System.Nullable`1) VALTYPE (MT=000007fef6207cc8, ADDR=00000000036e1178)

We can see the date of the item from the backing field:

0:021> !mdt 000007ff002c5cb0 00000000036e10d0
(Microsoft.Exchange.ExchangeSystem.ExDateTime) VALTYPE (MT=000007ff002c5cb0, ADDR=00000000036e10d0)
    timeZone:00000000014bb040 (Microsoft.Exchange.ExchangeSystem.ExTimeZone)
    universalTime:(System.DateTime) 2015/12/23 16:15:00.000 VALTYPE (MT=000007fef62482d0, ADDR=00000000036e10d8)
    localTime:(System.Nullable`1) VALTYPE (MT=000007fef6207cc8, ADDR=00000000036e10e0)

And we can see that it’s processing the item for the same user that lodged the complaint:

0:021> !mdt 0000000003761ad0
0000000003761ad0 (System.String) Length=129, String="<sameUserFromAbove>"

Finally, after perusing the dump for an hour or more, I found what I believed to be introducing the issue:

<ContextInfo>k__BackingField:00000000032fe908 (System.String) Length=7, String="StopAll"

Given this, it was time to go to the devs and figure out what was going on. After some back-and-forth and a conference call with both the Store and Calendaring devs, we arrived at what was causing the issue.

First, it’s important to understand that an incremental sync was created for the user but had not – as of the time of debugging – been completed:

Get-MoveRequestStatistics 957ca0a9-f84f-425e-a263-5b468e90d574 | FL *increm*


EffectiveIncrementalSyncInterval  : 1.00:00:00
ConfiguredIncrementalSyncInterval : 1.00:00:00

Since it syncs once a day, this aligns with the end-user’s statement regarding reconfiguration but we needed proof.

When we went and looked at the Mailbox Database Events for the mailbox in question, we saw the ‘MailboxMoveStarted’ event fire:

$eventz | FT Counter,ItemType,EventName,ObjectClass,Mailbox,CreateTime -AutoSize

  Counter ItemType   EventName          ObjectClass Mailbox CreateTime
  ------- --------   ---------          ----------- ------- ----------
974015558 MAPI_STORE MailboxMoveStarted                     1/25/2016 21:32:01
974683553 MAPI_STORE MailboxMoveStarted                     1/26/2016 21:41:10
975414298 MAPI_STORE MailboxMoveStarted                     1/27/2016 21:44:00
976136834 MAPI_STORE MailboxMoveStarted                     1/28/2016 21:52:10
976842768 MAPI_STORE MailboxMoveStarted                     1/29/2016 22:02:30
977056359 MAPI_STORE MailboxMoveStarted                     1/30/2016 22:03:45
977195259 MAPI_STORE MailboxMoveStarted                     1/31/2016 22:15:01

What’s important to note as a side-comment/take-away is that the 24-hour interval doesn’t initiate from the same start time. This is because the counter for the 24-hour interval starts when the last sync stopped.

When we matched this against a trace of the Assistant, we saw the following:

"2016/01/13-03:06:05.279","9444","2480","Assistants","EventDispatcher","Dispatcher for <userDisplayName> (957ca0a9-f84f-425e-a263-5b468e90d574) and Calendar Notification Assistant: Event handled successfully: Counter: 0x3B331E14; MailboxID: 957ca0a9-f84f-425e-a263-5b468e90d574; Mask: MailboxMoveStarted; Flags: Source; ExtendedFlags: None;Object Class: ; Created Time: 1/13/2016 11:06:04 AM;Item Type: MAPI_STOREItem EntryId: Parent entryId: Old Item entryId: Old parent entryId: SID: S-1-5-21-276227223-3107228690-1043224028-159164Client Type: MigrationDocument ID: 0" 
"85868820"        020B591B          Debug 2016/01/13-02:01:49.777 9444     2480            MailboxAssistants.Assistants.CalendarNotification Assistant           rmd----- caller: StopAll, subj: <subject>, usr: <objectId>c1, calItemId: RgAAAAAelTJECurWR6WYm/BfibkJBwC+myRoACNmSZVBu1bGAuPQAAAAA3FaAAB1+6aed8++Q6gquX/BNJoBAAAcyVmOAAAP, calItemOccId: , event_t: 01/13/2016 16:15:00, rmd_t: 01/13/2016 19:45:00, s_t: 01/13/2016 20:00:00, e_t: 01/13/2016 21:30:00
"8326743"         006B5AB0          Debug 2016/01/13-03:06:05.277 9444     2480            MailboxAssistants.Assistants.CalendarNotification UserSettings      Processing Disabling event for mailbox <objectId> on database 347a971c-e8f5-45de-8e01-870ed4dc6901

To elaborate on this bug and explain it further: Recall back in Exchange 2007 (E12) that mailbox moves caused the mailbox to be locked. When this occurred, it would’ve been necessary for any code that attempts to open the message store to stop processing the mailbox until the move is complete. Fast-forward a few years (nigh to a decade) and incremental sync move requests are a common, every day occurrence. When the incremental sync starts, there is no delineation between it and a regular move request, so a database event fires and this signals the assistant to stop processing the mailbox. When the text message (SMS) notification is reconfigured, it signals the Assistant to start processing on the mailbox. This explains the behavior seen by the end-users.

For those of you concerned about hitting this bug in the wild, we’ve only see it happen once and we’re using that instance to drive a change in the Assistant logic in all versions of Exchange (assuming all have the same bug in the logic).

Exchange 2010: AD Recipient Cache and Locking Threads in EdgeTransport – Or How We Figured Out Something Wasn’t Right With Mail Stuck in Submission Queue

NOTE: This post – drafted, composed, written, and published by me – originally appeared on https://blogs.technet.microsoft.com/johnbai and is potentially (c) Microsoft.

A while ago we had an issue, wherein SCOM fired for message queue length for messages in Submission. When we went to take a look, the messages were stuck in Submission and weren’t really going anywhere. Took a dump of EdgeTransport, paused Transport, and the queues eventually flushed.

Opened Windbg against the dump and here’s what was going on: Every time a flood of email would come, Transport would see that the local recipient cache for a group was out of date. Threads would spin-up (typically, per message) to update the recipient cache. Since this is an AD call, a lock is used. The lock is owned by the thread until it’s released, which means we need to wait for the LDAP call to complete before the next thread can enter the critical section. Repeat ad infinim. The issue that complicated this matter is that the refresh was for the same group of sixteen-thousand-plus users, each time.

In the dump, this is how I went down the rabbit hole. First, I checked for any locks that may be blocking in the process:

 

!locks

 

CritSec Wldap32!SelectLock1+0 at 000007fefdd5ada0

WaiterWoken        No

LockCount          0

RecursionCount     1

OwningThread       2a18

EntryCount         0

ContentionCount    2

*** Locked

 

I then checked the another thread to see if it was aware that there was contention (it was):

 

00 00000000`2b8fd4f8 000007fe`fd991430 ntdll!ZwWaitForMultipleObjects+0xa

01 00000000`2b8fd500 00000000`77731723 KERNELBASE!WaitForMultipleObjectsEx+0xe8

02 00000000`2b8fd600 000007fe`f830d0a5 kernel32!WaitForMultipleObjectsExImplementation+0xb3

03 00000000`2b8fd690 000007fe`f8311aa9 mscorwks!WaitForMultipleObjectsEx_SO_TOLERANT+0xc1

04 00000000`2b8fd730 000007fe`f8344e95 mscorwks!Thread::DoAppropriateAptStateWait+0x41

05 00000000`2b8fd790 000007fe`f8427d94 mscorwks!Thread::DoAppropriateWaitWorker+0x191

06 00000000`2b8fd890 000007fe`f83ee48a mscorwks!Thread::DoAppropriateWait+0x5c

07 00000000`2b8fd900 000007fe`f82f077d mscorwks!CLREvent::WaitEx+0xbe

08 00000000`2b8fd9b0 000007fe`f843e6ea mscorwks!AwareLock::EnterEpilog+0xc9

09 00000000`2b8fda80 000007fe`f83336db mscorwks!AwareLock::Enter+0x72

0a 00000000`2b8fdab0 000007fe`f88b308f mscorwks!AwareLock::Contention+0x1fb

0b 00000000`2b8fdb80 000007fe`f2ff2beb mscorwks!JITutil_MonContention+0xdf

0c 00000000`2b8fdd30 000007fe`f2ff318f Microsoft_Exchange_Data_Directory_ni!Microsoft.Exchange.Data.Directory.IsMemberOfProvider.AutoRefreshCache`3[[System.__Canon, mscorlib],[System.__Canon, mscorlib],[System.__Canon, mscorlib]].CreateAndCache(System.__Canon, System.__Canon, Boolean)+0x16cb

0d 00000000`2b8fddf0 000007fe`f324dd16 Microsoft_Exchange_Data_Directory_ni!Microsoft.Exchange.Data.Directory.IsMemberOfProvider.AutoRefreshCache`3[[System.__Canon, mscorlib],[System.__Canon, mscorlib],[System.__Canon, mscorlib]].GetValue(System.__Canon, System.__Canon)+0x192f

0e 00000000`2b8fdeb0 000007fe`f324e2a0 Microsoft_Exchange_Data_Directory_ni!Microsoft.Exchange.Data.Directory.IsMemberOfProvider.IsMemberOfResolver`1[[System.__Canon, mscorlib]].IsMemberOf(Microsoft.Exchange.Data.Directory.Recipient.ADRecipientSession, System.Guid, System.Guid)+0x25b6a6

0f 00000000`2b8fe090 000007fe`f0940c15 Microsoft_Exchange_Data_Directory_ni!Microsoft.Exchange.Data.Directory.IsMemberOfProvider.IsMemberOfResolver`1[[System.__Canon, mscorlib]].IsMemberOf(Microsoft.Exchange.Data.Directory.Recipient.ADRecipientSession, Microsoft.Exchange.Data.Directory.ADObjectId, System.__Canon)+0x25bfd0

10 00000000`2b8fe180 000007fe`e0fc7267 Microsoft_Exchange_Transport_ni!Microsoft.Exchange.Transport.RecipientAPI.AddressBookImpl.IsMemberOf(Microsoft.Exchange.Data.Transport.RoutingAddress, Microsoft.Exchange.Data.Transport.RoutingAddress)+0x135

11 00000000`2b8fe200 000007fe`f245c46a Microsoft_Exchange_MessagingPolicies_Rules_ni!Microsoft.Exchange.MessagingPolicies.Rules.MembershipChecker.Equals(System.String, System.String)+0x47

12 00000000`2b8fe250 000007fe`e0fd5584 Microsoft_Exchange_Net_ni!Microsoft.Exchange.MessagingPolicies.Rules.RuleUtils.CompareStringValues(System.Object, System.Object, Microsoft.Exchange.MessagingPolicies.Rules.IStringComparer)+0x7a

13 00000000`2b8fe2a0 000007fe`f2459efe Microsoft_Exchange_MessagingPolicies_Rules_ni!Microsoft.Exchange.MessagingPolicies.Rules.IsMemberOfPredicate.Evaluate(Microsoft.Exchange.MessagingPolicies.Rules.RulesEvaluationContext)+0x74

14 00000000`2b8fe2e0 000007fe`f2459efe Microsoft_Exchange_Net_ni!Microsoft.Exchange.MessagingPolicies.Rules.OrCondition.Evaluate(Microsoft.Exchange.MessagingPolicies.Rules.RulesEvaluationContext)+0x6e

15 00000000`2b8fe360 000007fe`f246a157 Microsoft_Exchange_Net_ni!Microsoft.Exchange.MessagingPolicies.Rules.OrCondition.Evaluate(Microsoft.Exchange.MessagingPolicies.Rules.RulesEvaluationContext)+0x6e

16 00000000`2b8fe3e0 000007fe`e0fd839f Microsoft_Exchange_Net_ni!Microsoft.Exchange.MessagingPolicies.Rules.RulesEvaluator.Run()+0x1b7

 

O.k., so if we’re update the cache, how many times are we doing it? Oh, I see…:

 

0:086> !findstack Microsoft_Exchange_Data_Directory_ni!Microsoft.Exchange.Data.Directory.IsMemberOfProvider.AutoRefreshCache`3 0

Thread 023, 2 frame(s) match

Thread 024, 2 frame(s) match

Thread 025, 2 frame(s) match

Thread 034, 2 frame(s) match

Thread 035, 2 frame(s) match

Thread 036, 2 frame(s) match

Thread 037, 2 frame(s) match

Thread 038, 2 frame(s) match

Thread 039, 2 frame(s) match

Thread 040, 2 frame(s) match

Thread 050, 2 frame(s) match

Thread 051, 2 frame(s) match

Thread 052, 2 frame(s) match

Thread 053, 2 frame(s) match

Thread 054, 2 frame(s) match

Thread 055, 2 frame(s) match

Thread 056, 2 frame(s) match

Thread 057, 2 frame(s) match

Thread 058, 2 frame(s) match

Thread 059, 2 frame(s) match

Thread 060, 2 frame(s) match

Thread 061, 2 frame(s) match

Thread 062, 2 frame(s) match

Thread 063, 2 frame(s) match

Thread 064, 2 frame(s) match

Thread 065, 2 frame(s) match

Thread 066, 2 frame(s) match

Thread 067, 2 frame(s) match

Thread 068, 2 frame(s) match

Thread 069, 2 frame(s) match

Thread 070, 2 frame(s) match

Thread 071, 2 frame(s) match

Thread 072, 2 frame(s) match

Thread 073, 2 frame(s) match

Thread 074, 2 frame(s) match

Thread 075, 2 frame(s) match

Thread 076, 2 frame(s) match

Thread 077, 2 frame(s) match

Thread 078, 2 frame(s) match

Thread 079, 2 frame(s) match

Thread 080, 2 frame(s) match

Thread 081, 2 frame(s) match

Thread 082, 2 frame(s) match

Thread 083, 2 frame(s) match

Thread 084, 2 frame(s) match

Thread 085, 2 frame(s) match

Thread 086, 2 frame(s) match

Thread 087, 2 frame(s) match

 

The thread with the lock has an LDAP call being made:

 

00 00000000`5443cfc8 000007fe`fce91e35 ntdll!ZwWaitForSingleObject+0xa

01 00000000`5443cfd0 000007fe`fce91f96 mswsock!SockWaitForSingleObject+0xe145

02 00000000`5443d050 000007fe`ff1f4efc mswsock!WSPSelect+0xac26

03 00000000`5443d1f0 000007fe`ff1f4e7d ws2_32!DPROVIDER::WSPSelect+0x3c

04 00000000`5443d230 000007fe`fdd13a19 ws2_32!select+0x75a

05 00000000`5443d330 000007fe`fdd199b5 Wldap32!DrainWinsock+0x417

06 00000000`5443d3c0 000007fe`fdd136ec Wldap32!LdapWaitForResponseFromServer+0x646

07 00000000`5443d480 000007fe`fdd1d967 Wldap32!ldap_result_with_error+0x279

08 00000000`5443d520 000007fe`f84abec7 Wldap32!ldap_result+0x74

*** WARNING: Unable to verify checksum for System.DirectoryServices.Protocols.ni.dll

09 00000000`5443d580 000007fe`f1c453ff mscorwks!DoNDirectCall__PatchGetThreadCall+0x7b

0a 00000000`5443d630 000007fe`f1c5c61b System_DirectoryServices_Protocols_ni!DomainBoundILStubClass.IL_STUB(IntPtr, Int32, Int32, System.DirectoryServices.Protocols.LDAP_TIMEVAL, IntPtr ByRef)+0x8f

0b 00000000`5443d720 000007fe`f1c59f79 System_DirectoryServices_Protocols_ni!System.DirectoryServices.Protocols.LdapConnection.ConstructResponse(Int32, System.DirectoryServices.Protocols.LdapOperation, System.DirectoryServices.Protocols.ResultAll, System.TimeSpan, Boolean)+0x2cb

 

It looks like we’re performing this query, at the time of the dump:

 

000000005443d538  00000000e0188bf8  System.String  “(&(memberOf=<GUID=20ff7cf3-3b7e-4767-8669-d2295d6fb0f3>)(!(msExchCU=*)))”

 

0:086> !do 00000000e0188bf8

Name: System.String

MethodTable: 000007fef7747d90

EEClass: 000007fef734e560

Size: 170(0xaa) bytes

GC Generation: 2

(C:\Windows\assembly\GAC_64\mscorlib\2.0.0.0__b77a5c561934e089\mscorlib.dll)

String: (&(memberOf=<GUID=20ff7cf3-3b7e-4767-8669-d2295d6fb0f3>)(!(msExchCU=*)))

Fields:

              MT            Field           Offset                 Type VT             Attr            Value Name

000007fef774f000  4000096        8         System.Int32  1 instance               73 m_arrayLength

000007fef774f000  4000097        c         System.Int32  1 instance               72 m_stringLength

000007fef77497d8  4000098       10          System.Char  1 instance               28 m_firstChar

000007fef7747d90  4000099       20        System.String  0   shared           static Empty

                                 >> Domain:Value  00000000004e5e50:000000015fff0370 <<

000007fef7749688  400009a       28        System.Char[]  0   shared           static WhitespaceChars

                                 >> Domain:Value  00000000004e5e50:000000015fff0b90 <<

 

O.k., how many calls against this object are we making?:

 

0:086> !refs 00000000e0188bf8

This command will work faster with an SOSEX heap index.  To build an index, run !bhi.

Objects referenced by 00000000e0188bf8 (System.String):

NONE

 

Objects referencing 00000000e0188bf8 (System.String):

follow 00000000c0393040        80   System.Object[]

follow 00000000c0393108        96   System.Object[]

follow 00000000c04a1928        80   System.DirectoryServices.Protocols.SearchRequest

follow 00000000c04a2960        72   System.Object[]

follow 00000000c0689800        80   System.Object[]

follow 00000000c06898c8        96   System.Object[]

follow 00000000c07c3e68        80   System.DirectoryServices.Protocols.SearchRequest

follow 00000000c07c4de0        72   System.Object[]

follow 00000000c08ba200        80   System.Object[]

follow 00000000c08ba2c8        96   System.Object[]

follow 00000000c0976870        80   System.DirectoryServices.Protocols.SearchRequest

follow 00000000c09777e8        72   System.Object[]

follow 00000000c0fbaf58        80   System.Object[]

follow 00000000c0fbb020        96   System.Object[]

follow 00000000c103fa40        80   System.DirectoryServices.Protocols.SearchRequest

follow 00000000c1046fd0        72   System.Object[]

follow 00000000c113a868        80   System.Object[]

follow 00000000c113a930        96   System.Object[]

follow 00000000c11e75a0        80   System.DirectoryServices.Protocols.SearchRequest

follow 00000000c11e8518        72   System.Object[]

follow 00000000c13b1068        80   System.Object[]

follow 00000000c13b1130        96   System.Object[]

follow 00000000c1547260        80   System.DirectoryServices.Protocols.SearchRequest

follow 00000000c15481d8        72   System.Object[]

follow 00000000c1669338        80   System.Object[]

follow 00000000c1669400        96   System.Object[]

follow 00000000c175b778        80   System.DirectoryServices.Protocols.SearchRequest

follow 00000000c175c6f0        72   System.Object[]

follow 00000000c1904c80        80   System.Object[]

follow 00000000c1904d48        96   System.Object[]

follow 00000000c1acba70        80   System.DirectoryServices.Protocols.SearchRequest

follow 00000000c1acc9e8        72   System.Object[]

follow 00000000c1d03f08        80   System.Object[]

follow 00000000c1d03fd0        96   System.Object[]

follow 00000000c1f8f398        80   System.DirectoryServices.Protocols.SearchRequest

follow 00000000c1f90310        72   System.Object[]

follow 00000000c21bd2f0        80   System.Object[]

follow 00000000c21bd3b8        96   System.Object[]

follow 00000000c243d8a0        80   System.DirectoryServices.Protocols.SearchRequest

follow 00000000c243e818        72   System.Object[]

follow 00000000e0188510       176   Microsoft.Exchange.Data.Directory.ADPagedReader`1[[Microsoft.Exchange.Data.Directory.ADRawEntry, Microsoft.Exchange.Data.Directory]]

follow 00000000f096aea8        80   System.DirectoryServices.Protocols.SearchRequest

follow 00000001106ce9f8        80   System.DirectoryServices.Protocols.SearchRequest

 

CLR Thread 0x49

stack:000000005443d538

stack:000000005443d5d0

stack:000000005443d970

  

O.k., how many threads do we have waiting on this?:

 

0:086> !syncblk

Index         SyncBlock MonitorHeld Recursion Owning Thread Info          SyncBlock Owner

  148 000000004ef62350           95         1 000000004ef6a070  2a18  86   0000000080074a08 System.Object

            Waiting threads: 23 24 25 34 35 36 37 38 39 40 50 51 52 53 54 55 56 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 87

—————————–

Total           1548

CCW             0

RCW             0

ComClassFactory 0

Free            0

 

In the words of Strongbad: That is not a very small number. That is a very BIG number!

 

For those of you wondering what this means in common, everyday production use of Exchange:

 

Mail would come in, destined for ‘x’ recipient. Transport would see a rule that has the condition, IsMemberOf. Then, Transport would see that the local recipient cache for the group it needs to verify that the recipient is a member of is out-of-date. It would call a thread to update it. Since threads aren’t context-aware, this would occur for many of the recipients currently in the Submission Queue. The thread making the LDAP call would own the lock, as it entered the critical section. The other threads had to wait for the lock to be released to make their AD call. Since only one thread in the process’ bounds can own a lock, at a time, this means the threads could only complete in a one-by-one scenario. This is also why the mail couldn’t be categorized because the lock prevented the thread from validating IsMemberOf without a current cache.

 

For those of you concerned if you’ll run into this issue: This condition only reproduces in a single instance and that instance has only recurred twice, so far, during the timespan of a year. Rest assured, that even though the perceivable affected footprint of this issue is small, it is being investigated by the Transport developers.