DLNA: Network Shares Cause Blocking Threads in Windows Explorer During Copying Events

TL;DR – DLNA shares have a bug which can cause Explorer to stop copying data across the share, nigh indefinitely. The easiest work-around is to restart Windows Explorer, delete the target file in the previous copy operation, and start anew.

I’ve discovered a bug that I can’t seem to get addressed because the assembly isn’t publicly documented, anywhere, but I figured that I would write about what happens to explain it to those of you who run into it.

First, we need to cover what DLNA (Digital Living Network Alliance) is. DLNA is a set of interoperability guidelines by which devices can share and stream media libraries with one another, without needing a proprietary library to communicate between them.

Plex Media Server is one such media streaming service built using the DLNA libraries for sharing resources. One of the DLNA features, in Windows, is that it appears as a Network Share/Location in Windows Explorer.

So, we have a Plex Media Server and it’s serving DLNA. The media on the server is browsable, as if it were a dedicated network share. If we treat it as such and copy from one location to another, this particular bug surfaces.

Microsoft has implemented DLNA in x86 and x64 processes via assemblies included in Windows. In particular, for this bug, we care about the mfnetcore.dll assembly, which can be found in both the System32 and the SysWOW64 folders.

Here’s a dump of the stack during repro:

81 TID:0e60
# Child-SP RetAddr Call Site
00 00000000`2fc8eb18 00007ffb`272683d3 ntdll!NtWaitForSingleObject+0x14
01 00000000`2fc8eb20 00007ffa`e0845d59 KERNELBASE!WaitForSingleObjectEx+0x93
02 00000000`2fc8ebc0 00007ffa`f135812f mfnetcore!MFGetSupportedDLNAProfileInfo+0xaa69
03 00000000`2fc8ec60 00007ffb`27ba03ed mfplat!CStreamOnMFByteStream::Read+0xef
04 00000000`2fc8ecb0 00007ffb`27ad01c5 windows_storage!SHCopyStreamWithProgress2+0x1ad
05 00000000`2fc8ed90 00007ffb`27ad03ce windows_storage!CCopyOperation::_CopyResourceStreams+0x89
06 00000000`2fc8ee00 00007ffb`278e50df windows_storage!CCopyOperation::_CopyResources+0x17e
07 00000000`2fc8eea0 00007ffb`276f784f windows_storage!CCopyOperation::Do+0x1b5cbf
08 00000000`2fc8efa0 00007ffb`276f5d4f windows_storage!CCopyWorkItem::_DoOperation+0x9b
09 00000000`2fc8f080 00007ffb`276f657a windows_storage!CCopyWorkItem::_SetupAndPerformOp+0x2a3
0a 00000000`2fc8f370 00007ffb`276f2f1e windows_storage!CCopyWorkItem::ProcessWorkItem+0x152
0b 00000000`2fc8f620 00007ffb`276f3907 windows_storage!CRecursiveFolderOperation::Do+0x1be
0c 00000000`2fc8f6c0 00007ffb`276f33d6 windows_storage!CFileOperation::_EnumRootDo+0x277
0d 00000000`2fc8f760 00007ffb`276fd25c windows_storage!CFileOperation::PrepareAndDoOperations+0x1c6
0e 00000000`2fc8f830 00007ffb`2874c525 windows_storage!CFileOperation::PerformOperations+0x10c
0f 00000000`2fc8f890 00007ffb`2874acf0 shell32!CFSDropTargetHelper::_MoveCopyHIDA+0x269
10 00000000`2fc8f940 00007ffb`2874d517 shell32!CFSDropTargetHelper::_Drop+0x220
11 00000000`2fc8fe20 00007ffb`29b6c315 shell32!CFSDropTargetHelper::s_DoDropThreadProc+0x37
12 00000000`2fc8fe50 00007ffb`2ab17974 SHCore!_WrapperThreadProc+0xf5
13 00000000`2fc8ff30 00007ffb`2aeba271 kernel32!BaseThreadInitThunk+0x14
14 00000000`2fc8ff60 00000000`00000000 ntdll!RtlUserThreadStart+0x21

Note that the frames we generally care about are at the top of the stack, since those are the most recent calls made on the thread. In this case, Explorer’s copy operation is reading from the stream (CStreamOnMFByteStream::Read), which calls into mfnetcore.dll, and mfnetcore.dll is waiting on an object that never gets signaled. The debugger resolves that frame to MFGetSupportedDLNAProfileInfo, but the large offset (+0xaa69) suggests this is only the nearest exported symbol and that the code actually waiting is an internal function for which there are no public symbols. Essentially, we have an open, blocking request that has never completed, and the thread will have to die to unblock the request.

We can see the block happen on other native threads as well. Specifically, in the dump that I created, there were three threads with the same stack, shown below.
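To make the difference between an unbounded wait and a bounded one concrete, here is a minimal Python sketch. It is only an analogy: I don’t know what timeout mfnetcore.dll actually passes to WaitForSingleObjectEx, and the names wait_for_data and never_signaled are hypothetical, made up for this illustration.

```python
import threading


def wait_for_data(data_ready, timeout_s=None):
    """Block until data_ready is set; return False if timeout_s elapses first.

    timeout_s=None waits forever, which is how the hung copy thread behaves
    from the user's point of view.
    """
    return data_ready.wait(timeout=timeout_s)


# A response that never arrives: nothing ever calls never_signaled.set().
never_signaled = threading.Event()

# With a bounded wait, the caller regains control and can fail the copy
# instead of hanging. Calling wait_for_data(never_signaled) with no timeout
# would block this thread indefinitely.
timed_out = not wait_for_data(never_signaled, timeout_s=0.1)
print("timed out:", timed_out)
```

The point of the sketch is that the waiting thread cannot recover on its own: something else has to signal the object, or the wait has to be bounded.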

97 TID:3cdc
# Child-SP RetAddr Call Site
00 00000000`3448f9e8 00007ffb`299bf5cd win32u!NtUserMsgWaitForMultipleObjectsEx+0x14
01 00000000`3448f9f0 00007ffa`f88e2cfd user32!RealMsgWaitForMultipleObjectsEx+0x1d
02 00000000`3448fa30 00007ffa`f88e2c24 duser!CoreSC::Wait+0x75
03 00000000`3448fa80 00007ffb`299d05d1 duser!MphWaitMessageEx+0x104
04 00000000`3448fae0 00007ffb`2aef33c4 user32!_ClientWaitMessageExMPH+0x21
05 00000000`3448fb30 00007ffb`26f51224 ntdll!KiUserCallbackDispatcherContinue
06 00000000`3448fb98 00007ffa`fad54913 win32u!NtUserWaitMessage+0x14
07 00000000`3448fba0 00007ffa`fad547a9 explorerframe!CExplorerFrame::FrameMessagePump+0x153
08 00000000`3448fc20 00007ffa`fad546f6 explorerframe!BrowserThreadProc+0x85
09 00000000`3448fca0 00007ffa`fad55a12 explorerframe!BrowserNewThreadProc+0x3a
0a 00000000`3448fcd0 00007ffa`fad670c2 explorerframe!CExplorerTask::InternalResumeRT+0x12
0b 00000000`3448fd00 00007ffb`2785b58c explorerframe!CRunnableTask::Run+0xb2
0c 00000000`3448fd40 00007ffb`2785b245 windows_storage!CShellTask::TT_Run+0x3c
0d 00000000`3448fd70 00007ffb`2785b125 windows_storage!CShellTaskThread::ThreadProc+0xdd
0e 00000000`3448fe20 00007ffb`29b6c315 windows_storage!CShellTaskThread::s_ThreadProc+0x35
0f 00000000`3448fe50 00007ffb`2ab17974 SHCore!_WrapperThreadProc+0xf5
10 00000000`3448ff30 00007ffb`2aeba271 kernel32!BaseThreadInitThunk+0x14
11 00000000`3448ff60 00000000`00000000 ntdll!RtlUserThreadStart+0x21

Work-arounds: This bug is pretty ugly and there aren’t a whole lot of work-arounds for it. One could wait for the thread to time out and abort on its own, which could take a considerable amount of time. The work-around that I typically opt for is to restart the Windows Explorer process via Task Manager, delete the partially copied target file, and try the copy again. Sure, it takes time, but it costs far less than waiting for the thread to become unblocked due to a timeout.
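If you hit this often, the first two steps of that work-around can be scripted. The following is a rough Python sketch, not a tested tool: reset_stuck_copy and its parameters are hypothetical names, and it assumes the standard taskkill command is used in place of Task Manager. The copy itself still has to be restarted by hand in Explorer, since the DLNA location is not an ordinary file system path.

```python
import os
import subprocess

# Standard Windows commands; /F forces termination, /IM matches by image name.
KILL_EXPLORER = ["taskkill", "/F", "/IM", "explorer.exe"]
START_EXPLORER = ["explorer.exe"]


def reset_stuck_copy(partial_target, run=subprocess.run, start=subprocess.Popen):
    """Restart Explorer and delete the partially copied target file.

    run and start are injectable so the function can be exercised
    without actually killing the shell.
    """
    # Killing Explorer ends the blocked copy thread along with the process.
    run(KILL_EXPLORER, check=False)

    # Remove the incomplete file so the next copy starts clean.
    if os.path.exists(partial_target):
        os.remove(partial_target)

    # Bring the shell back; the copy must then be started again manually.
    start(START_EXPLORER)


# Example call (the path is a placeholder, not a real file):
# reset_stuck_copy(r"D:\Media\partial-file.mkv")
```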

Netlogon: Cross-Forest Delayed Authentication Requests Cause Subsequent (and Continuous) Authentication Failures

NOTE: This post – drafted, composed, written, and published by me – originally appeared on https://blogs.technet.microsoft.com/johnbai and is potentially (c) Microsoft.

One of the longest debugging experiences I’ve had in Exchange, so far, was a code bug that exists in the Netlogon code. I hope to cover what this bug was, how it manifested, and the fix that was implemented by the Windows developer to resolve the issue. So, this is going to be a long one….

Picture (Worth 1,000 Words)

Netlogon sessions use RPC (remote-procedure call) sessions with domain controllers to communicate authentication requests to domains. In the cases of cross-forest authentication requests, regardless of the type of trust created (e.g.: one-way transitive, one-way intransitive, etc.), the cross-forest authentication requests are forwarded (via the trust) to the responsible domain. In this case, the responsible domains exist in the customer’s on-premises environments.

So, in the above example, you’ll see that the client communicates with the Café server. During authentication, the Café passes the request to the managed domain controller. The managed domain controller will see the trust and communicate the authentication request across the trust and receive only an NT status response back for the request from the customer’s domain controllers.

When Repro Cometh
The condition that causes repro to start occurring is when the local domain controller (in this case, the managed domain controller in the illustration above) is awaiting a response from the customer’s domain controller for an authentication request. In the authentication pipeline, if this request times out, it’s considered a retriable exception, which is important for later. Because this exception is retriable, the Café server doesn’t consider the authentication request to have failed. Also, during this same time, the Café server may build a new Netlogon session with a new domain controller, which is where our problem begins to surface.

The Netlogon code has a single object reference for the domain controller’s name for the current Netlogon session on the current RPC session it should be using. (If you’re familiar with native/unmanaged code, the reference to the domain controller’s name is a pointer to a wchar_t string.) But remember: We’ve not disposed of the previous session because the timeout is considered retriable. So, although Netlogon is meant to communicate on one session per RPC channel, we now have two Netlogon sessions with two RPC channels. The session that was never disposed of is in red and the new session is in green in the illustration above.

The Bug
The bug is that all subsequent authentication requests traverse the red authentication path but use the domain controller’s name that was obtained from the creation of the green authentication path (the domain controller’s name is supplied in the authentication request as is defined in the specifications). This causes all subsequent authentication requests to fail, no matter the destination forest, because the domain controller receives a request that it should not process.
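The mismatch is easier to see in a toy model. The Python sketch below is not the Netlogon code; every class, method, and server name in it (DomainController, NetlogonClient, DC-RED, DC-GREEN) is hypothetical. It only models the two facts described above: the old channel is kept because the timeout is retriable, while the single shared name is overwritten by the new session.

```python
class DomainController:
    """Toy DC that rejects requests addressed to a different DC name."""

    def __init__(self, name):
        self.name = name

    def authenticate(self, claimed_dc_name):
        if claimed_dc_name != self.name:
            return "STATUS_INVALID_COMPUTER_NAME"
        return "STATUS_SUCCESS"


class NetlogonClient:
    """Toy client with one channel and one shared DC-name reference."""

    def __init__(self):
        self.channel = None   # DC that requests are actually sent to
        self.dc_name = None   # single shared name placed in every request

    def connect(self, dc, keep_old_channel=False):
        # The name is always overwritten by the newest session.
        self.dc_name = dc.name
        # A retriable timeout leaves the old channel in place.
        if not (keep_old_channel and self.channel is not None):
            self.channel = dc

    def authenticate(self):
        return self.channel.authenticate(self.dc_name)


red = DomainController("DC-RED")
green = DomainController("DC-GREEN")

client = NetlogonClient()
client.connect(red)
before = client.authenticate()   # name and channel agree

# Timeout on the red session is treated as retriable, so it is not disposed.
client.connect(green, keep_old_channel=True)
after = client.authenticate()    # red channel, green name

print(before, after)
```

In this model, every request after the second connect goes to the red domain controller while carrying the green domain controller’s name, so every one of them is rejected until the state is reset.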

Verifying Repro
The best way to verify the repro of this bug is to look at the Netlogon logs. If you see 0xc0000122 (STATUS_INVALID_COMPUTER_NAME), then you’ve hit repro of this specific condition. In Exchange, this will bubble up via the app pool in IIS as a 401 Unauthorized (which makes chasing the bug a bit more complicated).
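Searching the log by hand gets tedious, so here is a small Python sketch that does a plain, case-insensitive search for the status code. It makes no assumptions about the layout of a log line. The function name find_invalid_computer_name is hypothetical, and the path in the usage comment assumes Netlogon debug logging is enabled and writing to its default location.

```python
STATUS_INVALID_COMPUTER_NAME = "0xc0000122"


def find_invalid_computer_name(lines):
    """Return (line_number, text) for each line mentioning 0xC0000122."""
    hits = []
    for number, line in enumerate(lines, start=1):
        # Compare in lower case so 0xC0000122 and 0xc0000122 both match.
        if STATUS_INVALID_COMPUTER_NAME in line.lower():
            hits.append((number, line.rstrip("\r\n")))
    return hits


# Example usage (assumed default log location):
# with open(r"C:\Windows\debug\netlogon.log", errors="replace") as log:
#     for number, text in find_invalid_computer_name(log):
#         print(number, text)
```

A single hit is not necessarily proof of this bug, but a continuous run of them across unrelated destination forests matches the pattern described above.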

The Fix
Windows dev determined that the best way to fix this was to tear down both the Netlogon and RPC sessions, regardless of current status. This has been verified as working in RS3 builds of Windows 10/Server 2016 and is currently being tested in RS1 builds of Windows 10/Server 2016.
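As a rough illustration of why that works, here is a self-contained toy model in Python. It is not the actual Windows change; FixedNetlogonClient, its methods, and the server names are all hypothetical. The only idea it demonstrates is that the channel and the name are always discarded together and rebuilt together, so they can never disagree.

```python
class FixedNetlogonClient:
    """Toy client that never lets the channel and the DC name diverge."""

    def __init__(self):
        self.channel = None   # name of the DC the RPC channel points at
        self.dc_name = None   # name placed in each authentication request

    def teardown(self):
        # Drop both pieces of state, regardless of their current status.
        self.channel = None
        self.dc_name = None

    def connect(self, dc_name):
        # Always start from a clean slate, even if the previous
        # failure was considered retriable.
        self.teardown()
        self.channel = dc_name
        self.dc_name = dc_name

    def is_consistent(self):
        return self.channel == self.dc_name


client = FixedNetlogonClient()
client.connect("DC-RED")

# The request to DC-RED times out; a new session is built with DC-GREEN.
client.connect("DC-GREEN")

print(client.channel, client.dc_name, client.is_consistent())
```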