This blog post is built directly from a customer-reported issue. As I helped investigate the source of the issue, I thought it would be of interest to a broader audience; hopefully you find it interesting as well.
Allow me to provide some history of the problem before I dive into extended details.
The SQL Server error log was flooded with the following pattern approximately every second. The customer reported lots of CLR_* wait types, and mini-dumps revealed heavy GC (garbage collection) activity taking place.
...
2013-01-22 08:07:44.91 spid37s AppDomain 890288 (mssqlsystemresource.dbo[ddl].890287) unloaded.
2013-01-22 08:07:45.73 spid31s AppDomain 890289 (mssqlsystemresource.dbo[ddl].890288) unloaded.
2013-01-22 08:07:46.41 spid34s AppDomain 890290 (mssqlsystemresource.dbo[ddl].890289) unloaded.
...
The domain name was always the same. Domain names are based on the database and object owner, and are given a new generation id each time they are loaded. The pattern shown above indicates that SQL Server has loaded and unloaded the mssqlsystemresource (dbo) application domain 890,000+ times. Ouch!
This is not the normal pattern, as you may have already imagined. The normal pattern is a domain loaded message paired with a matching unloaded message, separated by a reasonable length of time while the domain is in use. This error log was not showing any loaded messages, just the unloaded messages.
The other distinct difference that stood out was the text of the unloaded message. There are roughly six different unloaded messages that SQL Server can log, and all the others indicate a reason (out of memory, .NET exception, locking protocol violation, and so on). This one was just a generic unload with no reason given, which is exactly why the customer wanted to know the cause of the issue.
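To see the pattern without paging through the log by hand, the (undocumented) xp_readerrorlog procedure can filter the error log entries. This is a minimal sketch, not from the original investigation; parameter 0 means the current log and 1 means the SQL Server error log, with the string acting as a search filter.

EXEC master..xp_readerrorlog 0, 1, N'AppDomain';
-- In a healthy log each unloaded entry has a matching, earlier loaded entry;
-- in this case only the generic unloaded messages were present.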
Note: The stack traces and debugging activities are using public symbols and a SQL Server 2008 build. http://support.microsoft.com/kb/311503
SQL Server AppDomain Loaded Message
SQL Server does not log the application domain loaded message until the domain has been both loaded and initialized. Loading the domain and initializing its CLR interfaces are two distinct states. This is important because an application domain can be loaded within the CLR runtime but not yet fully initialized, and in that window SQL Server does NOT record the application domain loaded message in the error log.
My Theory
Based on the behavior, I had a hunch that a query cancellation or error was occurring during the loading of the application domain. SQL Server never reached full initialization, so it was starting the domain part way up and then tearing it down (unloading) to clean up properly.
After asking the customer what they were doing with CLR and looking at some traces (.TRC files), I found the geography data type was in use, and its supporting assembly is associated with a system database.
I was also able to use sys.dm_os_ring_buffers to get information about the application domain state machine changes. I found the domains were transitioning from creating to unloaded within a few milliseconds.
Ring Buffer: RING_BUFFER_CLRAPPDOMAIN
Ring Buffer: RING_BUFFER_EXCEPTION
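A minimal sketch of a query that surfaces both of these ring buffers (my addition; the ms_ticks arithmetic only approximates each record's wall-clock time, and the XML layout varies by build):

SELECT DATEADD(ms, rb.[timestamp] - osi.ms_ticks, GETDATE()) AS approx_time,
       rb.ring_buffer_type,
       CAST(rb.record AS xml) AS record_xml
FROM sys.dm_os_ring_buffers AS rb
CROSS JOIN sys.dm_os_sys_info AS osi
WHERE rb.ring_buffer_type IN (N'RING_BUFFER_CLRAPPDOMAIN', N'RING_BUFFER_EXCEPTION')
ORDER BY rb.[timestamp] DESC;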
SQL Server handles the vast majority of error conditions by throwing a custom C++ exception type (the SQL exception). The exception holds a major and minor error code along with other information associated with the condition. You can use the formula (Major * 100) + Minor to build the SQL Server error code, or message_id, as provided in sys.messages.
For a query cancellation this is the internal error 3617, that is, Major = 36 and Minor = 17.
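You can verify the arithmetic directly against sys.messages; a quick check (not from the original post):

SELECT message_id, severity, [text]
FROM sys.messages
WHERE message_id = (36 * 100) + 17   -- 3617, the internal query cancellation error discussed above
  AND language_id = 1033;            -- us_english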
The stack associated with the cancellation, raised during the initialization of the application domain, is the following.
sqlservr!clr_ex_raise <------- Major = 36, Minor = 17
sqlservr!CAppDomain::CreateManagedDomain
sqlservr!CAppDomain::InitExpensive
sqlservr!CAppDomainManager::GetAppDomain
sqlservr!CCLRHost::GetAppDomain
sqlservr!CAssemblyMetaInfo::GetAppDomainForVerification
sqlservr!CAssemblyMetaInfo::CreateClrInterfaces
sqlservr!CAssemblyMetaInfo::InitClrInterfaces
sqlservr!CAssemblyMetaInfo::LoadAssemblyFromDatabase
sqlservr!CAssemblyMetaInfo::LoadAssemblyFromDatabase
Stack frames are associated with the CLRAPPDOMAIN ring buffer entries, and from them I was able to see where the unload was getting triggered.
sqlservr!AppDomainRingBufferRecord::StoreRecord+0x9c
sqlservr!CAppDomain::StateTransition+0xcf
sqlservr!CAppDomainManager::AppDomainStateTransitionLockHeld+0xa5
sqlservr!CAppDomainManager::AppDomainStateTransition+0x30
sqlservr!CAppDomain::UnloadManaged+0x22
sqlservr!CAppDomain::Release+0xd3
sqlservr!CAutoRefc<CAppDomain>::~CAutoRefc<CAppDomain>+0x914a3f
sqlservr!CAssemblyMetaInfo::CreateClrInterfaces+0x135
sqlservr!CAssemblyMetaInfo::InitClrInterfaces+0x7b
sqlservr!CAssemblyMetaInfo::LoadAssemblyFromDatabase+0x144
sqlservr!CAssemblyMetaInfo::LoadAssemblyFromDatabase+0xa2
sqlservr!ResolveUdf+0x2c4
sqlservr!CAlgUtils::TrpGetExpressionPropsAlg+0x523
sqlservr!CAlgUtils::TrpGetExpressionPropsWithHandler+0x112
sqlservr!udf::FBindObject+0x5d
sqlservr!udf::FBind+0x2b3
A few key aspects of this stack are worth noting:
- FBind – Used during compile, so we are still compiling the query. This is not an execution/runtime issue, which means a database clone can be used to reproduce the problem.
- LoadAssemblyFromDatabase – Loads the assembly, which can create the application domain, and in this case it is doing just that.
- CreateClrInterfaces – Doing the initialization work (we have not printed the AppDomain loaded message yet).
- UnloadManaged – Triggers an asynchronous application domain unload. The actual unload occurs on a system thread, which is exactly why the unloaded messages are logged by a system (s) spid.
Now I needed to exercise this code path to validate my findings. I knew that if I restarted SQL Server the application domain would not be loaded and the plan would not be in the procedure cache.
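If you want to verify that state after the restart, a minimal sketch (my addition; sys.dm_clr_appdomains only reports domains that are currently loaded) is:

SELECT appdomain_id, appdomain_name, [state], creation_time
FROM sys.dm_clr_appdomains;   -- after a restart, expect no row for the system-database domain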
A coworker of mine (BillHol) assisted in creating a simple function that causes the assembly backing the geography data type (SqlGeography) to be loaded under the mssqlsystemresource.dbo application domain. (In SQL Server 2012 this loads under master.dbo.)
use tempdb
go
CREATE FUNCTION dbo.funcBufferGeography(@p1 geography, @p2 float)
RETURNS geography
AS
BEGIN
DECLARE @g geography;
DECLARE @distance float;
SELECT @g = @p1.STAsText();
RETURN (@g);
END;
GO
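Issuing a single call like the one below (the same statement the OSTRESS run uses later) forces the geography assembly and its application domain to load; with no breakpoint in place the loaded message should appear in the error log as expected. Restart SQL Server afterward so the domain is unloaded again before attempting the repro.

DECLARE @garg geography = 'LINESTRING(3 4, 8 11)';
SELECT dbo.funcBufferGeography(@garg, 1.1);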
I then went to the debugger and set a breakpoint. I wanted to create a simulated stall during the interface creation to see whether a query cancellation (attention) during this phase of the compile would reproduce the pattern. The breakpoint simply pauses for 1.5 seconds and then continues execution.
bp sqlservr!CAssemblyMetaInfo::CreateClrInterfaces ".sleep 1500;g"
I used OSTRESS, from the RML toolkit, to execute the function with a query timeout of 1 second, so that while the debugger has the SQL Server process stopped the client query timeout occurs predictably.
ostress -dtempdb -E -S.\sql2008 -Q"DECLARE @garg geography = 'LINESTRING(3 4, 8 11)'; SELECT dbo.funcBufferGeography(@garg, 1.1)" -oc:\temp\breakout -t1
Sure enough, I was able to reproduce the error log behavior: repeated unloaded messages without the matching loaded pairing. In the debugger you can also see the C++ exceptions being thrown for the 3617 SQL exception.
I can’t/don’t want to hook up a debugger to production!
It is not practical to hook a debugger up to the production instance, but once we understand the pattern it is easy to see from a simple SQL Server trace (a sketch for reading the trace file follows the list below) and the ring buffer entries shown previously.
- The attention (the internal 3617 C++ SQL exception) is always logged after the BatchCompleted and Attention events; as you can see, each BatchCompleted is followed by an Attention.
- There is never an error log message for AppDomain loading.
- The unloading occurs on SPID = 31 and it is marked IsSystem = 1, matching the spid31s output in the error log.
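If the trace was captured to a file, a minimal sketch for pulling those events back out (the path is hypothetical, not from the original post):

SELECT StartTime, SPID, IsSystem, EventClass, TextData
FROM sys.fn_trace_gettable(N'C:\temp\repro.trc', DEFAULT)
WHERE EventClass IN (12, 16)   -- 12 = SQL:BatchCompleted, 16 = Attention
ORDER BY StartTime;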
Garbage Collection (GC) and CLR_* Wait Types
While loading, initializing, or unloading an application domain, SQL Server prevents additional CLR activity against that same application domain. This results in the CLR_* wait types you might expect; only one thread can load the application domain at a time. This is no different from loading a DLL, where the operating system holds the process loader lock (a critical section) during image load and resolution processing.
In this customer's case the mini-dump and sys.dm_exec_requests output revealed 34 additional waiters on the geography data type (the mssqlsystemresource.dbo application domain).
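A minimal sketch of the kind of request-level query that exposes those waiters:

SELECT session_id, status, command, wait_type, wait_time, last_wait_type
FROM sys.dm_exec_requests
WHERE wait_type LIKE N'CLR%'
ORDER BY wait_time DESC;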
GC/Convoy: The GC activity is a side effect, but it helped cause a convoy on this customer's system. When the domain is unloaded, the CLR runtime forces a garbage collection (GC) across all generations to make sure anything related to the domain has been properly cleaned up. During a GC, (usually) all CLR activity is suspended, no matter which application domain it belongs to.
Here is what the convoy is doing:
- SPID 50 – Attempted to load the application domain, failed, and is unloading it and performing GC activity.
- SPIDs 51, 52, 53 … 80 are all waiting on the SQL CLR_* protection object and can't advance until SPID 50 is able to load and initialize, or unload and clean up, the application domain.
- The time it takes for SPID 50 to complete the unload/GC causes SPIDs 51, 52, 53, … 80 to time out.
- SPID 50 completes and SPID 51 tries to load the application domain. The query cancellation is already queued, so SPID 51 detects it during initialization and issues the application domain unload again.
- SPID 51 completes and SPID 52 tries to load the application domain. …… you get the idea…..
The convoy floods the error log with application domain unload messages while no real work gets done by the sessions. The server encounters stop-and-go behavior as it attempts to load the domain, allow CLR workers to execute, unload the domain, and suspend CLR worker activity, repeating the behavior over and over again.
Solving the problem
Troubleshooting this is no different from any other resource bottleneck: capture traces, performance monitor logs, and other outputs, and track down why the original bottleneck occurred. In this case the question is, "Why is the system taking so long to load the assembly, or is the query timeout improperly set to something tiny?"
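As a starting point for that investigation, a minimal sketch that quantifies how much time is accumulating in CLR-related waits:

SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms, max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE N'CLR%'
ORDER BY wait_time_ms DESC;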
Bob Dorr - Principal SQL Server Escalation Engineer