This blog post is built directly from a customer-reported issue. As I helped investigate the source of the issue, I thought it would be of interest to a broader audience; hopefully you find it interesting as well.
Allow me to provide some history of the problem before I dive into extended details.
The SQL Server error log was flooded with the following pattern approximately every second. The customer reported lots of CLR_* wait types, and mini-dumps revealed heavy GC (garbage collection) activity taking place.
...
2013-01-22 08:07:44.91 spid37s AppDomain 890288 (mssqlsystemresource.dbo[ddl].890287) unloaded.
2013-01-22 08:07:45.73 spid31s AppDomain 890289 (mssqlsystemresource.dbo[ddl].890288) unloaded.
2013-01-22 08:07:46.41 spid34s AppDomain 890290 (mssqlsystemresource.dbo[ddl].890289) unloaded.
...
The domain name was always the same. Domain names are based on the database and object owner, and are given a new generation id each time they are loaded. The pattern shown above indicates that SQL Server has loaded and unloaded the mssqlsystemresource (dbo) application domain 890,000+ times. Ouch!
This is not the normal pattern, as you may have already imagined. The normal pattern is a domain loaded message paired with a matching unloaded message, separated by a reasonable length of time while the domain is in use. This error log was not showing any loaded messages, just the unloaded messages.
The other distinct difference that stood out was the text of the unloaded message. There are roughly six different unloaded messages that SQL Server can log, and all the others indicate a reason (out of memory, .NET exception, locking protocol violation, and so on). This one was just a generic unload with no reason given, which is exactly why the customer wanted to know the cause of the issue.
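To see the pattern without paging through the log by hand, the (undocumented) xp_readerrorlog procedure can filter the error log entries. This is a minimal sketch, not from the original investigation; parameter 0 means the current log and 1 means the SQL Server error log, with the string acting as a search filter.

EXEC master..xp_readerrorlog 0, 1, N'AppDomain';
-- In a healthy log each unloaded entry has a matching, earlier loaded entry;
-- in this case only the generic unloaded messages were present.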
Note: The stack traces and debugging activities are using public symbols and a SQL Server 2008 build. http://support.microsoft.com/kb/311503
SQL Server AppDomain Loaded Message
SQL Server does not log the application domain loaded message until the domain has been both loaded and initialized. Loading the domain and initializing its CLR interfaces are two distinct states. This is important because an application domain can be loaded within the CLR runtime but not yet fully initialized, and in that window SQL Server does NOT record the application domain loaded message in the error log.
My Theory
Based on the behavior, I had a hunch that a query cancellation or error was occurring during the loading of the application domain. SQL Server never reached full initialization, so it was starting the domain part way up and then tearing it down (unloading) to clean up properly.
After asking the customer what they were doing with CLR and looking at some traces (.TRC files), I found the geography data type was in use, and its supporting assembly is associated with a system database.
I was also able to use sys.dm_os_ring_buffers to get information about the application domain state machine changes. I found the domains were transitioning from creating to unloaded within a few milliseconds.
Ring Buffer: RING_BUFFER_CLRAPPDOMAIN
Ring Buffer: RING_BUFFER_EXCEPTION
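A minimal sketch of a query that surfaces both of these ring buffers (my addition; the ms_ticks arithmetic only approximates each record's wall-clock time, and the XML layout varies by build):

SELECT DATEADD(ms, rb.[timestamp] - osi.ms_ticks, GETDATE()) AS approx_time,
       rb.ring_buffer_type,
       CAST(rb.record AS xml) AS record_xml
FROM sys.dm_os_ring_buffers AS rb
CROSS JOIN sys.dm_os_sys_info AS osi
WHERE rb.ring_buffer_type IN (N'RING_BUFFER_CLRAPPDOMAIN', N'RING_BUFFER_EXCEPTION')
ORDER BY rb.[timestamp] DESC;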
SQL Server handles the vast majority of error conditions by throwing a custom C++ exception type (the SQL exception). The exception holds a major and minor error code along with other information associated with the condition. You can use the formula (Major * 100) + Minor to build the SQL Server error code, or message_id, as provided in sys.messages.
For a query cancellation this is the internal error 3617, that is, Major = 36 and Minor = 17.
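You can verify the arithmetic directly against sys.messages; a quick check (not from the original post):

SELECT message_id, severity, [text]
FROM sys.messages
WHERE message_id = (36 * 100) + 17   -- 3617, the internal query cancellation error discussed above
  AND language_id = 1033;            -- us_english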
The stack associated with the cancellation, raised during the initialization of the application domain, is the following.
sqlservr!clr_ex_raise <------- Major = 36, Minor = 17
sqlservr!CAppDomain::CreateManagedDomain
sqlservr!CAppDomain::InitExpensive
sqlservr!CAppDomainManager::GetAppDomain
sqlservr!CCLRHost::GetAppDomain
sqlservr!CAssemblyMetaInfo::GetAppDomainForVerification
sqlservr!CAssemblyMetaInfo::CreateClrInterfaces
sqlservr!CAssemblyMetaInfo::InitClrInterfaces
sqlservr!CAssemblyMetaInfo::LoadAssemblyFromDatabase
sqlservr!CAssemblyMetaInfo::LoadAssemblyFromDatabase
Stack frames are associated with the CLRAPPDOMAIN ring buffer entries, and from them I was able to see where the unload was getting triggered.
sqlservr!AppDomainRingBufferRecord::StoreRecord+0x9c
sqlservr!CAppDomain::StateTransition+0xcf
sqlservr!CAppDomainManager::AppDomainStateTransitionLockHeld+0xa5
sqlservr!CAppDomainManager::AppDomainStateTransition+0x30
sqlservr!CAppDomain::UnloadManaged+0x22
sqlservr!CAppDomain::Release+0xd3
sqlservr!CAutoRefc<CAppDomain>::~CAutoRefc<CAppDomain>+0x914a3f
sqlservr!CAssemblyMetaInfo::CreateClrInterfaces+0x135
sqlservr!CAssemblyMetaInfo::InitClrInterfaces+0x7b
sqlservr!CAssemblyMetaInfo::LoadAssemblyFromDatabase+0x144
sqlservr!CAssemblyMetaInfo::LoadAssemblyFromDatabase+0xa2
sqlservr!ResolveUdf+0x2c4
sqlservr!CAlgUtils::TrpGetExpressionPropsAlg+0x523
sqlservr!CAlgUtils::TrpGetExpressionPropsWithHandler+0x112
sqlservr!udf::FBindObject+0x5d
sqlservr!udf::FBind+0x2b3
A few key aspects of this stack are worth noting:
- FBind – Used during compile, so we are still compiling the query. This is not an execution/runtime issue, which means a database clone can be used to reproduce the problem.
- LoadAssemblyFromDatabase – Loads the assembly, which can create the application domain, and in this case it is doing just that.
- CreateClrInterfaces – Doing the initialization work (we have not printed the AppDomain loaded message yet).
- UnloadManaged – Triggers an asynchronous application domain unload. The actual unload occurs on a system thread, which is exactly why the unloaded messages are logged by a system (s) spid.
Now I needed to exercise this code path to validate my findings. I knew that if I restarted SQL Server the application domain would not be loaded and the plan would not be in the procedure cache.
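If you want to verify that state after the restart, a minimal sketch (my addition; sys.dm_clr_appdomains only reports domains that are currently loaded) is:

SELECT appdomain_id, appdomain_name, [state], creation_time
FROM sys.dm_clr_appdomains;   -- after a restart, expect no row for the system-database domain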
A coworker of mine (BillHol) assisted in creating a simple function that causes the assembly backing the geography data type (SqlGeography) to be loaded under the mssqlsystemresource.dbo application domain. (In SQL Server 2012 this loads under master.dbo.)
use tempdb
go
CREATE FUNCTION dbo.funcBufferGeography(@p1 geography, @p2 float)
RETURNS geography
AS
BEGIN
DECLARE @g geography;
DECLARE @distance float;
SELECT @g = @p1.STAsText();
RETURN (@g);
END;
GO
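Issuing a single call like the one below (the same statement the OSTRESS run uses later) forces the geography assembly and its application domain to load; with no breakpoint in place the loaded message should appear in the error log as expected. Restart SQL Server afterward so the domain is unloaded again before attempting the repro.

DECLARE @garg geography = 'LINESTRING(3 4, 8 11)';
SELECT dbo.funcBufferGeography(@garg, 1.1);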
I then went to the debugger and set a breakpoint. I wanted to create a simulated stall during the interface creation to see whether a query cancellation (attention) during this phase of the compile would reproduce the pattern. The breakpoint simply pauses for 1.5 seconds and then continues execution.
bp sqlservr!CAssemblyMetaInfo::CreateClrInterfaces ".sleep 1500;g"
I used OSTRESS, from the RML toolkit, to execute the function with a query timeout of 1 second, so that while the debugger has the SQL Server process stopped the client query timeout occurs predictably.
ostress -dtempdb -E -S.\sql2008 -Q"DECLARE @garg geography = 'LINESTRING(3 4, 8 11)'; SELECT dbo.funcBufferGeography(@garg, 1.1)" -oc:\temp\breakout -t1
Sure enough, I was able to reproduce the error log behavior: repeated unloaded messages without the matching loaded pairing. In the debugger you can also see the C++ exceptions being thrown for the 3617 SQL exception.
I can’t/don’t want to hook up a debugger to production!
It is not practical to hook a debugger up to the production instance, but once we understand the pattern it is easy to see from a simple SQL Server trace (a sketch for reading the trace file follows the list below) and the ring buffer entries shown previously.
- The attention (the internal 3617 C++ SQL exception) is always logged after the BatchCompleted and Attention events; as you can see, each BatchCompleted is followed by an Attention.
- There is never an error log message for AppDomain loading.
- The unloading occurs on SPID = 31 and it is marked IsSystem = 1, matching the spid31s output in the error log.
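If the trace was captured to a file, a minimal sketch for pulling those events back out (the path is hypothetical, not from the original post):

SELECT StartTime, SPID, IsSystem, EventClass, TextData
FROM sys.fn_trace_gettable(N'C:\temp\repro.trc', DEFAULT)
WHERE EventClass IN (12, 16)   -- 12 = SQL:BatchCompleted, 16 = Attention
ORDER BY StartTime;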
Garbage Collection (GC) and CLR_* Wait Types
While loading, initializing, or unloading an application domain, SQL Server prevents additional CLR activity against that same application domain. This results in the CLR_* wait types you might expect; only one thread can load the application domain at a time. This is no different from loading a DLL, where the operating system holds the process loader lock (a critical section) during image load and resolution processing.
In this customer's case the mini-dump and sys.dm_exec_requests output revealed 34 additional waiters on the geography data type (the mssqlsystemresource.dbo application domain).
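A minimal sketch of the kind of request-level query that exposes those waiters:

SELECT session_id, status, command, wait_type, wait_time, last_wait_type
FROM sys.dm_exec_requests
WHERE wait_type LIKE N'CLR%'
ORDER BY wait_time DESC;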
GC/Convoy: The GC activity is a side effect, but it helped cause a convoy on this customer's system. When the domain is unloaded, the CLR runtime forces a garbage collection (GC) across all generations to make sure anything related to the domain has been properly cleaned up. During a GC, (usually) all CLR activity is suspended, no matter which application domain it belongs to.
Here is what the convoy is doing:
- SPID 50 – Attempted to load the application domain, failed, and is unloading it and performing GC activity.
- SPIDs 51, 52, 53 … 80 are all waiting on the SQL CLR_* protection object and can't advance until SPID 50 is able to load and initialize, or unload and clean up, the application domain.
- The time it takes for SPID 50 to complete the unload/GC causes SPIDs 51, 52, 53, … 80 to time out.
- SPID 50 completes and SPID 51 tries to load the application domain. The query cancellation is already queued, so SPID 51 detects it during initialization and issues the application domain unload again.
- SPID 51 completes and SPID 52 tries to load the application domain. …… you get the idea…..
The convoy floods the error log with application domain unload messages while no real work gets done by the sessions. The server encounters stop-and-go behavior as it attempts to load the domain, allow CLR workers to execute, unload the domain, and suspend CLR worker activity, repeating the behavior over and over again.
Solving the problem
Troubleshooting this is no different from any other resource bottleneck: capture traces, performance monitor logs, and other outputs, and track down why the original bottleneck occurred. In this case the question is, "Why is the system taking so long to load the assembly, or is the query timeout improperly set to something tiny?"
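As a starting point for that investigation, a minimal sketch that quantifies how much time is accumulating in CLR-related waits:

SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms, max_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type LIKE N'CLR%'
ORDER BY wait_time_ms DESC;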
Bob Dorr - Principal SQL Server Escalation Engineer