The Curious Case of… the CLR assembly failure after an AG failover

(The Curious Case of… used to be part of our bi-weekly newsletter but we decided to make it a regular blog post instead so it can sometimes be more frequent. It covers something interesting one of us encountered when working with a client, doing some testing, or were asked in a random question from the community.)

Jonathan was working with a client recently who experienced a CLR assembly failure after an AG failover and needed to figure out why. They’d been testing their AG disaster recovery strategy and ran into an unexpected problem with their application which relies heavily on SQLCLR and an UNSAFE assembly that calls a web service from inside SQL Server.  When they failed over their AG to their DR server, the CLR assembly failed with the following error:

An error occurred in the Microsoft .NET Framework while trying to load assembly id 65546. The server may be running out of resources, or the assembly may not be trusted with PERMISSION_SET = EXTERNAL_ACCESS or UNSAFE. Run the query again, or check documentation to see how to solve the assembly trust issues. For more information about this error: System.IO.FileLoadException: Could not load file or assembly ‘sqlclr_assemblyname, Version=1.0.0.0, Culture=neutral, PublicKeyToken=fa39443c11b12591’ or one of its dependencies. Exception from HRESULT: 0x80FC80F1

To try and bypass this error, they executed the command ALTER DATABASE <DBNAME> SET TRUSTWORTHY ON to enable the trustworthy bit on the DR server.  They then tried the steps in KB Article 918040 and changed the database owner for the database on the DR server and then their CLR assembly began to work.

Well, at least it worked until they tried to failover to their original primary replica, and they again began to have problems with their CLR assembly.

Why would that be the case, especially since it originally worked on the primary replica before the DR failover?

It has to do with login SIDs in SQL Server and server-scoped permissions.  The database owner is mapped inside the database by the SID of the login on the server.  If the SID of the owner internally in the database doesn’t match a SID of a server principal on the server then the owner can’t be established.  The dbo SID internally in the database is replicated as a part of the AG, but the server login is not.  Also server scoped objects, like the asymmetric key used to sign the CLR assembly, are maintained in master, as is the login associated with that key and the EXTERNAL_ACCESS or UNSAFE ASSEMBLY permission associated with it.  So to fix this issue and get rid of the TRUSTWORTHY ON bit setting for the database they had to do the following steps:

  1. Create the asymmetric key from the assembly DLL on the DR server.
  2. Change the database owner to match the SID on both servers in sys.server_principals (script the dbo login using sp_help_revlogin to transfer with SID intact to both servers)
  3. Create login from asymmetric key on DR server and grant UNSAFE ASSEMBLY to match primary replica
  4. ALTER DATABASE <DBNAME> SET TRUSTWORTHY OFF
  5. Fail over to test between both sites

Bottom line: it’s *always* a good idea to regularly test your failover strategy as you never know what’s going to fail when you do! Kudos to this client for doing that.

Lazy log truncation or why VLFs might stay at status 2 after log clearing

 Earlier this year I was sent an interesting question about why the person was seeing lots of VLFs in the log with status = 2 (which means ‘active’) after clearing (also known as ‘truncating’) the log and log_reuse_wait_desc showed NOTHING.

I did some digging around and all I could find was an old blog post from 2013 that shows the behavior and mentions that this happens with mirroring and Availability Groups. I hadn’t heard of this behavior before but I guessed at the reason, and confirmed with the SQL Server team.

When an AG secondary is going to be added, at that point in time, the maximum LSN (the Log Sequence Number of the most recent log record) present in the restored copy of the database that will be the secondary must be part of the ‘active’ log on the AG primary (i.e. that LSN must be in a VLF on the primary that has status = 2). If that’s not the case, you need to restore another log backup on what will be the new secondary, and try the AG joining process again. Repeat until it works. You can see how for a very busy system, generating lots of log records and with frequent log backups (which clear the log on the primary), catching up the secondary enough to allow it to join the AG might be difficult, or necessitate temporarily stopping log backups on the primary (possibly opening up a window for increased data loss in a disaster).

To make this whole process easier, when a database is an AG primary, when log clearing occurs, the VLFs don’t go to status = 0; they remain ‘active’ with status = 2. So how does this help? Well, the fact that lots more VLFs are ‘active’ on the AG primary means that it’s more likely that the maximum LSN of a new secondary is still part of the ‘active’ log on the primary, and the AG-joining succeeds without having to repeat the restore-retry-the-join over and over.

(Note: the log manager knows that these VLFs are really ‘fake active’ and can reuse them as if they were ‘inactive’ if the log wraps around (see this post for an explanation) so there’s no interruption to regular operations on the AG primary.)

It’s a clever little mechanism that someone thought of to make a DBA’s life a bit easier and AGs less problematic to set up.

And now you know – log clearing won’t *always* set VLF statuses to zero.

SQLskills SQL101: Readable secondary performance problems

As Kimberly blogged about earlier this year, SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.

Yesterday I blogged about log shipping performance issues and mentioned a performance problem that can be caused by using availability group readable secondaries, and then realized I hadn’t blogged about the problem, only described it in our Insider newsletter. So here’s a post about it!

Availability groups (AGs) are pretty cool, and one of the most useful features of them is the ability to read directly from one of the secondary replicas. Before, with database mirroring, the only way to access the mirror database was through the creation of a database snapshot, which only gave a single, static view of the data. Readable secondaries are constantly updated from the primary so are far more versatile as a reporting or non-production querying platform.

But I bet you didn’t know that using this feature can cause performance problems on your primary replica?

As with most things in life, you don’t get anything for free. Readable secondaries are really useful, but there is a performance trade off you need to be aware of. All queries that are executed against a readable secondary are automatically run using read-committed snapshot isolation. This means they do not require share locks and so will not block any database changes being replayed from the primary replica (i.e. the constant redo of log records on the secondary replica that have been sent from the primary replica).

To do this requires the use of the versioning system, where (simplistically) pre-change versions of records are copied into the version store in tempdb and queries work out which version of the record is the correct one for them to process, based on the query’s starting time. All records that change get a 14-byte tag added on the end of the record that allows a query to see if this is the correct record, and if not to follow a pointer to the previous version of the record in the version store. This has been the mechanism since snapshot isolation and read-committed snapshot isolation were introduced in SQL Server 2005.

Now consider this: all AG replicas are exact copies of the primary replica. So how can versioning work on the readable secondary, adding 14-byte tags to some records? That must break the ‘exact copy’ rule, right?

Well, yes, it would… if the primary replica didn’t also change.

When a readable secondary is configured in an AG environment, all changing records on the primary replica start getting empty 14-byte versioning tags added to them. This is so that the 14-bytes of extra space on the record is noted in the transaction log and replayed on the secondary replicas, allowing the readable secondary to make use of the empty 14-byte space to store the versioning tag it needs.

This doesn’t break the ‘exact copy’ rule because the 14-bytes isn’t used for anything to do with recovery, there just has to be 14-bytes there.

So versioning tags start getting added to changing records on the primary (to be clear, it doesn’t turn on versioning on the primary) so table and index records start to get 14-bytes longer. And what happens when records get longer on pages where there isn’t enough space? Page splits in your indexes (and forwarded records in heaps – but I’ll concentrate on indexes here) leading to low page densities (wasted disk space and buffer pool memory), logical fragmentation (poor scan performance), and a bunch of extra, expensive log record generation from the page splits themselves.

To counteract this, you’ll need to implement (and/or possibly lower existing) fill factors on your indexes and even potentially start doing index maintenance on indexes that may not have required it previously. Quite an insidious problem that can be hard to figure out unless you know what’s going on under the covers!

See the following blog posts for more info:

This MSDN page has more general information and this whitepaper from Microsoft explains in more depth the various performance impacts from using readable secondaries: AlwaysOn Solution Guide: Offloading Read-Only Workloads to Secondary Replicas.

If you’re implementing readable secondaries in your AG configuration, make sure that you also investigate and implement index fill factors in the database so that the versioning tags that are added under the covers don’t start causing page splits and fragmentation.