Back Up Your Reporting Services Encryption Key

If you run SQL Server Reporting Services, part of your DR plan needs to include a backup of the encryption key for SSRS. This is sadly an all too often overlooked part of the solution, even though it is incredibly easy to do. If you don’t have a backup of the encryption key during a restore, the report server will never be able to decrypt the encrypted content (connection strings, passwords, etc.) stored in the database, and your only recourse is to delete the encrypted content and recreate it manually or through a redeployment of data sources.

https://docs.microsoft.com/en-us/sql/reporting-services/install-windows/ssrs-encryption-keys-back-up-and-restore-encryption-keys?view=sql-server-2017

For those who are into PowerShell, Microsoft has provided a simple function, Backup-RsEncryptionKey, that can be used for this as well:

https://www.powershellgallery.com/packages/ReportingServicesTools/0.0.0.2/Content/Functions%5CAdmin%5CBackup-RsEncryptionKey.ps1

Make sure that you have backups of your other keys and certificates as well. Even if you think you’ve got your bases covered, now is a great time to verify, especially if you are using Transparent Data Encryption (TDE) for a database. One of the worst emails we’ve ever gotten from someone in the community was that they had backups of their database but not the certificate used for encryption, and they couldn’t restore the database or access the data. There is absolutely nothing anybody can do to help you if this happens.
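If you use TDE, backing up the certificate and its private key only takes a couple of statements. Here is a minimal sketch; the certificate name, file paths, and password are hypothetical placeholders for your own values:

```sql
USE master;
GO
-- Back up the certificate protecting the database encryption key (name and paths are placeholders)
BACKUP CERTIFICATE TDECert
TO FILE = 'D:\KeyBackups\TDECert.cer'
WITH PRIVATE KEY (
    FILE = 'D:\KeyBackups\TDECert.pvk',
    ENCRYPTION BY PASSWORD = '<UseAStrongPasswordHere!>'
);
GO
```

Store these files (and the password) somewhere other than the server holding the database backups, and test restoring the certificate to another instance as part of your DR testing.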

Availability Group Readable Secondaries – Just Say No

Since SQL Server 2012, Microsoft has put a lot of effort into marketing AlwaysOn Availability Groups as a feature and solution for both high availability/disaster recovery for SQL Server and scalability for read-only workloads.  Over the last six years, I have talked more clients out of using Availability Groups than I have actually implemented, for a lot of different reasons.  As easy as Microsoft has made it to build an Availability Group, restore and synchronize the databases, join secondary servers, etc., the simple fact is that once they are in production use, you had better know what not to do that could break things, the impact of having a readable secondary, and how to troubleshoot problems when they occur, because they will.

A lot of the implementation questions I get for Availability Groups are generally looking at leveraging a secondary replica for high availability or disaster recovery, and oftentimes also for offloading reporting or readable workloads.  Readable secondary copies of the database are really simple on the surface conceptually, but there are a lot of additional considerations I make that could lead you to another technology entirely, like Transactional Replication or Log Shipping with Standby. Let’s look at how these three stack up against each other for different data access requirements.

Real-time Data

First things first: for real-time data access, you only have one option, the primary writeable copy of the database. In replication this is the publisher database, log shipping calls it the log shipped primary, and Availability Groups refer to it as the primary replica.  None of the secondary copies of the data, even in a synchronous Availability Group replica, is ever guaranteed to actually be real-time data.  A synchronous Availability Group only guarantees that the log records for an operation are hardened in the synchronous secondary replica’s transaction log when a transaction commits, allowing them to later be redone on the secondary database.  Depending on the volume of activity that is occurring, and whether the secondary replica is being used for read-only workloads or has sufficient hardware, the redo thread can typically be anywhere from a few hundred milliseconds to a few seconds behind the actual primary replica.  This means that if you insert data into a table, scope the identity value and return it to the application, then attempt to read that record immediately from the read-only secondary, it might not actually be there yet (yes, this really does happen).
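If you want to see how far behind a readable secondary actually is, the replica state DMVs expose the log send and redo queues. This is a minimal sketch run on the primary replica, and it assumes SQL Server 2014 or later for the is_primary_replica column:

```sql
-- How far behind is each secondary database? Run on the primary replica.
SELECT ar.replica_server_name,
       DB_NAME(drs.database_id) AS database_name,
       drs.log_send_queue_size,   -- KB of log not yet sent to the secondary
       drs.redo_queue_size,       -- KB of log hardened but not yet redone
       drs.redo_rate,             -- KB/sec currently being redone
       drs.last_commit_time,
       drs.last_redone_time
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drs.replica_id
WHERE drs.is_primary_replica = 0;
```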

Near Real-time Data

For near real-time data access, both Replication and Availability Groups can meet the requirement. Replication will typically have slightly more latency, by a few seconds, than an Availability Group readable secondary in either commit mode due to the way replication functions.  The log reader agent has to read the transaction log and write the changes to the distribution database, and then the distribution agent has to pick those changes up and apply them to the subscriber.  This typically has a latency of 3-7 seconds under normal conditions, whereas an Availability Group readable secondary could have less latency depending on the volume of activity. The big difference between the two is entirely in the implementation and the features supported.
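Before getting into the feature differences, it is worth noting that replication latency is easy to measure with tracer tokens, either from Replication Monitor or with T-SQL. A rough sketch, assuming a hypothetical transactional publication named SalesPublication and run in the publication database on the publisher:

```sql
-- Post a tracer token into the log; it flows through the log reader and distribution agents
EXEC sys.sp_posttracertoken @publication = N'SalesPublication';
GO
-- A little later, list the tokens and review the latency each one experienced at every hop
EXEC sys.sp_helptracertokens @publication = N'SalesPublication';
-- EXEC sys.sp_helptracertokenhistory @publication = N'SalesPublication', @tracer_id = <id from previous result>;
```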

If you need a non-clustered index to support the read-only workload, that index must be created on the primary database of an Availability Group for it to be available on the readable secondary.  This means you have to back up and maintain that index on the primary, and the maintenance will affect all servers in the Availability Group as a logged operation requiring redo on the secondary replicas.  This is not the case with transactional replication, where the subscriber database is a writeable copy of the database and can have a completely different indexing design from the publisher database, based purely on the read-only workload.  Since the subscriber database is writeable, you can also use features like Query Store, plan forcing, and automatic plan correction for the read-only workload; all of these are unavailable on a readable secondary replica in an Availability Group because the database is an exact duplicate of the primary (the primary can be enabled for Query Store and use these features, but the readable secondary cannot).
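As a sketch of what that looks like in practice (the database, table, and index names here are hypothetical), the reporting index lives only on the subscriber, and the subscriber database can have Query Store and automatic tuning enabled independently of the publisher (automatic tuning requires SQL Server 2017 or later):

```sql
-- On the subscriber only: an index that exists purely to support the reporting workload
CREATE NONCLUSTERED INDEX IX_OrderDetail_Reporting
ON dbo.OrderDetail (OrderDate)
INCLUDE (CustomerID, ProductID, LineTotal);
GO
-- Query Store and automatic plan correction for the subscriber database
ALTER DATABASE SalesSubscriber SET QUERY_STORE = ON;
ALTER DATABASE SalesSubscriber SET AUTOMATIC_TUNING (FORCE_LAST_GOOD_PLAN = ON);
```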

Another consideration is the impact of the 14-byte row-version tag that must be generated to support snapshot-based reads on the readable secondary in an Availability Group (the same row versioning used by Read Committed Snapshot Isolation, RCSI).  These versions can only be generated on the primary replica database, but are not required on the publisher database for transactional replication. Why does this matter? The extra 14 bytes per row can lead to page splits that cause logical fragmentation of the indexes, so moving to a readable secondary may require monitoring index fragmentation and introducing fill factor changes where you wouldn’t have needed them previously. If you have read the incredibly bad information online telling you that you can ignore index fragmentation entirely with SSDs, or that you don’t need to rebuild indexes with an Availability Group, consider this: logical scan fragmentation may not matter much from a performance standpoint on fast storage, but low page density caused by page splits affects every single replica copy, wastes space on disk and in memory in the buffer pool, and affects plan selection due to query costing by the optimizer. With replication, all of this is avoided entirely unless you use RCSI for your publisher workload anyway, and even then it does not impact the subscribers in any way, since only the data modification operations are replicated, not the pages themselves.
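If you want to see whether low page density is actually hurting you, look at it alongside fragmentation. A minimal sketch for the current database; SAMPLED mode is used because LIMITED does not populate the page density column:

```sql
-- Page density and logical fragmentation for larger indexes in the current database
SELECT OBJECT_NAME(ips.object_id) AS table_name,
       i.name AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.avg_page_space_used_in_percent,  -- low values = wasted space on disk and in the buffer pool
       ips.page_count
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'SAMPLED') AS ips
JOIN sys.indexes AS i
    ON i.object_id = ips.object_id
   AND i.index_id = ips.index_id
WHERE ips.page_count > 1000
ORDER BY ips.avg_page_space_used_in_percent;
```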

Point-in-time Data

For point-in-time data access, I often go low-tech and suggest using Log Shipping with Standby and a delayed restore to create a warm copy of the database that is restored to a known point in time.  This can be useful for accounting reports on the previous day’s sales transactions or for performing ETL into a data mart or data warehouse, and it also provides the benefit of having a second copy of the database at a known point in time for DR purposes.  You can typically also do this type of reporting off of the near real-time data provided by Replication and Availability Groups. The table below breaks down the features and requirements of each of these three technologies for providing a readable copy of the database.

| | Real-time Data | Secondary Delay | Readable Workload Indexes | Row-version Tag Overhead | Query Store Support | Automatic Plan Correction | Partial Data Only |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Availability Groups (Sync) | Primary Only | Depends on redo, typically within a few seconds | Must be created on Primary | Primary database with Readable Secondary | Primary Only | Primary Only | No |
| Availability Groups (Async) | Primary Only | Depends on log send queue and redo, typically within a few seconds | Must be created on Primary | Primary database with Readable Secondary | Primary Only | Primary Only | No |
| Replication | Publisher Only | Depends on log reader and distribution agents, typically within a few seconds | May be created on subscriber copy only | Only if RCSI enabled | Publisher and Subscribers | Publisher and Subscribers | Yes |
| Log Shipping | Primary Only | Depends on restore job configuration and standby; point in time only, updates require disconnecting users | Must be created on Primary | Only if RCSI enabled | Primary Only | Primary Only | No |

Complexity and Manageability

So what about complexity and manageability of these different options?

In the opening paragraph of this post I pointed out how easy Microsoft has made it to build an Availability Group to handle read-only copies of a database or group of databases. Setting up an Availability Group really is quite simple, but that doesn’t mean it lacks complexity.  First, you need to know what kind of Availability Group you are going to implement.  What started as a single feature has morphed over each release into:

  • Basic Availability Groups – Standard Edition (single database)
  • Enterprise Availability Groups – Enterprise Edition (one or many databases)
  • Distributed Availability Groups – Support for up to 17 readable secondaries without daisy-chaining
  • Domain Independent Availability Groups – Environments without Active Directory
  • Read-scale Availability Groups – No clustering requirement or HA

Add Linux into the mix of options, as well as whether or not you need failover clustering as a part of that configuration, and this isn’t as simple as it seems.  One of the common areas of failure with Availability Groups that I’ve seen is choosing the correct quorum configuration for the Windows Server Failover Cluster, though Microsoft has made this easier in Windows Server 2012 R2 and higher. An error in quorum configuration can result in your high availability solution failing and taking the databases offline.

Replication tends to get a bad rap when it comes to complexity and manageability. The reality is that the technology just isn’t well understood by most SQL Server professionals. It is actually easy to configure (there’s a great UI in SSMS for it as well), and as long as you don’t tinker with it or modify data on the subscriber databases, it usually just works without a hitch. Change management is something that has to be planned out with Replication, where it wouldn’t have to be with Availability Groups.  New tables have to be added to replication manually as articles, and certain operations like dropping and recreating a table or truncating a table are not allowed while the table is being replicated (sorry, no more using the SSMS designer to insert a column in the middle of a table). Compared to Availability Groups, just for change management, Replication is more complex and requires more work to manage over time, but the types of changes that require manual intervention in Replication typically don’t occur at a frequent pace for established applications and databases.
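For example, adding a new table to an existing transactional publication is a manual step along these lines (the publication and table names are hypothetical, and the exact initialization approach varies with how the publication is configured):

```sql
-- Add the new table as an article to an existing transactional publication
EXEC sp_addarticle
    @publication   = N'SalesPublication',
    @article       = N'NewOrders',
    @source_owner  = N'dbo',
    @source_object = N'NewOrders',
    @type          = N'logbased';
GO
-- Generate a snapshot so the new article can be initialized at the subscribers
EXEC sp_startpublication_snapshot @publication = N'SalesPublication';
```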

Log Shipping is just stupidly simple, low-tech, and easy to manage. I mean, really, how difficult is taking a backup and restoring it on another server? You might be surprised, but the configuration wizard and Agent jobs automate this so that you typically don’t have to do anything but let it do its thing. Many third-party backup tools even offer their own flavor of log shipping as part of their backup products for SQL Server. If something ever gets out of sync for log shipping, it usually means that someone took a manual log backup of the database, or switched recovery models and broke the log chain. In that case the standby database has to be reinitialized from a new FULL backup of the primary to resume the restores (or, if a manual log backup was performed, you can apply the missing backup manually to close the gap).
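Under the covers, the restore job on the standby server is just doing something like the following for each log backup (the database and file names here are placeholders); the STANDBY option is what keeps the database readable between restores:

```sql
-- Restore the next log backup while leaving the database readable (users must be disconnected first)
RESTORE LOG SalesDB
FROM DISK = N'L:\LogShipping\SalesDB_20180101_0100.trn'
WITH STANDBY = N'L:\LogShipping\SalesDB_undo.tuf';
```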

Summary

While Availability Groups have become the go-to technology for providing a readable copy of data for SQL Server, they are not the only feature available to meet this requirement.  It is important to understand the requirements and the limitations of any feature that you are considering using, and to also evaluate the other available features that might be a better fit for the needs of the environment.  I have used transactional replication for nearly 10 years to solve read-only workload and reporting requirements for SQL Server, and I continue to use it over Availability Groups where, as a technology, it is just a better fit for the requirements.  Being able to reduce indexing on the primary writeable copy of the database and only have the readable-workload non-clustered indexes on the subscribers can be a huge benefit for some workloads. For one client it was the difference between an 800GB publisher database and a 1.7TB primary AG replica, which affected their backup/restore times and RTO SLAs.  Collect the requirements, evaluate the technologies and their limitations, then choose the appropriate way to solve the problem.

CPU Ready Impact on SOS_SCHEDULER_YIELD

At the end of May 2017, Paul and I had a discussion about SOS_SCHEDULER_YIELD waits and whether or not they could become skewed in SQL Server as a result of hypervisor scheduling issues for a VM running SQL Server.  I did some initial testing of this in my VM lab using both Hyper-V and VMware and was able to create the exact scenario that Paul was asking about, which he explains in this blog post.  I’ve recently been working on reproducing another VM-scheduling issue that shows up in SQL Server, where again the behaviors being observed are misleading as a result of the hypervisor being oversubscribed; only this time it was not the SQL Server VM that was oversubscribed, it was the application servers.  While I was building a repro of that in my lab, I decided to take some time to rerun the SOS_SCHEDULER_YIELD tests and write a blog post showing the findings in detail.

The test environment that I used for this is a portable lab I’ve used for VM demos over the last eight years teaching our Immersion Events. The ESX host has 4 cores and 8GB RAM and hosts three virtual machines: a 4vCPU SQL Server VM with 4GB RAM, and two 2vCPU Windows Server VMs with 2GB RAM each that are used strictly to run Geekbench to produce load.  Within SQL Server, I use a reproducible workload that drives parallelism and is consistently repeatable, which I have also used for years when teaching classes.

For the tests, I first ran a baseline where the SQL Server VM is the only machine executing any tasks/workload at all, but the other VMs are powered on and just sitting there idle on the host.  This establishes a baseline for the host, which averaged 1% RDY time during the workload execution, after which I collected the wait stats for SOS_SCHEDULER_YIELD.
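The wait stats for each run were captured with a simple query against sys.dm_os_wait_stats, roughly like the sketch below, with the cumulative stats cleared before each workload execution:

```sql
-- Clear the cumulative wait stats before each test run
DBCC SQLPERF(N'sys.dm_os_wait_stats', CLEAR);
GO
-- ... run the workload, then capture the SOS_SCHEDULER_YIELD waits ...
SELECT wait_type, waiting_tasks_count, wait_time_ms, max_wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_type = N'SOS_SCHEDULER_YIELD';
```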

[Screenshot: SOS_SCHEDULER_YIELD wait statistics for the baseline run]

Then I reran the test with four copies of Geekbench running on one of the application VMs, to drive up that VM’s CPU usage and keep the hypervisor having to schedule it, and reran the SQL Server workload.  This put the SQL Server VM at 5% RDY time in the hypervisor during the tests, which is my low watermark for where you would expect performance issues to start showing.  When the SQL workload completed, I recorded the waits again.

[Screenshot: SOS_SCHEDULER_YIELD wait statistics with one application VM running Geekbench]

Then I repeated the test with both of the application VMs running four copies of Geekbench.  The SQL Server VM had an average of 28% RDY time, and when the workload completed the waits were recorded a final time.

[Screenshot: SOS_SCHEDULER_YIELD wait statistics with both application VMs running Geekbench]

As you can see below there is a direct increase in the wait occurrence with the increase in RDY%.

| Test | AVG RDY% | wait_type | waiting_tasks_count | wait_time_ms | max_wait_time_ms | signal_wait_time_ms | Duration |
| --- | --- | --- | --- | --- | --- | --- | --- |
| 2 APP VM | 28.4 | SOS_SCHEDULER_YIELD | 10455 | 22893 | 124 | 22803 | 3:59 |
| 1 APP VM | 5.1 | SOS_SCHEDULER_YIELD | 2514 | 3618 | 62 | 3608 | 1:57 |
| Baseline | 1.2 | SOS_SCHEDULER_YIELD | 1392 | 974 | 12 | 963 | 1:32 |

High CPU Ready time is something that should be consistently monitored for SQL Server VMs where low-latency response times are required.  It should also be monitored for application VMs where latency is important.  One thing to be aware of with any of the wait types on a virtualized SQL Server, not just SOS_SCHEDULER_YIELD, is that the time spent waiting also includes the time the VM has to wait to be scheduled by the hypervisor.  The SOS_SCHEDULER_YIELD wait type is unique in that the number of occurrences increases as well, because the time shifts cause quantum expiration internally for a thread inside SQLOS and force it to yield, even if it didn’t actually get to make progress because the hypervisor wasn’t scheduling the VM.