Baselines for SQL Server and Azure SQL Database

Last week I got an email from a community member who had read this older article of mine on baselining, and asked if there were any updates related to SQL Server 2016, SQL Server 2017, or vNext (SQL Server 2019). It was a really good question. I haven’t visited that article in a while and so I took the time to re-read it. I’m rather proud to say that what I said then still holds up today.

The fundamentals of baselining are the same as they were back in 2012 when that article was first published. What is different about today? First, there are a lot more metrics in the current release of SQL Server that you can baseline (e.g. more events in Extended Events, new DMVs, new PerfMon counters, and sp_server_diagnostics_component_results). Second, options for capturing baselines have changed. In the article I mostly talked about rolling your own scripts for baselining. If you're looking to establish baselines for your servers, you still have the option to develop your own scripts, but you can also use a third-party tool, and if you're running SQL Server 2016+ or Azure SQL Database, you can use Query Store.

As much as I love Query Store, I admit that it is not all-encompassing in terms of baselining a server. It does not replace a third-party tool, nor does it fully replace rolling your own scripts. Query Store captures metrics specific to query execution, and if you're not familiar with this feature, feel free to check out my posts about it.

Consider this core question: What should we baseline in our SQL Server environment? If you have a third-party tool, the data captured is determined by the application, though some tools allow you to customize and capture additional metrics. But if you roll your own scripts, there are some fundamental things that I think you should capture, such as instance configuration, file space and usage information, and wait statistics.
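
If you roll your own, a minimal sketch for snapshotting wait statistics might look like the following (the table name is just a placeholder; persist the output on whatever cadence makes sense for your environment):

CREATE TABLE dbo.WaitStatsBaseline
(
     CaptureDate   DATETIME2(0)  NOT NULL DEFAULT (SYSDATETIME()),
     WaitType      NVARCHAR(60)  NOT NULL,
     WaitingTasks  BIGINT        NOT NULL,
     WaitTimeMs    BIGINT        NOT NULL,
     SignalWaitMs  BIGINT        NOT NULL
);
GO

-- Snapshot the cumulative wait statistics; diff consecutive snapshots to see what accrued between captures.
INSERT INTO dbo.WaitStatsBaseline (WaitType, WaitingTasks, WaitTimeMs, SignalWaitMs)
SELECT wait_type, waiting_tasks_count, wait_time_ms, signal_wait_time_ms
FROM sys.dm_os_wait_stats
WHERE wait_time_ms > 0;   -- optionally filter out benign wait types here
GO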

Beyond that, it really goes back to the question of what problem you are trying to solve. If you are looking at implementing In-Memory OLTP, then you want to capture information related to query execution times and frequency, locking, latching, and memory use. After you implement In-Memory OLTP, you look at those exact same metrics and compare the data. If you're looking at using Columnstore indexes, you need to look at query performance as it stands right now (duration, I/O, CPU) and capture how it changes after you've added one or more Columnstore indexes. But to be really thorough you should also look at index usage for the involved tables, as well as query performance for other queries against those tables, to see if and/or how performance changes after you've added the index. Very few things in SQL Server work truly in isolation; they're all interacting with each other in some way…which is why baselining can be a little bit overwhelming and why I recommend that you start small.
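
For the query-level numbers (duration, I/O, CPU), one starting point is to snapshot sys.dm_exec_query_stats before and after the change; this is just a sketch, and on SQL Server 2016+ Query Store will capture much of this for you:

SELECT TOP (25)
     qs.execution_count,
     qs.total_worker_time,    -- CPU (microseconds)
     qs.total_elapsed_time,   -- duration (microseconds)
     qs.total_logical_reads,
     SUBSTRING(st.text, (qs.statement_start_offset/2) + 1,
          ((CASE qs.statement_end_offset
               WHEN -1 THEN DATALENGTH(st.text)
               ELSE qs.statement_end_offset
            END - qs.statement_start_offset)/2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;
GO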

Back to the original question: is there anything new to consider with SQL Server 2016 and higher? While third-party tools continue to improve and more metrics are available as new features are added and SQL Server continues to evolve, the only thing “really new” is the addition of Query Store and its ability to capture query performance metrics natively within SQL Server. Hopefully this helps as you either look at different third-party tools that you may want to purchase, or you look at rolling your own set of scripts.  If you’re interested in writing your own scripts, I have a set of references that might be of use here.

Lastly, you’ll note that I haven’t said much about Azure SQL Database, and that’s because it’s an entirely different beast.  If you have one or more Azure SQL Databases, then you may know that within the Portal there are multiple options for looking at system performance, including Intelligent Insights and Query Performance Insight.  Theoretically, you could still roll your own scripts in Azure SQL DB, but I would first explore what Microsoft provides to see if it meets your needs.  Have fun!

Removing a database from a replica in an Availability Group

I recently had a scenario in a two-node Availability Group where multiple large-batch modification queries were executed and created a large redo queue on the replica.  The storage on the replica is slower than that on the primary (not a desired scenario, but it is what it is) and the secondary has fallen behind before, but this time it was to the point where it made more sense to remove the database from the replica and re-initialize it, rather than wait several hours for it to catch up.  What I'm about to detail is not an ideal solution.  In fact, your solution should be architected to avoid this scenario entirely (storage of equal capability for all involved nodes is essential).  But stuff happens (e.g., a secondary database unexpectedly pausing), and the goal was to get the replica synchronized again with no downtime.

In my demo environment I have two nodes, CAP and BUCKY.  CAP is the primary, BUCKY is the replica, and there are two databases, AdventureWorks2012 and WideWorldImporters in my TestLocation AG:

Availability Group (named TestLocation) Configuration

In this case, my WideWorldImporters database is the one that's behind on the secondary replica, so this is the database we want to remove and then re-initialize.  On the secondary (BUCKY) we will remove WideWorldImporters from the AG with this T-SQL:

USE [master];
GO

ALTER DATABASE [WideWorldImporters]
     SET HADR OFF;
GO

You can also do this in the UI by right-clicking on the database within the AG and selecting Remove Secondary Database, but I recommend scripting it and then running it (screenshot for reference):

Removing WideWorldImporters from the AG via SSMS

After removing the database, it will still be listed for the AG but it will have a red X next to it (don’t panic).  It will also be listed in the list of Databases, but it will have a status of Restoring…

WideWorldImporters database removed on the secondary replica
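
If you prefer to verify with T-SQL rather than the UI, a quick check against sys.databases (a sketch) shows the same thing:

SELECT name, state_desc
FROM sys.databases
WHERE name = N'WideWorldImporters';   -- expect RESTORING on the secondary at this point
GO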

If you check the primary, the WideWorldImporters database there is healthy:

Database and AG health on the primary

You can still access WideWorldImporters as it's still part of the AG and accessible via the Listener.  The system is still available, but I'm playing without a net.  If the primary goes down, I will not have access to the WideWorldImporters database.  In this specific case, this was a risk I was willing to take (again, because the time to restore the database was less than the time it would take the secondary to catch up).  Also note that because this database is in an Availability Group by itself, the transaction log will be truncated when it's backed up.

At this point, you want to kick off a restore of the most recent full backup of the database on the replica (BUCKY):

USE [master];
GO

RESTORE DATABASE [WideWorldImporters]
     FROM  DISK = N'C:\Backups\WWI_Full.bak'
     WITH  FILE = 1,
     MOVE N'WWI_Primary' TO N'C:\Databases\WideWorldImporters.mdf',
     MOVE N'WWI_UserData' TO N'C:\Databases\WideWorldImporters_UserData.ndf',
     MOVE N'WWI_Log' TO N'C:\Databases\WideWorldImporters.ldf',
     MOVE N'WWI_InMemory_Data_1' TO N'C:\Databases\WideWorldImporters_InMemory_Data_1',
     NORECOVERY,
     REPLACE,
     STATS = 5;
GO

Depending on how long the restore takes, at some point I disable the jobs that run differential or log backups on the primary (CAP), and then manually kick off a differential backup there.
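
Disabling the jobs can be done through SQL Server Agent in SSMS, or with msdb.dbo.sp_update_job (a sketch; the job name below is hypothetical):

USE [msdb];
GO

EXEC dbo.sp_update_job
     @job_name = N'WWI - Log Backup',   -- hypothetical job name
     @enabled = 0;
GO

Then, the differential backup on the primary (CAP):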

USE [master];
GO

BACKUP DATABASE [WideWorldImporters]
     TO  DISK = N'C:\Backups\WWI_Diff.bak'
     WITH  DIFFERENTIAL,
     INIT,
     STATS = 10;
GO

Next, restore the differential on the replica (BUCKY):

USE [master];
GO

RESTORE DATABASE [WideWorldImporters]
     FROM  DISK = N'C:\Backups\WWI_Diff.bak'
     WITH  FILE = 1,
     NORECOVERY,
     STATS = 5;
GO

Finally, take a log backup on the primary (CAP):

USE [master];
GO

BACKUP LOG [WideWorldImporters]
     TO  DISK = N'C:\Backups\WWI_Log.trn'
     WITH NOFORMAT,
     INIT,
     STATS = 10;
GO

And then restore that log backup on the replica (BUCKY):

USE [master];
GO

RESTORE LOG [WideWorldImporters]
     FROM  DISK = N'C:\Backups\WWI_Log.trn'
     WITH  FILE = 1,
     NORECOVERY,
     STATS = 5;
GO

At this point, the database is re-initialized and ready to be added back to the Availability Group.

Now, when I ran into this the other day, I also wanted to apply a startup trace flag to the primary replica and restart the instance.  To make sure that the AG wouldn't try to fail over when the instance restarted, I temporarily changed the primary to manual failover (executed on CAP, screenshot for reference):

USE [master];
GO

ALTER AVAILABILITY GROUP [TestLocation]
     MODIFY REPLICA ON N'CAP\ROGERS' WITH (FAILOVER_MODE = MANUAL);
GO

Change Failover Mode for the AG Temporarily

I restarted the instance, confirmed my trace flag was in play, and then changed the FAILOVER_MODE back to automatic:

USE [master];
GO

ALTER AVAILABILITY GROUP [TestLocation]
     MODIFY REPLICA ON N'CAP\ROGERS' WITH (FAILOVER_MODE = AUTOMATIC);
GO
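
To confirm a startup trace flag is actually in play after the restart, DBCC TRACESTATUS is one way to check (shown for reference):

DBCC TRACESTATUS (-1);   -- lists all globally enabled trace flags
GO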

The last step is to join the WideWorldImporters database on the replica back to the AG:

ALTER DATABASE [WideWorldImporters]
     SET HADR AVAILABILITY GROUP = TestLocation;
GO

After joining the database back to the AG, be prepared to wait for the databases to synchronize before things look healthy.  Initially I saw this:

Secondary database joined to AG, but not synchronized

Transactions were still occurring on the primary between the time the log backup was applied on the secondary (BUCKY) and the time the database was joined back to the AG on the secondary.  You can check the dashboard to confirm this:

Secondary database added to AG, transactions being replayed on secondary
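
If you'd rather check from T-SQL than the dashboard, a sketch against sys.dm_hadr_database_replica_states shows the synchronization state and how much redo remains:

SELECT ar.replica_server_name,
     DB_NAME(drs.database_id) AS database_name,
     drs.synchronization_state_desc,
     drs.redo_queue_size,        -- KB of log remaining to be redone on the secondary
     drs.log_send_queue_size     -- KB of log not yet sent to the secondary
FROM sys.dm_hadr_database_replica_states AS drs
JOIN sys.availability_replicas AS ar
     ON ar.replica_id = drs.replica_id;
GO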

Once the transactions had been replayed, everything was synchronized and healthy:

Databases synchronized (dashboard on primary)

Databases synchronized (connected to secondary)

Once the databases are synchronized, make sure to re-enable the jobs that run differential and log backups on the primary (CAP).  In the end, removing a database from a replica in an Availability Group (and then adding it back) is probably not something you will need to do on a regular basis.  This is a process worth practicing in a test environment at least once, so you’re comfortable with it should the need arise.

Plan Forcing in SQL Server

Last month I was in Portugal for their SQLSaturday event, and I spent a lot of time talking about Plan Forcing in SQL Server – both manual and automatic (via the Automatic Plan Correction feature). I had some really great questions from my pre-con and regular session and wanted to summarize a few thoughts on Plan Forcing functionality.

Forcing plans in SQL Server provides a very easy method for DBAs and developers to stabilize query performance. But plan forcing is not a permanent solution.  Consider the premise on which plan forcing relies: multiple plans exist for a query and one of them provides the most consistent performance.  If I have high variability in query performance, ideally, I want to address that in the code or through schema changes (e.g. indexing).  Forcing a plan for a query is a lot like creating a plan guide – they are similar but they are two separate features – in that it’s a temporary solution.  I also view adding OPTION (RECOMPILE) as a temporary solution. Some of you might be shocked at that, but when I see a RECOMPILE on a query, I immediately ask why it was added, when it was added, and I start looking at what can be done to remove it.

Knowing that this is how I view plan forcing, how do I decide when to force a plan?  When the query has variability in performance.

Consider Query A, which generates multiple plans, but they’re all about the same in terms of duration, I/O, and CPU.  The performance across the different plans is consistent.  I won’t force a plan for that query.

Query with multiple, consistent plans

Next consider Query B, which also generates different plans; some are stable, but a couple are all over the place in terms of duration, I/O, and CPU.  Maybe a couple of plans provide good performance, but the rest are awful.  Would I force one of the “good plans”?  Probably – but I’d do some testing first.

Query with multiple plans that have variable performance

Understand that if I force a plan for a query, that’s the plan that’s going to get used unless forcing fails for some reason (e.g. the index no longer exists).  But does that plan work for all variations of the query?  Does that plan provide consistent performance for all the different input parameters that can be used for that query?  This requires testing…and oh by the way, concurrent with any testing/decision to force a plan I’m talking to the developers about ways to address this long-term.
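
For reference, forcing (and later un-forcing) a plan manually through Query Store is a single procedure call; the query_id and plan_id values below are placeholders that you would pull from the Query Store reports or catalog views:

EXEC sys.sp_query_store_force_plan @query_id = 317, @plan_id = 42;     -- placeholder IDs
GO

EXEC sys.sp_query_store_unforce_plan @query_id = 317, @plan_id = 42;   -- to stop forcing later
GO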

Now, out of my entire workload, if I have many queries that have multiple plans, where do I start?  With the worst offenders.  If I’m resource-bound in some way (e.g. CPU or I/O), then I would look at queries with the highest resource use and start working through those.  But I also look for the “death by a thousand cuts” scenario – the queries which execute hundreds or thousands of times a minute. As an aside, during the pre-con in Portugal one of the attendees had me look at a query in Query Store in the production environment.  There was concern because the query had multiple plans.  I pointed out that the query had executed 71,000 times in an hour…which is almost 20 times a second.  While I want to investigate multiple plans, I also want to know why a query executes so often.
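
To find those candidates with Query Store, a starting point (a sketch; adjust the thresholds and add a time filter for your workload) is to look for queries with more than one plan and rank them by resource use:

SELECT q.query_id,
     COUNT(DISTINCT p.plan_id) AS plan_count,
     SUM(rs.count_executions) AS total_executions,
     SUM(rs.avg_cpu_time * rs.count_executions) AS total_cpu_time   -- microseconds
FROM sys.query_store_query AS q
JOIN sys.query_store_plan AS p
     ON p.query_id = q.query_id
JOIN sys.query_store_runtime_stats AS rs
     ON rs.plan_id = p.plan_id
GROUP BY q.query_id
HAVING COUNT(DISTINCT p.plan_id) > 1
ORDER BY total_cpu_time DESC;
GO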

Thus far, I’ve talked about a workload…one workload.  What about the environment where you support hundreds of SQL Server instances?  You can obviously take the approach I’ve detailed above, which requires a review of poor-performing queries with multiple plans and deciding which plan (if any) to force until development addresses the issue.  Or, if you’re running SQL Server 2017 Enterprise Edition, you could look at Automatic Plan Correction, which will force a plan for a query (without human intervention) if there’s a regression.  I wrote a post (Automatic Plan Correction in SQL Server) on SQLPerformance.com about this feature, so I’m not going to re-hash the details here.
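
For reference, enabling the feature is a single database-level setting:

ALTER DATABASE CURRENT
     SET AUTOMATIC_TUNING (FORCE_LAST_GOOD_PLAN = ON);
GO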

Whether you force plans manually, or let SQL Server force them with the Automatic Plan Correction feature, I still view plan forcing as a temporary solution.  I don’t expect you to have plans forced for months, let alone years.  The life of a forced plan will, of course, depend on how quickly code and schema changes are ported to production.  If you go the “set it and forget it” route, theoretically a manually forced plan could get used for a very long time.  In that scenario, it’s your responsibility to periodically check to ensure that plan is still the “best” one for the query.  I would be checking every couple of weeks; once a month at most.  Whether or not the plan remains optimal depends on the tables involved in the query, the data in the tables, how that data changes (if it changes), other schema changes that may be introduced, and more.

Further, you don’t want to ignore forced plans because there are cases where a forced plan won’t be used (you can use Extended Events to monitor this).  When you force a plan manually, forcing can still fail.  For example, if the forced plan uses an index and the index is dropped, or its definition is changed to the point where it cannot be used in the plan in the same manner, then forcing will fail. Important note: if forcing fails, the query will go through normal optimization and compilation and it will execute; SQL Server does not want your query to fail!  If you’re forcing plans and not familiar with the reasons that forcing can fail, note the last_force_failure_reason values listed for sys.query_store_plan.  If you have manually forced a plan for a query and forcing fails, the plan remains forced.  You have to manually un-force it to stop SQL Server from trying to use that plan.  As you can see, there are multiple factors related to plan forcing, which is why you don’t just force a plan and forget it.
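
A quick way to keep an eye on manually forced plans, and whether forcing has failed, is to query sys.query_store_plan directly (a sketch):

SELECT plan_id,
     query_id,
     is_forced_plan,
     force_failure_count,
     last_force_failure_reason_desc
FROM sys.query_store_plan
WHERE is_forced_plan = 1;
GO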

This behavior is different if you’re using Automatic Plan Correction (APC).  As mentioned in the Automatic tuning documentation, if a plan is automatically forced, it will be automatically un-forced if:

  • forcing fails for any reason
  • there is a performance regression using the forced plan
  • there is a recompile due to a schema change or an update to statistics

With APC, there is still work to be done – here you want to use Extended Events or sys.dm_db_tuning_recommendations to see what plans are getting forced, and then decide if you want to force them manually. If you force a plan manually it will never be automatically un-forced.
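
A sketch for reviewing what Automatic Plan Correction has recommended or forced (the JSON paths come from the details column of the DMV):

SELECT name,
     reason,
     valid_since,
     last_refresh,
     JSON_VALUE(state, '$.currentValue') AS current_state,
     JSON_VALUE(details, '$.implementationDetails.script') AS force_plan_script
FROM sys.dm_db_tuning_recommendations;
GO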

There are a lot of considerations when you embrace plan forcing – I think it’s an excellent alternative to plan guides (much easier to use, not schema bound) and I think it’s absolutely worth a DBA or developer’s time to investigate what plan to force, and then use that as a temporary solution until a long-term fix can be put in place.  I hope this helps those of you that have been wary to give it a try!