Simplifying Availability Group Troubleshooting

The AlwaysOn_health event session in Extended Events is intended to make analyzing problems with Availability Groups possible after they have occurred.  While this event session goes a long way towards making it possible to piece together the puzzle of what went wrong in a lot of situations, it can still be a difficult task.  One of the things I wish Microsoft had included in the AlwaysON_health event session definition is the sqlserver.server_instance_name action for every event, and this is something that I usually recommend clients add to each of their AG servers using a script after I work with them the first time.  Why would this be useful, since if the files come from a specific server we should know the events are for that server right? Well, when we are working with AG configurations with more than two servers and trying to see the big picture of what is happening across the servers, it can be difficult to follow timelines with multiple files from multiple servers open.  It’s not impossible, but it does make things more difficult.

Merging Extended Events Files

When the UI for Extended Events was introduced in SSMS 2012, Microsoft included the ability to merge multiple XEL files into a single view in the UI, which can be really useful during AG problem analysis. All you have to do is grab the AlwaysOn_health*.xel files from each of the replica servers, copy them into a folder and point the UI to the folder using the File > Open > Merge Extended Event Files… menu item and you can see all of the events from all of the servers in a single place.

image

Adding server_instance_name to Events

EXCEPT…. there is no server instance name contained in the events by default so this becomes of limited use without modifying the AlwaysOn_health event session to add the sqlserver.server_instance_name action to all of the events.  This is easily accomplished in the UI using the multi-select options of the Configure tab on the event session properties as shown below, or may also be accomplished using the simple DDL script included at the bottom of this post.

image

ALTER EVENT SESSION [AlwaysOn_health] ON SERVER 
DROP EVENT sqlserver.alwayson_ddl_executed, 
DROP EVENT sqlserver.availability_group_lease_expired, 
DROP EVENT sqlserver.availability_replica_automatic_failover_validation, 
DROP EVENT sqlserver.availability_replica_manager_state_change, 
DROP EVENT sqlserver.availability_replica_state, 
DROP EVENT sqlserver.availability_replica_state_change, 
DROP EVENT sqlserver.error_reported, 
DROP EVENT sqlserver.hadr_db_partner_set_sync_state, 
DROP EVENT sqlserver.lock_redo_blocked;
GO

ALTER EVENT SESSION [AlwaysOn_health] ON SERVER 
ADD EVENT sqlserver.alwayson_ddl_executed(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_group_lease_expired(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_automatic_failover_validation(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_manager_state_change(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_state(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_state_change(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.error_reported(
    ACTION(sqlserver.server_instance_name)
    WHERE ([error_number]=(9691) OR [error_number]=(35204) OR [error_number]=(9693) OR [error_number]=(26024) OR [error_number]=(28047) OR [error_number]=(26023) OR [error_number]=(9692) OR [error_number]=(28034) OR [error_number]=(28036) OR [error_number]=(28048) OR [error_number]=(28080) OR [error_number]=(28091) OR [error_number]=(26022) OR [error_number]=(9642) OR [error_number]=(35201) OR [error_number]=(35202) OR [error_number]=(35206) OR [error_number]=(35207) OR [error_number]=(26069) OR [error_number]=(26070) OR [error_number]>(41047) AND [error_number]<(41056) OR [error_number]=(41142) OR [error_number]=(41144) OR [error_number]=(1480) OR [error_number]=(823) OR [error_number]=(824) OR [error_number]=(829) OR [error_number]=(35264) OR [error_number]=(35265) OR [error_number]=(41188) OR [error_number]=(41189))), 
ADD EVENT sqlserver.hadr_db_partner_set_sync_state(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.lock_redo_blocked(
    ACTION(sqlserver.server_instance_name))
GO

Once the event session is modified, the files can easily be merged together in the UI for all future troubleshooting and each event will have the originating server_instance_name attached to it by the action.

Downgrading SQL Server Editions

At some point in your career working with SQL Server, you will run into a situation where the wrong edition of SQL Server has been installed on a server and will need to change the edition for licensing reasons.  Whether it is Enterprise Edition where Standard Edition should have been installed, Enterprise Edition where Developer Edition should have been used, or my favorite, Evaluation Edition where the 180 day trial has expired and Enterprise Edition isn’t going to be used, the only route available for downgrading the edition is to uninstall and reinstall SQL Server entirely.  SQL Server Setup makes upgrading editions a piece of cake with SKUUPGRADE as a command line option for going from Standard/Developer/Evaluation to Enterprise, but anything else requires a full uninstall and reinstall to change the SKU/Edition and then restore all of the system and user databases to the new instance, which typically means a lot of work.  I hate having to restore system databases and avoid having to do it if possible, so here is how I do this process and minimize the work required:

No matter what you are going to have to do an uninstall and reinstall of the SQL Server instance to downgrade the SKU.  However, you can save yourself some time and the headache of trying to restore the system databases if you are careful about what you do.  I have done a plenty of SKU downgrades in the past and the easiest way to do it, and I am not saying this is the Microsoft supported way but that it works if done correctly, is to:

  1. Take a good backup of all of your databases (system and user).  
  2. Run SELECT @@VERSION and note the specific build number of SQL Server that you are currently on.
  3. Shut down your existing instance of SQL Server.  
  4. Copy the master, model, and msdb database files (both mdf and ldf), don’t move them copy them, from the current location to a new folder that you mark as readonly. 
  5. Uninstall SQL Server from the system.
  6. Reboot the server.
  7. Install SQL Server Standard Edition.
  8. Apply the necessary Service Pack and/or Cumulative Updates to bring the instance up to your previous build number.
  9. Shutdown SQL Server.
  10. Copy the master, model, and msdb database files (both mdf and ldf) from the folder you saved them in to the correct location for the new install and remove the readonly flag from the files, and change the file ACL’s so that the SQL Service account has Full Control over the files.
  11. Startup SQL Server and if you did it correctly it will startup and be exactly where you were before you made any changes, with all of your user databases online and you should be ready to let applications connect and resume operations.

If you screw something up in the above, you still have your backups and you can run setup to rebuild the system databases and then go about following the Microsoft supported path for restoring the system databases and then user databases into the system to bring it online.  Essentially the file copy is no different that what would occur through attach/detach you are just doing it with system databases which is not explicitly supported, but it does work.  The key is to have your backups from before you do anything so you have the supported route available if you encounter an issue.  The only issue I have ever had doing this set of steps is that I didn’t set the file ACL’s correctly and the database engine threw Access Denied errors and failed to start until I fixed the ACL’s correctly.  This can save you many hours of frustration and downtime trying to restore everything since the database files are already there and it is just some small copy operations to put them where you need them to be.

SSMS Availability Group Sync Addin Update

The SSMS Availability Group addin I wrote a few years back for SQL Server 2012 has been updated and released with SSMS 2014 support available.  This has been something that has been highly requested and I’ve been through a couple of rounds of beta tests privately, so I’ve put together a new installer that will install the 2012 version and the 2014 version side by side.  You can get the new installer from the download page for the Addin in our Free Tools section. If you have the older version of the Addin, you might need to uninstall it first for the newer updates to the 2012 version to be installed.  There is no current support for SSMS 2016 due to the change in VS shell platforms to VS2015 which requires a complete overhaul of the Addin to a VS Extension (VSIX) package.