Simplifying Availability Group Troubleshooting

The AlwaysOn_health event session in Extended Events is intended to make analyzing problems with Availability Groups possible after they have occurred.  While this event session goes a long way towards making it possible to piece together the puzzle of what went wrong in a lot of situations, it can still be a difficult task.  One of the things I wish Microsoft had included in the AlwaysON_health event session definition is the sqlserver.server_instance_name action for every event, and this is something that I usually recommend clients add to each of their AG servers using a script after I work with them the first time.  Why would this be useful, since if the files come from a specific server we should know the events are for that server right? Well, when we are working with AG configurations with more than two servers and trying to see the big picture of what is happening across the servers, it can be difficult to follow timelines with multiple files from multiple servers open.  It’s not impossible, but it does make things more difficult.

Merging Extended Events Files

When the UI for Extended Events was introduced in SSMS 2012, Microsoft included the ability to merge multiple XEL files into a single view in the UI, which can be really useful during AG problem analysis. All you have to do is grab the AlwaysOn_health*.xel files from each of the replica servers, copy them into a folder and point the UI to the folder using the File > Open > Merge Extended Event Files… menu item and you can see all of the events from all of the servers in a single place.

image

Adding server_instance_name to Events

EXCEPT…. there is no server instance name contained in the events by default so this becomes of limited use without modifying the AlwaysOn_health event session to add the sqlserver.server_instance_name action to all of the events.  This is easily accomplished in the UI using the multi-select options of the Configure tab on the event session properties as shown below, or may also be accomplished using the simple DDL script included at the bottom of this post.

image

ALTER EVENT SESSION [AlwaysOn_health] ON SERVER 
DROP EVENT sqlserver.alwayson_ddl_executed, 
DROP EVENT sqlserver.availability_group_lease_expired, 
DROP EVENT sqlserver.availability_replica_automatic_failover_validation, 
DROP EVENT sqlserver.availability_replica_manager_state_change, 
DROP EVENT sqlserver.availability_replica_state, 
DROP EVENT sqlserver.availability_replica_state_change, 
DROP EVENT sqlserver.error_reported, 
DROP EVENT sqlserver.hadr_db_partner_set_sync_state, 
DROP EVENT sqlserver.lock_redo_blocked;
GO

ALTER EVENT SESSION [AlwaysOn_health] ON SERVER 
ADD EVENT sqlserver.alwayson_ddl_executed(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_group_lease_expired(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_automatic_failover_validation(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_manager_state_change(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_state(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_state_change(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.error_reported(
    ACTION(sqlserver.server_instance_name)
    WHERE ([error_number]=(9691) OR [error_number]=(35204) OR [error_number]=(9693) OR [error_number]=(26024) OR [error_number]=(28047) OR [error_number]=(26023) OR [error_number]=(9692) OR [error_number]=(28034) OR [error_number]=(28036) OR [error_number]=(28048) OR [error_number]=(28080) OR [error_number]=(28091) OR [error_number]=(26022) OR [error_number]=(9642) OR [error_number]=(35201) OR [error_number]=(35202) OR [error_number]=(35206) OR [error_number]=(35207) OR [error_number]=(26069) OR [error_number]=(26070) OR [error_number]>(41047) AND [error_number]<(41056) OR [error_number]=(41142) OR [error_number]=(41144) OR [error_number]=(1480) OR [error_number]=(823) OR [error_number]=(824) OR [error_number]=(829) OR [error_number]=(35264) OR [error_number]=(35265) OR [error_number]=(41188) OR [error_number]=(41189))), 
ADD EVENT sqlserver.hadr_db_partner_set_sync_state(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.lock_redo_blocked(
    ACTION(sqlserver.server_instance_name))
GO

Once the event session is modified, the files can easily be merged together in the UI for all future troubleshooting and each event will have the originating server_instance_name attached to it by the action.

Database Master Keys and Availability Groups

Recently I received an email from a client experimenting with Availability Groups for the first time in a development environment that had run into an issue with adding one database out of fifteen to the Availability Group in the UI.  The database in question had Password required beside it in the UI instead of the Meets prerequisites link all the other databases had. Clicking the Password required link produced a non-informative popup dialog saying “This database is encrypted by database master key, you need to provide valid password when adding it to the availability group.”

image

This was something I ran into a couple of years ago, and I did an Insider Demo video of this UI behavior for our SQLskills Insider Newsletter to demonstrate how the UI changed in SSMS 2016 to support encrypted databases and databases with encrypted content not using TDE as well. This is not caused by TDE, but by a database master key (DMK) that is protecting something like a certificate or an encrypted column in a table. The UI actually has a Password column next to the Status column, though it’s not very intuitive or clear that there are text boxes available for each database that you can click on to provide the password for the database master key.

clip_image002

In the screenshot above, the SSISDB requires a password to be added to the Availability Group, and the password has been provided in the Password column of the UI.  To do this, actually requires a series of steps that aren’t exactly intuitive. 

  1. First, double click in the Password column and it will become a text box that is editable. 
  2. Type the password in.
  3. Click Refresh at the bottom of the screen which will make the check box for the database enabled.
  4. Check the checkbox for the database.
  5. Click Refresh a second time, which will enable the Next button at the bottom.
  6. Click Next to progress to the next screen in the wizard.

This took me 20 minutes to figure out in the back of the room during one of our Immersion Events after SQL Server 2016 released so I could answer a question for someone that wasn’t even related to this UI functionality about the SSISDB catalog and how it behaved in an Availability Group if an failover occurs during package execution. 

SSMS Availability Group Sync Addin Update

The SSMS Availability Group addin I wrote a few years back for SQL Server 2012 has been updated and released with SSMS 2014 support available.  This has been something that has been highly requested and I’ve been through a couple of rounds of beta tests privately, so I’ve put together a new installer that will install the 2012 version and the 2014 version side by side.  You can get the new installer from the download page for the Addin in our Free Tools section. If you have the older version of the Addin, you might need to uninstall it first for the newer updates to the 2012 version to be installed.  There is no current support for SSMS 2016 due to the change in VS shell platforms to VS2015 which requires a complete overhaul of the Addin to a VS Extension (VSIX) package.