Simplifying Availability Group Troubleshooting

The AlwaysOn_health event session in Extended Events is intended to make analyzing problems with Availability Groups possible after they have occurred.  While this event session goes a long way towards making it possible to piece together the puzzle of what went wrong in a lot of situations, it can still be a difficult task.  One of the things I wish Microsoft had included in the AlwaysON_health event session definition is the sqlserver.server_instance_name action for every event, and this is something that I usually recommend clients add to each of their AG servers using a script after I work with them the first time.  Why would this be useful, since if the files come from a specific server we should know the events are for that server right? Well, when we are working with AG configurations with more than two servers and trying to see the big picture of what is happening across the servers, it can be difficult to follow timelines with multiple files from multiple servers open.  It’s not impossible, but it does make things more difficult.

Merging Extended Events Files

When the UI for Extended Events was introduced in SSMS 2012, Microsoft included the ability to merge multiple XEL files into a single view in the UI, which can be really useful during AG problem analysis. All you have to do is grab the AlwaysOn_health*.xel files from each of the replica servers, copy them into a folder and point the UI to the folder using the File > Open > Merge Extended Event Files… menu item and you can see all of the events from all of the servers in a single place.

image

Adding server_instance_name to Events

EXCEPT…. there is no server instance name contained in the events by default so this becomes of limited use without modifying the AlwaysOn_health event session to add the sqlserver.server_instance_name action to all of the events.  This is easily accomplished in the UI using the multi-select options of the Configure tab on the event session properties as shown below, or may also be accomplished using the simple DDL script included at the bottom of this post.

image

ALTER EVENT SESSION [AlwaysOn_health] ON SERVER 
DROP EVENT sqlserver.alwayson_ddl_executed, 
DROP EVENT sqlserver.availability_group_lease_expired, 
DROP EVENT sqlserver.availability_replica_automatic_failover_validation, 
DROP EVENT sqlserver.availability_replica_manager_state_change, 
DROP EVENT sqlserver.availability_replica_state, 
DROP EVENT sqlserver.availability_replica_state_change, 
DROP EVENT sqlserver.error_reported, 
DROP EVENT sqlserver.hadr_db_partner_set_sync_state, 
DROP EVENT sqlserver.lock_redo_blocked;
GO

ALTER EVENT SESSION [AlwaysOn_health] ON SERVER 
ADD EVENT sqlserver.alwayson_ddl_executed(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_group_lease_expired(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_automatic_failover_validation(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_manager_state_change(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_state(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_state_change(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.error_reported(
    ACTION(sqlserver.server_instance_name)
    WHERE ([error_number]=(9691) OR [error_number]=(35204) OR [error_number]=(9693) OR [error_number]=(26024) OR [error_number]=(28047) OR [error_number]=(26023) OR [error_number]=(9692) OR [error_number]=(28034) OR [error_number]=(28036) OR [error_number]=(28048) OR [error_number]=(28080) OR [error_number]=(28091) OR [error_number]=(26022) OR [error_number]=(9642) OR [error_number]=(35201) OR [error_number]=(35202) OR [error_number]=(35206) OR [error_number]=(35207) OR [error_number]=(26069) OR [error_number]=(26070) OR [error_number]>(41047) AND [error_number]<(41056) OR [error_number]=(41142) OR [error_number]=(41144) OR [error_number]=(1480) OR [error_number]=(823) OR [error_number]=(824) OR [error_number]=(829) OR [error_number]=(35264) OR [error_number]=(35265) OR [error_number]=(41188) OR [error_number]=(41189))), 
ADD EVENT sqlserver.hadr_db_partner_set_sync_state(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.lock_redo_blocked(
    ACTION(sqlserver.server_instance_name))
GO

Once the event session is modified, the files can easily be merged together in the UI for all future troubleshooting and each event will have the originating server_instance_name attached to it by the action.

3 thoughts on “Simplifying Availability Group Troubleshooting

  1. Cool! Do you think this is more useful than just getting the full cluster logs from all the servers?

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.