Simplifying Availability Group Troubleshooting

The AlwaysOn_health event session in Extended Events is intended to make analyzing problems with Availability Groups possible after they have occurred.  While this event session goes a long way towards making it possible to piece together the puzzle of what went wrong in a lot of situations, it can still be a difficult task.  One of the things I wish Microsoft had included in the AlwaysON_health event session definition is the sqlserver.server_instance_name action for every event, and this is something that I usually recommend clients add to each of their AG servers using a script after I work with them the first time.  Why would this be useful, since if the files come from a specific server we should know the events are for that server right? Well, when we are working with AG configurations with more than two servers and trying to see the big picture of what is happening across the servers, it can be difficult to follow timelines with multiple files from multiple servers open.  It’s not impossible, but it does make things more difficult.

Merging Extended Events Files

When the UI for Extended Events was introduced in SSMS 2012, Microsoft included the ability to merge multiple XEL files into a single view in the UI, which can be really useful during AG problem analysis. All you have to do is grab the AlwaysOn_health*.xel files from each of the replica servers, copy them into a folder and point the UI to the folder using the File > Open > Merge Extended Event Files… menu item and you can see all of the events from all of the servers in a single place.

image

Adding server_instance_name to Events

EXCEPT…. there is no server instance name contained in the events by default so this becomes of limited use without modifying the AlwaysOn_health event session to add the sqlserver.server_instance_name action to all of the events.  This is easily accomplished in the UI using the multi-select options of the Configure tab on the event session properties as shown below, or may also be accomplished using the simple DDL script included at the bottom of this post.

image

ALTER EVENT SESSION [AlwaysOn_health] ON SERVER 
DROP EVENT sqlserver.alwayson_ddl_executed, 
DROP EVENT sqlserver.availability_group_lease_expired, 
DROP EVENT sqlserver.availability_replica_automatic_failover_validation, 
DROP EVENT sqlserver.availability_replica_manager_state_change, 
DROP EVENT sqlserver.availability_replica_state, 
DROP EVENT sqlserver.availability_replica_state_change, 
DROP EVENT sqlserver.error_reported, 
DROP EVENT sqlserver.hadr_db_partner_set_sync_state, 
DROP EVENT sqlserver.lock_redo_blocked;
GO

ALTER EVENT SESSION [AlwaysOn_health] ON SERVER 
ADD EVENT sqlserver.alwayson_ddl_executed(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_group_lease_expired(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_automatic_failover_validation(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_manager_state_change(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_state(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.availability_replica_state_change(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.error_reported(
    ACTION(sqlserver.server_instance_name)
    WHERE ([error_number]=(9691) OR [error_number]=(35204) OR [error_number]=(9693) OR [error_number]=(26024) OR [error_number]=(28047) OR [error_number]=(26023) OR [error_number]=(9692) OR [error_number]=(28034) OR [error_number]=(28036) OR [error_number]=(28048) OR [error_number]=(28080) OR [error_number]=(28091) OR [error_number]=(26022) OR [error_number]=(9642) OR [error_number]=(35201) OR [error_number]=(35202) OR [error_number]=(35206) OR [error_number]=(35207) OR [error_number]=(26069) OR [error_number]=(26070) OR [error_number]>(41047) AND [error_number]<(41056) OR [error_number]=(41142) OR [error_number]=(41144) OR [error_number]=(1480) OR [error_number]=(823) OR [error_number]=(824) OR [error_number]=(829) OR [error_number]=(35264) OR [error_number]=(35265) OR [error_number]=(41188) OR [error_number]=(41189))), 
ADD EVENT sqlserver.hadr_db_partner_set_sync_state(
    ACTION(sqlserver.server_instance_name)), 
ADD EVENT sqlserver.lock_redo_blocked(
    ACTION(sqlserver.server_instance_name))
GO

Once the event session is modified, the files can easily be merged together in the UI for all future troubleshooting and each event will have the originating server_instance_name attached to it by the action.

Tracking Compiles with Extended Events

This post comes from a question by Tara Kizer on DBA StackExchange related to Compilations/Sec in PerfMon and using Extended Events to try and track the compilations that were happening in SQL Server for one of their clients.  There are a number of events in Extended Events that relate to plans in SQL Server, and Tara was using the sqlserver.query_pre_execution_showplan event as a way of tracking compiles with Extended Events.  When I read the post on StackExchange this stood out to me as not being the correct event for tracking an actual compilation, and I would have selected the query_post_compilation_showplan event instead.  These two events have similar descriptions, but they differ by one specific word:

query_pre_execution_showplan

Occurs after a SQL statement is compiled. This event returns an XML representation of the estimated query plan that is generated when the query is optimized. Using this event can have a significant performance overhead so it should only be used when troubleshooting or monitoring specific problems for brief periods of time.

query_post_compilation_showplan

Occurs after a SQL statement is compiled. This event returns an XML representation of the estimated query plan that is generated when the query is compiled. Using this event can have a significant performance overhead so it should only be used when troubleshooting or monitoring specific problems for brief periods of time.

NOTE: Both of these event descriptions state “Using this event can have a significant performance overhead so it should only be used when troubleshooting or monitoring specific problems for brief periods of time.” There are only a handful of events in Extended Events that include this warning in their description and each can introduce significant performance impacts to production workloads, so using them in live workloads is not something I’d recommend. (query_post_execution_showplan impact post)

The events aren’t exactly the same in description and occur at different times using the repro example from the StackExchange post. Using a much larger event session definition that includes additional events, it is easy to see where compilations actually are happening.  The event session definition used for proofing this is:

CREATE EVENT SESSION [Compiles] ON SERVER
ADD EVENT sqlserver.query_post_compilation_showplan(
    ACTION(sqlserver.client_app_name,sqlserver.sql_text)
    WHERE ([sqlserver].[is_system]=(0))),
ADD EVENT sqlserver.query_pre_execution_showplan(
    ACTION(sqlserver.client_app_name,sqlserver.sql_text)
    WHERE ([sqlserver].[is_system]=(0))),
ADD EVENT sqlserver.sp_cache_hit(
    ACTION(sqlserver.client_app_name,sqlserver.sql_text)
    WHERE ([sqlserver].[is_system]=(0))),
ADD EVENT sqlserver.sp_cache_insert(
    ACTION(sqlserver.client_app_name,sqlserver.sql_text)
    WHERE ([sqlserver].[is_system]=(0))),
ADD EVENT sqlserver.sp_cache_miss(
    ACTION(sqlserver.client_app_name,sqlserver.sql_text)
    WHERE ([sqlserver].[is_system]=(0))),
ADD EVENT sqlserver.sp_cache_remove(
    ACTION(sqlserver.client_app_name,sqlserver.sql_text)
    WHERE ([sqlserver].[is_system]=(0))),
ADD EVENT sqlserver.sp_statement_completed(
    ACTION(sqlserver.client_app_name,sqlserver.sql_text)
    WHERE ([sqlserver].[is_system]=(0))),
ADD EVENT sqlserver.sp_statement_starting(
    ACTION(sqlserver.client_app_name,sqlserver.sql_text)
    WHERE ([sqlserver].[is_system]=(0))),
ADD EVENT sqlserver.sql_batch_completed(
    ACTION(sqlserver.client_app_name,sqlserver.sql_text)
    WHERE ([sqlserver].[is_system]=(0))),
ADD EVENT sqlserver.sql_batch_starting(
    ACTION(sqlserver.client_app_name,sqlserver.sql_text)
    WHERE ([sqlserver].[is_system]=(0))),
ADD EVENT sqlserver.sql_statement_completed(
    ACTION(sqlserver.client_app_name,sqlserver.sql_text)
    WHERE ([sqlserver].[is_system]=(0))),
ADD EVENT sqlserver.sql_statement_recompile(
    ACTION(sqlserver.client_app_name,sqlserver.sql_text)
    WHERE ([sqlserver].[is_system]=(0))),
ADD EVENT sqlserver.sql_statement_starting(
    ACTION(sqlserver.client_app_name,sqlserver.sql_text)
    WHERE ([sqlserver].[is_system]=(0))),
ADD EVENT sqlserver.uncached_sql_batch_statistics(
    ACTION(sqlserver.client_app_name,sqlserver.sql_text)
    WHERE ([sqlserver].[is_system]=(0)))
WITH (MAX_MEMORY=4096 KB,
      EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,
      MAX_DISPATCH_LATENCY=30 SECONDS,
      MAX_EVENT_SIZE=0 KB,
      MEMORY_PARTITION_MODE=NONE,
      TRACK_CAUSALITY=ON,
      STARTUP_STATE=OFF)
GO

Using part of the repro example from the StackExchange post with slight changes to include clearing the plan cache and also to break the steps down into individual batches and then individual statements we can show how the plans were being used from cache and where compiles actually are occurring.

CREATE TABLE t1 (transaction_id int, Column2 varchar(100));
CREATE TABLE t2 (Column1 varchar(max), Column2 varchar(100));
GO

CREATE TRIGGER t2_ins
ON t2
AFTER INSERT
AS

INSERT INTO t1
SELECT (SELECT TOP 1 transaction_id FROM sys.dm_tran_active_transactions), Column2
FROM inserted;
GO

DBCC FREEPROCCACHE
GO

--Both of these show compilation events
INSERT INTO t2 VALUES ('row1', 'value1');
INSERT INTO t2 VALUES ('row2', 'value2');
GO

--Both of these show compilation events
INSERT INTO t2 VALUES ('row1', 'value1');
INSERT INTO t2 VALUES ('row2', 'value2');
GO

--Both of these show compilation events
INSERT INTO t2 VALUES ('row1', 'value1');
GO

INSERT INTO t2 VALUES ('row2', 'value2');
GO

Here you can see the first compilation happening for the INSERT statements as prepared plans being auto-parameterized in the green box. The trigger is compiled in the red box and the plan is inserted into the cache as shown by the sp_cache_insert event. Then in the orange box the trigger execution gets a cache hit and reuses the trigger plan for the second INSERT statement in the batch, so it’s not compiling every execution of the INSERT command and the plan does get reused as you can see with the sp_cache_hit event for the trigger.  The uncached_sql_batch_statistics event also fires for the initial batch since it doesn’t have a plan in cache and has to compile.

If we run the two INSERT statements individually again after the first execution the trigger doesn’t compile again as shown in the events below:

Here the first statement encounters a cache hit for the prepared auto-parameterized version of the statement in cache but a miss for the adhoc batch that was submitted. The trigger gets a cache hit and doesn’t compile again as shown in the red block of events. The green block of events repeats this behavior for the second INSERT statement run as a separate batch. However, in every case you still see the query_pre_execution_showplan event firing which I can only attribute to the difference in being optimized vs compiled in the event description, but the trigger isn’t compiling for every execution as shown by these series of events.  The query_post_compilation_showplan event only fires when the plan is actually being compiled for the statements and/or trigger.

Tracking database usage with Extended Events returns invalid database_id?

I was recently emailed with a question about tracking database usage information using Extended Events and the person wanted to know why the event session was returning invalid database_id’s as a part of what it was collecting.  The funny thing is that I’ve seen this before during demo’s when I have a very specific concurrent workload running for what I am demonstrating and I used to make sure to do this exact thing so I could explain it during Immersion Events.  The TLDR answer to this is that Extended Events isn’t returning an invalid database_id at the time that the event occurs, the database_id just isn’t valid when the data is being reviewed or consumed.  How could that be you wonder?

DBCC CHECKDB or a manual user snapshot of a database; in either of these two scenarios, the snapshot used internally by DBCC CHECKDB (also CHECKTABLE, and the other CHECK commands) is assigned a database_id just like a regular user snapshot is, so when the event fires the database_id reported for the event is actually correct, but the snapshot is eliminated by the time the data is consumed, so using DB_NAME() on the database_id returns NULL making it appear like the event session is returning invalid data.