New AlwaysOn_health Extended Events Session in SQL Server 2012 RC0

While setting up my new Availability Group using SQL Server 2012 RC0 tonight, I noticed an interesting new addition to Extended Events associated with Availability Group configuration in the Release Candidate. When you setup an Availability Group in RC0, another default Event Session is created on the servers that participate in the Availability Group to provide monitoring of the health of the Availability Group overall. The definition of the monitoring session is as follows:

CREATE EVENT SESSION [AlwaysOn_health] ON SERVER
ADD EVENT sqlserver.alwayson_ddl_executed,
ADD EVENT sqlserver.availability_group_lease_expired,
ADD EVENT sqlserver.availability_replica_automatic_failover_validation,
ADD EVENT sqlserver.availability_replica_manager_state_change,
ADD EVENT sqlserver.availability_replica_state_change,
ADD EVENT sqlserver.error_reported(
WHERE ([error_number]=(9691) OR [error_number]=(35204) OR [error_number]=(9693) OR [error_number]=(26024) OR [error_number]=(28047) OR [error_number]=(26023) OR [error_number]=(9692) OR [error_number]=(28034) OR [error_number]=(28036) OR [error_number]=(28048) OR [error_number]=(28080) OR [error_number]=(28091) OR [error_number]=(26022) OR [error_number]=(9642) OR [error_number]=(35201) OR [error_number]=(35202) OR [error_number]=(35206) OR [error_number]=(35207) OR [error_number]=(26069) OR [error_number]=(26070) OR [error_number]>(41047) AND [error_number]<(41056) OR [error_number]=(41142) OR [error_number]=(41144) OR [error_number]=(1480) OR [error_number]=(823) OR [error_number]=(824) OR [error_number]=(829) OR [error_number]=(35264) OR [error_number]=(35265))),
ADD EVENT sqlserver.lock_redo_blocked
ADD TARGET package0.event_file(SET filename=N'AlwaysOn_health.xel',max_file_size=(5),max_rollover_files=(4))
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=ON)
GO

This Extended Event Session monitors a number of critical events in the system but one of the problems with figuring out what exactly this Event Session is monitoring is to figure out what all the predicate values on the sqlserver.error_reported event are actually firing on. To that aspect of things, we can do a quick reuse of the predicate on this event be doing a replace SSMS on the [error_number] value with a replace for an alias to a query to sys.messages, on the message_id column from the DMV as follows:

SELECT message_id, severity, is_event_logged, text
FROM sys.messages AS m
WHERE m.language_id = SERVERPROPERTY('LCID')
AND (m.message_id=(9691)
OR m.message_id=(35204) OR m.message_id=(9693) OR m.message_id=(26024) OR m.message_id=(28047) OR m.message_id=(26023) OR m.message_id=(9692) OR m.message_id=(28034) OR m.message_id=(28036) OR m.message_id=(28048) OR m.message_id=(28080) OR m.message_id=(28091) OR m.message_id=(26022) OR m.message_id=(9642) OR m.message_id=(35201) OR m.message_id=(35202) OR m.message_id=(35206) OR m.message_id=(35207) OR m.message_id=(26069) OR m.message_id=(26070) OR m.message_id>(41047) AND m.message_id<(41056) OR m.message_id=(41142) OR m.message_id=(41144) OR m.message_id=(1480) OR m.message_id=(823) OR m.message_id=(824) OR m.message_id=(829) OR m.message_id=(35264) OR m.message_id=(35265) )

This will give us a list of the error messages that the Event Session will actually fire events for:

message_id	severity	is_event_logged	text
823	24	1	The operating system returned error %ls to SQL Server during a %S_MSG at offset %#016I64x in file '%ls'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
824	24	1	SQL Server detected a logical consistency-based I/O error: %ls. It occurred during a %S_MSG of page %S_PGID in database ID %d at offset %#016I64x in file '%ls'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
829	21	1	Database ID %d, Page %S_PGID is marked RestorePending, which may indicate disk corruption. To recover from this state, perform a restore.
1480	10	0	The %S_MSG database "%.*ls" is changing roles from "%ls" to "%ls" because the mirroring session or availability group failed over due to %S_MSG. This is an informational message only. No user action is required.
9642	16	0	An error occurred in a Service Broker/Database Mirroring transport connection endpoint, Error: %i, State: %i. (Near endpoint role: %S_MSG, far endpoint address: '%.*hs')
9691	10	0	The %S_MSG endpoint has stopped listening for connections.
9692	16	0	The %S_MSG endpoint cannot listen on port %d because it is in use by another process.
9693	16	0	The %S_MSG endpoint cannot listen for connections due to the following error: '%.*ls'.
26022	10	1	Server is listening on [ %hs <%hs> %d].
26023	16	1	Server TCP provider failed to listen on [ %hs <%hs> %d]. Tcp port is already in use.
26024	16	1	Server failed to listen on %hs <%hs> %d. Error: %#x. To proceed, notify your system administrator.
26069	10	1	Started listening on virtual network name '%ls'. No user action is required.
26070	10	1	Stopped listening on virtual network name '%ls'. No user action is required.
28034	10	0	Connection handshake failed. The login '%.*ls' does not have CONNECT permission on the endpoint. State %d.
28036	10	0	Connection handshake failed. The certificate used by this endpoint was not found: %S_MSG. Use DBCC CHECKDB in master database to verify the metadata integrity of the endpoints. State %d.
28047	10	1	%S_MSG login attempt failed with error: '%.ls'. %.ls
28048	10	1	%S_MSG login attempt by user '%.ls' failed with error: '%.ls'. %.*ls
28080	10	0	Connection handshake failed. The %S_MSG endpoint is not configured. State %d.
28091	10	0	Starting endpoint for %S_MSG with no authentication is not supported.
35201	10	0	A connection timeout has occurred while attempting to establish a connection to availability replica '%ls' with id [%ls]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance.
35202	10	0	A connection for availability group '%ls' from availability replica '%ls' with id [%ls] to '%ls' with id [%ls] has been successfully established. This is an informational message only. No user action is required.
35204	10	0	The connection between server instances '%ls' with id [%ls] and '%ls' with id [%ls] has been disabled because the database mirroring endpoint was either disabled or stopped. Restart the endpoint by using the ALTER ENDPOINT Transact-SQL statement with STATE = STARTED.
35206	10	0	A connection timeout has occurred on a previously established connection to availability replica '%ls' with id [%ls]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.
35207	16	0	Connection attempt on availability group id '%ls' from replica id '%ls' to replica id '%ls' failed because of error %d, severity %d, state %d.
35264	10	0	AlwaysOn Availability Groups data movement for database '%.ls' has been suspended for the following reason: "%S_MSG" (Source ID %d; Source string: '%.ls'). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.
35265	10	0	AlwaysOn Availability Groups data movement for database '%.*ls' has been resumed. This is an informational message only. No user action is required.
41048	10	1	AlwaysOn Availability Groups: Local Windows Server Failover Clustering service has become unavailable. This is an informational message only. No user action is required.
41049	10	1	AlwaysOn Availability Groups: Local Windows Server Failover Clustering node is no longer online. This is an informational message only. No user action is required.
41050	10	1	AlwaysOn Availability Groups: Waiting for local Windows Server Failover Clustering service to start. This is an informational message only. No user action is required.
41051	10	1	AlwaysOn Availability Groups: Local Windows Server Failover Clustering service started. This is an informational message only. No user action is required.
41052	10	1	AlwaysOn Availability Groups: Waiting for local Windows Server Failover Clustering node to start. This is an informational message only. No user action is required.
41053	10	1	AlwaysOn Availability Groups: Local Windows Server Failover Clustering node started. This is an informational message only. No user action is required.
41054	10	1	AlwaysOn Availability Groups: Waiting for local Windows Server Failover Clustering node to come online. This is an informational message only. No user action is required.
41055	10	1	AlwaysOn Availability Groups: Local Windows Server Failover Clustering node is online. This is an informational message only. No user action is required.
41142	16	0	The availability replica for availability group '%.*ls' on this instance of SQL Server cannot become the primary replica. One or more databases are not synchronized or have not joined the availability group, or the WSFC cluster was started in Force Quorum mode. If the cluster was started in Force Quorum mode or the availability replica uses the asynchronous-commit mode, consider performing a forced manual failover (with possible data loss). Otherwise, once all local secondary databases are joined and synchronized, you can perform a planned manual failover to this secondary replica (without data loss). For more information, see SQL Server Books Online.
41144	16	0	The local availability replica of availability group '%.*ls' is in a failed state. The replica failed to read or update the persisted configuration data (SQL Server error: %d). To recover from this failure, either restart the local Windows Server Failover Clustering (WSFC) service or restart the local instance of SQL Server.

Based on this output, and the output of the following query:

SELECT name, description
FROM sys.dm_xe_objects
WHERE NAME IN (
'alwayson_ddl_executed',
'availability_group_lease_expired',
'availability_replica_automatic_failover_validation',
'availability_replica_manager_state_change',
'availability_replica_state_change',
'error_reported',
'lock_redo_blocked')

We can deduce the following about the Event Session:

CREATE EVENT SESSION [AlwaysOn_health] ON SERVER
–Occurs when AlwaysOn DDL is executed including CREATE, ALTER or DROP
ADD EVENT sqlserver.alwayson_ddl_executed,
–Occurs when there is a connectivity issue between the cluster and the Availability Group resulting
–in a failure to renew the lease
ADD EVENT sqlserver.availability_group_lease_expired,
–Occurs when the failover validates the readiness of replica as a primary. For instance, the failover
–validation will return false when not all databases are synchronized or not joined
ADD EVENT sqlserver.availability_replica_automatic_failover_validation,
–Occurs when the state of the Availability Replica Manager has changed.
ADD EVENT sqlserver.availability_replica_manager_state_change,
–Occurs when the state of the Availability Replica has changed.
ADD EVENT sqlserver.availability_replica_state_change,
–Occurs when an error is reported based on the previously listed table
ADD EVENT sqlserver.error_reported(
WHERE ([error_number]=(9691) OR [error_number]=(35204) OR [error_number]=(9693) OR [error_number]=(26024) OR [error_number]=(28047) OR [error_number]=(26023) OR [error_number]=(9692) OR [error_number]=(28034) OR [error_number]=(28036) OR [error_number]=(28048) OR [error_number]=(28080) OR [error_number]=(28091) OR [error_number]=(26022) OR [error_number]=(9642) OR [error_number]=(35201) OR [error_number]=(35202) OR [error_number]=(35206) OR [error_number]=(35207) OR [error_number]=(26069) OR [error_number]=(26070) OR [error_number]>(41047) AND [error_number]<(41056) OR [error_number]=(41142) OR [error_number]=(41144) OR [error_number]=(1480) OR [error_number]=(823) OR [error_number]=(824) OR [error_number]=(829) OR [error_number]=(35264) OR [error_number]=(35265))),
–Occurs when the redo thread blocks when trying to acquire a lock.
ADD EVENT sqlserver.lock_redo_blocked
–Writes to the file target for persistence in the system beyond failovers and service restarts
ADD TARGET package0.event_file(SET filename=N'AlwaysOn_health.xel',max_file_size=(5),max_rollover_files=(4))
WITH (MAX_MEMORY=4096 KB, EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS, MAX_DISPATCH_LATENCY=30 SECONDS, MAX_EVENT_SIZE=0 KB, MEMORY_PARTITION_MODE=NONE, TRACK_CAUSALITY=OFF, STARTUP_STATE=ON)
GO

What is really cool is that this Event Session is used by the Availability Groups Dashboard to provide an overall status of the health of the Availability Group in Management Studio.

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.

New AlwaysOn_health Extended Events Session in SQL Server 2012 RC0

Leave a Reply Cancel reply

Other articles

Do you need Trace Flag 460?

Event exceeds the number of allowable bound actions

SQL Server 2022: Measuring Extended Events Performance Impacts

What’s New in SQL Server 2022: GENERATE_SERIES()

Logging Database Scoped Configuration Changes

SqlClient Configurable Retry Logic – ICYMI

Imagine feeling confident enough to handle whatever your database throws at you.