New AlwaysOn_health Extended Events Session in SQL Server 2012 RC0

While setting up my new Availability Group using SQL Server 2012 RC0 tonight, I noticed an interesting new addition to Extended Events associated with Availability Group configuration in the Release Candidate.  When you setup an Availability Group in RC0, another default Event Session is created on the servers that participate in the Availability Group to provide monitoring of the health of the Availability Group overall.  The definition of the monitoring session is as follows:

CREATE EVENT SESSION [AlwaysOn_health] ON SERVER
ADD EVENT sqlserver.alwayson_ddl_executed,
ADD EVENT sqlserver.availability_group_lease_expired,
ADD EVENT sqlserver.availability_replica_automatic_failover_validation,
ADD EVENT sqlserver.availability_replica_manager_state_change,
ADD EVENT sqlserver.availability_replica_state_change,
ADD EVENT sqlserver.error_reported(
    WHERE ([error_number]=(9691) OR [error_number]=(35204) OR [error_number]=(9693) OR [error_number]=(26024) OR [error_number]=(28047) OR [error_number]=(26023) OR [error_number]=(9692) OR [error_number]=(28034) OR [error_number]=(28036) OR [error_number]=(28048) OR [error_number]=(28080) OR [error_number]=(28091) OR [error_number]=(26022) OR [error_number]=(9642) OR [error_number]=(35201) OR [error_number]=(35202) OR [error_number]=(35206) OR [error_number]=(35207) OR [error_number]=(26069) OR [error_number]=(26070) OR [error_number]>(41047) AND [error_number]<(41056) OR [error_number]=(41142) OR [error_number]=(41144) OR [error_number]=(1480) OR [error_number]=(823) OR [error_number]=(824) OR [error_number]=(829) OR [error_number]=(35264) OR [error_number]=(35265))),
ADD EVENT sqlserver.lock_redo_blocked
ADD TARGET package0.event_file(SET filename=N'AlwaysOn_health.xel',max_file_size=(5),max_rollover_files=(4))
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=ON)
GO

This Extended Event Session monitors a number of critical events in the system but one of the problems with figuring out what exactly this Event Session is monitoring is to figure out what all the predicate values on the sqlserver.error_reported event are actually firing on.  To that aspect of things, we can do a quick reuse of the predicate on this event be doing a replace SSMS on the [error_number] value with a replace for an alias to a query to sys.messages, on the message_id column from the DMV as follows:

SELECT message_id, severity, is_event_logged, text
FROM sys.messages AS m
WHERE m.language_id = SERVERPROPERTY('LCID')
  AND  (m.message_id=(9691)
        OR m.message_id=(35204)
        OR m.message_id=(9693)
        OR m.message_id=(26024)
        OR m.message_id=(28047)
        OR m.message_id=(26023)
        OR m.message_id=(9692)
        OR m.message_id=(28034)
        OR m.message_id=(28036)
        OR m.message_id=(28048)
        OR m.message_id=(28080)
        OR m.message_id=(28091)
        OR m.message_id=(26022)
        OR m.message_id=(9642)
        OR m.message_id=(35201)
        OR m.message_id=(35202)
        OR m.message_id=(35206)
        OR m.message_id=(35207)
        OR m.message_id=(26069)
        OR m.message_id=(26070)
        OR m.message_id>(41047)
        AND m.message_id<(41056)
        OR m.message_id=(41142)
        OR m.message_id=(41144)
        OR m.message_id=(1480)
        OR m.message_id=(823)
        OR m.message_id=(824)
        OR m.message_id=(829)
        OR m.message_id=(35264)
        OR m.message_id=(35265)
)

This will give us a list of the error messages that the Event Session will actually fire events for:

message_id

severity is_event_logged text
823 24 1 The operating system returned error %ls to SQL Server during a %S_MSG at offset %#016I64x in file '%ls'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
824 24 1 SQL Server detected a logical consistency-based I/O error: %ls. It occurred during a %S_MSG of page %S_PGID in database ID %d at offset %#016I64x in file '%ls'.  Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
829 21 1 Database ID %d, Page %S_PGID is marked RestorePending, which may indicate disk corruption. To recover from this state, perform a restore.
1480 10 0 The %S_MSG database "%.*ls" is changing roles from "%ls" to "%ls" because the mirroring session or availability group failed over due to %S_MSG. This is an informational message only. No user action is required.
9642 16 0 An error occurred in a Service Broker/Database Mirroring transport connection endpoint, Error: %i, State: %i. (Near endpoint role: %S_MSG, far endpoint address: '%.*hs')
9691 10 0 The %S_MSG endpoint has stopped listening for connections.
9692 16 0 The %S_MSG endpoint cannot listen on port %d because it is in use by another process.
9693 16 0 The %S_MSG endpoint cannot listen for connections due to the following error: '%.*ls'.
26022 10 1 Server is listening on [ %hs <%hs> %d].
26023 16 1 Server TCP provider failed to listen on [ %hs <%hs> %d]. Tcp port is already in use.
26024 16 1 Server failed to listen on %hs <%hs> %d. Error: %#x. To proceed, notify your system administrator.
26069 10 1 Started listening on virtual network name '%ls'. No user action is required.
26070 10 1 Stopped listening on virtual network name '%ls'. No user action is required.
28034 10 0 Connection handshake failed. The login '%.*ls' does not have CONNECT permission on the endpoint. State %d.
28036 10 0 Connection handshake failed. The certificate used by this endpoint was not found: %S_MSG. Use DBCC CHECKDB in master database to verify the metadata integrity of the endpoints. State %d.
28047 10 1 %S_MSG login attempt failed with error: '%.*ls'. %.*ls
28048 10 1 %S_MSG login attempt by user '%.*ls' failed with error: '%.*ls'. %.*ls
28080 10 0 Connection handshake failed. The %S_MSG endpoint is not configured. State %d.
28091 10 0  Starting endpoint for %S_MSG with no authentication is not supported.
35201 10 0 A connection timeout has occurred while attempting to establish a connection to availability replica '%ls' with id [%ls]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance.
35202 10 0 A connection for availability group '%ls' from availability replica '%ls' with id  [%ls] to '%ls' with id [%ls] has been successfully established.  This is an informational message only. No user action is required.
35204 10 0 The connection between server instances '%ls' with id [%ls] and '%ls' with id [%ls] has been disabled because the database mirroring endpoint was either disabled or stopped. Restart the endpoint by using the ALTER ENDPOINT Transact-SQL statement with STATE = STARTED.
35206 10 0 A connection timeout has occurred on a previously established connection to availability replica '%ls' with id [%ls].  Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role.
35207 16 0 Connection attempt on availability group id '%ls' from replica id '%ls' to replica id '%ls' failed because of error %d, severity %d, state %d.
35264 10 0 AlwaysOn Availability Groups data movement for database '%.*ls' has been suspended for the following reason: "%S_MSG" (Source ID %d; Source string: '%.*ls'). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online.
35265 10 0 AlwaysOn Availability Groups data movement for database '%.*ls' has been resumed. This is an informational message only. No user action is required.
41048 10 1 AlwaysOn Availability Groups: Local Windows Server Failover Clustering service has become unavailable. This is an informational message only. No user action is required.
41049 10 1 AlwaysOn Availability Groups: Local Windows Server Failover Clustering node is no longer online. This is an informational message only. No user action is required.
41050 10 1 AlwaysOn Availability Groups: Waiting for local Windows Server Failover Clustering service to start. This is an informational message only. No user action is required.
41051 10 1 AlwaysOn Availability Groups: Local Windows Server Failover Clustering service started. This is an informational message only. No user action is required.
41052 10 1 AlwaysOn Availability Groups: Waiting for local Windows Server Failover Clustering node to start. This is an informational message only. No user action is required.
41053 10 1 AlwaysOn Availability Groups: Local Windows Server Failover Clustering node started. This is an informational message only. No user action is required.
41054 10 1 AlwaysOn Availability Groups: Waiting for local Windows Server Failover Clustering node to come online. This is an informational message only. No user action is required.
41055 10 1 AlwaysOn Availability Groups: Local Windows Server Failover Clustering node is online. This is an informational message only. No user action is required.
41142 16 0 The availability replica for availability group '%.*ls' on this instance of SQL Server cannot become the primary replica. One or more databases are not synchronized or have not joined the availability group, or the WSFC cluster was started in Force Quorum mode. If the cluster was started in Force Quorum mode or the availability replica uses the asynchronous-commit mode, consider performing a forced manual failover (with possible data loss). Otherwise, once all local secondary databases are joined and synchronized, you can perform a planned manual failover to this secondary replica (without data loss). For more information, see SQL Server Books Online.
41144 16 0 The local availability replica of availability group '%.*ls' is in a failed state.  The replica failed to read or update the persisted configuration data (SQL Server error: %d).  To recover from this failure, either restart the local Windows Server Failover Clustering (WSFC) service or restart the local instance of SQL Server.

Based on this output, and the output of the following query:

SELECT name, description
FROM sys.dm_xe_objects
WHERE NAME IN (
'alwayson_ddl_executed',
'availability_group_lease_expired',
'availability_replica_automatic_failover_validation',
'availability_replica_manager_state_change',
'availability_replica_state_change',
'error_reported',
'lock_redo_blocked')

We can deduce the following about the Event Session:

CREATE EVENT SESSION [AlwaysOn_health] ON SERVER
–Occurs when AlwaysOn DDL is executed including CREATE, ALTER or DROP
ADD EVENT sqlserver.alwayson_ddl_executed,
–Occurs when there is a connectivity issue between the cluster and the Availability Group resulting
–in a failure to renew the lease
ADD EVENT sqlserver.availability_group_lease_expired,
–Occurs when the failover validates the readiness of replica as a primary. For instance, the failover
–validation will return false when not all databases are synchronized or not joined
ADD EVENT sqlserver.availability_replica_automatic_failover_validation,
–Occurs when the state of the Availability Replica Manager has changed.
ADD EVENT sqlserver.availability_replica_manager_state_change,
–Occurs when the state of the Availability Replica has changed.
ADD EVENT sqlserver.availability_replica_state_change,
–Occurs when an error is reported based on the previously listed table
ADD EVENT sqlserver.error_reported(
    WHERE ([error_number]=(9691) OR [error_number]=(35204) OR [error_number]=(9693) OR [error_number]=(26024) OR [error_number]=(28047) OR [error_number]=(26023) OR [error_number]=(9692) OR [error_number]=(28034) OR [error_number]=(28036) OR [error_number]=(28048) OR [error_number]=(28080) OR [error_number]=(28091) OR [error_number]=(26022) OR [error_number]=(9642) OR [error_number]=(35201) OR [error_number]=(35202) OR [error_number]=(35206) OR [error_number]=(35207) OR [error_number]=(26069) OR [error_number]=(26070) OR [error_number]>(41047) AND [error_number]<(41056) OR [error_number]=(41142) OR [error_number]=(41144) OR [error_number]=(1480) OR [error_number]=(823) OR [error_number]=(824) OR [error_number]=(829) OR [error_number]=(35264) OR [error_number]=(35265))),
–Occurs when the redo thread blocks when trying to acquire a lock.
ADD EVENT sqlserver.lock_redo_blocked
–Writes to the file target for persistence in the system beyond failovers and service restarts
ADD TARGET package0.event_file(SET filename=N'AlwaysOn_health.xel',max_file_size=(5),max_rollover_files=(4))
WITH (MAX_MEMORY=4096 KB, EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS, MAX_DISPATCH_LATENCY=30 SECONDS, MAX_EVENT_SIZE=0 KB, MEMORY_PARTITION_MODE=NONE, TRACK_CAUSALITY=OFF, STARTUP_STATE=ON)
GO

What is really cool is that this Event Session is used by the Availability Groups Dashboard to provide an overall status of the health of the Availability Group in Management Studio.

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.