While setting up my new Availability Group using SQL Server 2012 RC0 tonight, I noticed an interesting new addition to Extended Events associated with Availability Group configuration in the Release Candidate. When you setup an Availability Group in RC0, another default Event Session is created on the servers that participate in the Availability Group to provide monitoring of the health of the Availability Group overall. The definition of the monitoring session is as follows:
CREATE EVENT SESSION [AlwaysOn_health] ON SERVER
ADD EVENT sqlserver.alwayson_ddl_executed,
ADD EVENT sqlserver.availability_group_lease_expired,
ADD EVENT sqlserver.availability_replica_automatic_failover_validation,
ADD EVENT sqlserver.availability_replica_manager_state_change,
ADD EVENT sqlserver.availability_replica_state_change,
ADD EVENT sqlserver.error_reported(
WHERE ([error_number]=(9691) OR [error_number]=(35204) OR [error_number]=(9693) OR [error_number]=(26024) OR [error_number]=(28047) OR [error_number]=(26023) OR [error_number]=(9692) OR [error_number]=(28034) OR [error_number]=(28036) OR [error_number]=(28048) OR [error_number]=(28080) OR [error_number]=(28091) OR [error_number]=(26022) OR [error_number]=(9642) OR [error_number]=(35201) OR [error_number]=(35202) OR [error_number]=(35206) OR [error_number]=(35207) OR [error_number]=(26069) OR [error_number]=(26070) OR [error_number]>(41047) AND [error_number]<(41056) OR [error_number]=(41142) OR [error_number]=(41144) OR [error_number]=(1480) OR [error_number]=(823) OR [error_number]=(824) OR [error_number]=(829) OR [error_number]=(35264) OR [error_number]=(35265))),
ADD EVENT sqlserver.lock_redo_blocked
ADD TARGET package0.event_file(SET filename=N'AlwaysOn_health.xel',max_file_size=(5),max_rollover_files=(4))
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=ON)
GO
This Extended Event Session monitors a number of critical events in the system but one of the problems with figuring out what exactly this Event Session is monitoring is to figure out what all the predicate values on the sqlserver.error_reported event are actually firing on. To that aspect of things, we can do a quick reuse of the predicate on this event be doing a replace SSMS on the [error_number] value with a replace for an alias to a query to sys.messages, on the message_id column from the DMV as follows:
SELECT message_id, severity, is_event_logged, text
FROM sys.messages AS m
WHERE m.language_id = SERVERPROPERTY('LCID')
AND (m.message_id=(9691)
OR m.message_id=(35204)
OR m.message_id=(9693)
OR m.message_id=(26024)
OR m.message_id=(28047)
OR m.message_id=(26023)
OR m.message_id=(9692)
OR m.message_id=(28034)
OR m.message_id=(28036)
OR m.message_id=(28048)
OR m.message_id=(28080)
OR m.message_id=(28091)
OR m.message_id=(26022)
OR m.message_id=(9642)
OR m.message_id=(35201)
OR m.message_id=(35202)
OR m.message_id=(35206)
OR m.message_id=(35207)
OR m.message_id=(26069)
OR m.message_id=(26070)
OR m.message_id>(41047)
AND m.message_id<(41056)
OR m.message_id=(41142)
OR m.message_id=(41144)
OR m.message_id=(1480)
OR m.message_id=(823)
OR m.message_id=(824)
OR m.message_id=(829)
OR m.message_id=(35264)
OR m.message_id=(35265)
)
This will give us a list of the error messages that the Event Session will actually fire events for:
message_id |
severity | is_event_logged | text |
823 | 24 | 1 | The operating system returned error %ls to SQL Server during a %S_MSG at offset %#016I64x in file '%ls'. Additional messages in the SQL Server error log and system event log may provide more detail. This is a severe system-level error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online. |
824 | 24 | 1 | SQL Server detected a logical consistency-based I/O error: %ls. It occurred during a %S_MSG of page %S_PGID in database ID %d at offset %#016I64x in file '%ls'. Additional messages in the SQL Server error log or system event log may provide more detail. This is a severe error condition that threatens database integrity and must be corrected immediately. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online. |
829 | 21 | 1 | Database ID %d, Page %S_PGID is marked RestorePending, which may indicate disk corruption. To recover from this state, perform a restore. |
1480 | 10 | 0 | The %S_MSG database "%.*ls" is changing roles from "%ls" to "%ls" because the mirroring session or availability group failed over due to %S_MSG. This is an informational message only. No user action is required. |
9642 | 16 | 0 | An error occurred in a Service Broker/Database Mirroring transport connection endpoint, Error: %i, State: %i. (Near endpoint role: %S_MSG, far endpoint address: '%.*hs') |
9691 | 10 | 0 | The %S_MSG endpoint has stopped listening for connections. |
9692 | 16 | 0 | The %S_MSG endpoint cannot listen on port %d because it is in use by another process. |
9693 | 16 | 0 | The %S_MSG endpoint cannot listen for connections due to the following error: '%.*ls'. |
26022 | 10 | 1 | Server is listening on [ %hs <%hs> %d]. |
26023 | 16 | 1 | Server TCP provider failed to listen on [ %hs <%hs> %d]. Tcp port is already in use. |
26024 | 16 | 1 | Server failed to listen on %hs <%hs> %d. Error: %#x. To proceed, notify your system administrator. |
26069 | 10 | 1 | Started listening on virtual network name '%ls'. No user action is required. |
26070 | 10 | 1 | Stopped listening on virtual network name '%ls'. No user action is required. |
28034 | 10 | 0 | Connection handshake failed. The login '%.*ls' does not have CONNECT permission on the endpoint. State %d. |
28036 | 10 | 0 | Connection handshake failed. The certificate used by this endpoint was not found: %S_MSG. Use DBCC CHECKDB in master database to verify the metadata integrity of the endpoints. State %d. |
28047 | 10 | 1 | %S_MSG login attempt failed with error: '%.*ls'. %.*ls |
28048 | 10 | 1 | %S_MSG login attempt by user '%.*ls' failed with error: '%.*ls'. %.*ls |
28080 | 10 | 0 | Connection handshake failed. The %S_MSG endpoint is not configured. State %d. |
28091 | 10 | 0 | Starting endpoint for %S_MSG with no authentication is not supported. |
35201 | 10 | 0 | A connection timeout has occurred while attempting to establish a connection to availability replica '%ls' with id [%ls]. Either a networking or firewall issue exists, or the endpoint address provided for the replica is not the database mirroring endpoint of the host server instance. |
35202 | 10 | 0 | A connection for availability group '%ls' from availability replica '%ls' with id [%ls] to '%ls' with id [%ls] has been successfully established. This is an informational message only. No user action is required. |
35204 | 10 | 0 | The connection between server instances '%ls' with id [%ls] and '%ls' with id [%ls] has been disabled because the database mirroring endpoint was either disabled or stopped. Restart the endpoint by using the ALTER ENDPOINT Transact-SQL statement with STATE = STARTED. |
35206 | 10 | 0 | A connection timeout has occurred on a previously established connection to availability replica '%ls' with id [%ls]. Either a networking or a firewall issue exists or the availability replica has transitioned to the resolving role. |
35207 | 16 | 0 | Connection attempt on availability group id '%ls' from replica id '%ls' to replica id '%ls' failed because of error %d, severity %d, state %d. |
35264 | 10 | 0 | AlwaysOn Availability Groups data movement for database '%.*ls' has been suspended for the following reason: "%S_MSG" (Source ID %d; Source string: '%.*ls'). To resume data movement on the database, you will need to resume the database manually. For information about how to resume an availability database, see SQL Server Books Online. |
35265 | 10 | 0 | AlwaysOn Availability Groups data movement for database '%.*ls' has been resumed. This is an informational message only. No user action is required. |
41048 | 10 | 1 | AlwaysOn Availability Groups: Local Windows Server Failover Clustering service has become unavailable. This is an informational message only. No user action is required. |
41049 | 10 | 1 | AlwaysOn Availability Groups: Local Windows Server Failover Clustering node is no longer online. This is an informational message only. No user action is required. |
41050 | 10 | 1 | AlwaysOn Availability Groups: Waiting for local Windows Server Failover Clustering service to start. This is an informational message only. No user action is required. |
41051 | 10 | 1 | AlwaysOn Availability Groups: Local Windows Server Failover Clustering service started. This is an informational message only. No user action is required. |
41052 | 10 | 1 | AlwaysOn Availability Groups: Waiting for local Windows Server Failover Clustering node to start. This is an informational message only. No user action is required. |
41053 | 10 | 1 | AlwaysOn Availability Groups: Local Windows Server Failover Clustering node started. This is an informational message only. No user action is required. |
41054 | 10 | 1 | AlwaysOn Availability Groups: Waiting for local Windows Server Failover Clustering node to come online. This is an informational message only. No user action is required. |
41055 | 10 | 1 | AlwaysOn Availability Groups: Local Windows Server Failover Clustering node is online. This is an informational message only. No user action is required. |
41142 | 16 | 0 | The availability replica for availability group '%.*ls' on this instance of SQL Server cannot become the primary replica. One or more databases are not synchronized or have not joined the availability group, or the WSFC cluster was started in Force Quorum mode. If the cluster was started in Force Quorum mode or the availability replica uses the asynchronous-commit mode, consider performing a forced manual failover (with possible data loss). Otherwise, once all local secondary databases are joined and synchronized, you can perform a planned manual failover to this secondary replica (without data loss). For more information, see SQL Server Books Online. |
41144 | 16 | 0 | The local availability replica of availability group '%.*ls' is in a failed state. The replica failed to read or update the persisted configuration data (SQL Server error: %d). To recover from this failure, either restart the local Windows Server Failover Clustering (WSFC) service or restart the local instance of SQL Server. |
Based on this output, and the output of the following query:
SELECT name, description
FROM sys.dm_xe_objects
WHERE NAME IN (
'alwayson_ddl_executed',
'availability_group_lease_expired',
'availability_replica_automatic_failover_validation',
'availability_replica_manager_state_change',
'availability_replica_state_change',
'error_reported',
'lock_redo_blocked')
We can deduce the following about the Event Session:
CREATE EVENT SESSION [AlwaysOn_health] ON SERVER
–Occurs when AlwaysOn DDL is executed including CREATE, ALTER or DROP
ADD EVENT sqlserver.alwayson_ddl_executed,
–Occurs when there is a connectivity issue between the cluster and the Availability Group resulting
–in a failure to renew the lease
ADD EVENT sqlserver.availability_group_lease_expired,
–Occurs when the failover validates the readiness of replica as a primary. For instance, the failover
–validation will return false when not all databases are synchronized or not joined
ADD EVENT sqlserver.availability_replica_automatic_failover_validation,
–Occurs when the state of the Availability Replica Manager has changed.
ADD EVENT sqlserver.availability_replica_manager_state_change,
–Occurs when the state of the Availability Replica has changed.
ADD EVENT sqlserver.availability_replica_state_change,
–Occurs when an error is reported based on the previously listed table
ADD EVENT sqlserver.error_reported(
WHERE ([error_number]=(9691) OR [error_number]=(35204) OR [error_number]=(9693) OR [error_number]=(26024) OR [error_number]=(28047) OR [error_number]=(26023) OR [error_number]=(9692) OR [error_number]=(28034) OR [error_number]=(28036) OR [error_number]=(28048) OR [error_number]=(28080) OR [error_number]=(28091) OR [error_number]=(26022) OR [error_number]=(9642) OR [error_number]=(35201) OR [error_number]=(35202) OR [error_number]=(35206) OR [error_number]=(35207) OR [error_number]=(26069) OR [error_number]=(26070) OR [error_number]>(41047) AND [error_number]<(41056) OR [error_number]=(41142) OR [error_number]=(41144) OR [error_number]=(1480) OR [error_number]=(823) OR [error_number]=(824) OR [error_number]=(829) OR [error_number]=(35264) OR [error_number]=(35265))),
–Occurs when the redo thread blocks when trying to acquire a lock.
ADD EVENT sqlserver.lock_redo_blocked
–Writes to the file target for persistence in the system beyond failovers and service restarts
ADD TARGET package0.event_file(SET filename=N'AlwaysOn_health.xel',max_file_size=(5),max_rollover_files=(4))
WITH (MAX_MEMORY=4096 KB, EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS, MAX_DISPATCH_LATENCY=30 SECONDS, MAX_EVENT_SIZE=0 KB, MEMORY_PARTITION_MODE=NONE, TRACK_CAUSALITY=OFF, STARTUP_STATE=ON)
GO
What is really cool is that this Event Session is used by the Availability Groups Dashboard to provide an overall status of the health of the Availability Group in Management Studio.