This post covers examples from Denali CTP3, version 11.0.1440.
In this post I’ll be discussing the SQL Server health related output from Denali’s sys.sp_server_diagnostics system stored procedure. This system stored procedure is used to check Denali SQL Server health status behind-the-scenes for both failover cluster instances and availability groups.
While this procedure is used in service of base functionality, you can still call sys.sp_server_diagnostics directly yourself and find some pretty useful information embedded in the [data] column. This is information that you could gather using other methods as well, but if you’re in a situation where you don’t have access to your script toolbox or if there are no existing data collection processes running on the SQL Server instance you have a quick way to snapshot the health of the SQL Server instance at a specific point in time.
While this procedure is documented in Books Online, the output it returns may change in the future and so you should avoid creating application dependencies on it. With that said, I think this is a great out-of-the-box tool and I can definitely see using it in the future to supplement other data sources or serve as a quick assessment in a pinch…
For this post I’ll capture a single execution of the stored procedure into a new table in tempdb:
CREATE TABLE dbo.tmp_sp_server_diagnostics
As an aside, I’m using the xml data type for the “data” column in the tmp_sp_server_diagnostics table – but that is not its actual data type according to Books Online (which states that it is varchar(max)). Aaron Bertrand (b | t) actually posted a Microsoft Connect item regarding this observation several months ago, discussing the merits of using the xml data type instead (one advantage being that the format is easier to display in SSMS). And if I designate the [data] column as xml for the destination output table, it implicitly converts upon INSERT just fine.
The following query returns the 5 rows (without doing anything fancy with the XML value):
SELECT create_time, component_name, state, state_desc, data
create_time component_name state state_desc
2011-10-04 07:57:07.503 system 1 clean
2011-10-04 07:57:07.503 resource 1 clean
2011-10-04 07:57:07.503 query_processing 1 clean
2011-10-04 07:57:07.503 io_subsystem 1 clean
2011-10-04 07:57:07.503 events 0 unknown
I intentionally left off the xml [data] column off the previous result set as this is what I’ll be walking through over this post and the next. Each row represents a different monitored “component” – covering system, resource, query processing, IO subsystem and events. I won’t recap BOL on what each of the rows reflects – but instead I’ll jump into the output and call out why this is useful.
Since the [data] column was captured in the xml data type, clicking on any value in SQL Server Management Studio should bring you to the output in a separate, formatted window. So on my system, the first row was for the “system” component, which had a value as follows:
<system sqlCpuUtilization="0" systemCpuUtilization="4" pageFaults="2654" nonYieldingTasksReported="0" intervalDumpRequests="0" totalDumpRequests="0" writeAccessViolationCount="0" isAccessViolationOccurred="0" latchWarnings="0" sickSpinlockTypeAfterAv="none" sickSpinlockType="none" spinlockBackoffs="0"/>
Based on the naming conventions we’re looking at – the following categories of information for the “system” component data get surfaced:
· CPU utilization (sqlCpuUtilization, systemCpuUtilization)
· Page Faults (pageFaults)
· Non Yielding tasks (nonYieldingTasksReported)
· Dumps (intervalDumpRequests, totalDumpRequests)
· Access Violations (writeAccessViolationCount, isAccessViolationOccurred )
· Latches (latchWarnings)
· Spinlock activity (spinlockBackoffs, sickSpinlockType, sickSpinlockTypeAfterAv )
So basically we’re looking at a bundling of occurrence counts and current values that would be quite helpful to a support engineer. I’ll be interested to find out what rules are used to assess “clean” state for the system component. I imagine this can be reversed engineered through various scenarios, but hopefully the logic will be revealed directly. The actual text of sp_server_diagnostics is “(server internal)” if you try looking at the definition.
Moving along, for the “resource” component data there are three areas of interest in the XML output. The first is around general resource monitoring:
<resource lastNotification="RESOURCE_MEMPHYSICAL_HIGH" outOfMemoryExceptions="0" isAnyPoolOutOfMemory="0" processOutOfMemoryPeriod="0">
So the lastNotification looks to be the memory manager’s latest report, with RESOURCE_MEMPHYSICAL_HIGH telling us that SQL Server can further consume more memory (in this example). The other attributes cover out-of-memory statistics.
The second section of information from the “resource” component data is a memory report around Process and System counters and counts:
<memoryReport name="Process/System Counts" unit="Value">
<entry description="Available Physical Memory" value="286638080" />
<entry description="Available Virtual Memory" value="8792557727744" />
<entry description="Available Paging File" value="1205690368" />
<entry description="Working Set" value="184197120" />
<entry description="Percent of Committed Memory in WS" value="100" />
<entry description="Page Faults" value="70284" />
<entry description="System physical memory high" value="1" />
<entry description="System physical memory low" value="0" />
<entry description="Process physical memory low" value="0" />
<entry description="Process virtual memory low" value="0" />
Again this is an example where we are pulling data (such as available physical and virtual memory) that could be pulled via other methods and sources. But in absence of other ready-to-use queries, using sp_server_diagnostics you have a nice sampling of the data as of a point in time.
The third memory report from the “resource” component data covers Memory Manager counter values:
<memoryReport name="Memory Manager" unit="KB">
<entry description="VM Reserved" value="3098256" />
<entry description="VM Committed" value="162896" />
<entry description="Locked Pages Allocated" value="0" />
<entry description="Large Pages Allocated" value="0" />
<entry description="Emergency Memory" value="1024" />
<entry description="Emergency Memory In Use" value="16" />
<entry description="Target Committed" value="389912" />
<entry description="Current Committed" value="162896" />
<entry description="Pages Allocated" value="109672" />
<entry description="Pages Reserved" value="0" />
<entry description="Pages Free" value="3464" />
<entry description="Pages In Use" value="122536" />
<entry description="Page Alloc Potential" value="873184" />
<entry description="NUMA Growth Phase" value="0" />
<entry description="Last OOM Factor" value="0" />
<entry description="Last OS Error" value="0" />
In this case, although this summarized data is good, I do think collecting DBCC MEMORYSTATUS would still be useful (more detailed) along with perfmon data so I can see memory trends over time. As with any new feature, the real test of utility is in using it during an actual issue and seeing if it helps.
In Part II of this post I’ll cover the “query_processing” (my favorite out of the bunch), “io_subsystem” and “events” component data. At a future point I’d also like to share queries for parsing out the data and also comparing samplings over time, particularly since sp_server_diagnostics has a built-in repeat interval capability, enabling it to sample data every X number of seconds.