sqlskills-logo-2015-white.png

Under the covers with sp_server_diagnostics data (Part II)

This post covers examples from Denali CTP3, version 11.0.1440.

In my last post, “Under The Covers With Sp_server_diagnostics Data (Part I)” I started an exploration of the output surfaced from the sp_server_diagnostics stored procedure.  In this post, I’d like to continue this exploration, having a look specifically at the “query_processing”, “io_subsystem” and “events” component data.

Just a reminder of why we would be looking at this in the first place?  Quite simply, I see this output as a means of gathering out-of-the-box SQL Server health information in absence of (or in compliment of) existing scripts or processes running on the SQL Server instance.

So continuing on the exploration, the query_processing component data can actually be broken down into five useful groupings of statistics:

·        Query processing statistics

·        Top non-preemptive wait statistics by count

·        Top non-preemptive wait statistics by duration

·        Top preemptive waits by count

·        Top preemptive waits by duration

The following is an example of the query processing statistics data:

<queryProcessing maxWorkers="512" workersCreated="46" workersIdle="17" tasksCompletedWithinInterval="12" pendingTasks="0" oldestPendingTaskWaitingTime="0" hasUnresolvableDeadlockOccurred="0" hasDeadlockedSchedulersOccurred="0" trackingNonYieldingScheduler="0x0">

 

The naming conventions are intuitive – returning a point of time measurement of max workers, actual workers created, idle workers, pending tasks, oldest pending task waiting time (I’ll definitely be paying attention to this one) and scheduler deadlock and non-yielding counts.

 

For fans of evaluating wait statistics (and hopefully that’s most SQL Server folks these days) – we have a four sets of Top 10 wait stats results.  The following is the output of the top non-preemptive waits by count (and specifically – by the “waits” attribute):

 

<byCount>

        <wait waitType="HADR_FILESTREAM_IOMGR_IOCOMPLETION" waits="1884" averageWaitTime="967892" maxWaitTime="800" />

        <wait waitType="PAGEIOLATCH_SH" waits="534" averageWaitTime="1329" maxWaitTime="37" />

        <wait waitType="IO_COMPLETION" waits="71" averageWaitTime="128" maxWaitTime="4" />

        <wait waitType="THREADPOOL" waits="53" averageWaitTime="209" maxWaitTime="25" />

        <wait waitType="WRITELOG" waits="35" averageWaitTime="58" maxWaitTime="17" />

        <wait waitType="PERFORMANCE_COUNTERS_RWLOCK" waits="18" averageWaitTime="3" maxWaitTime="1" />

        <wait waitType="ASYNC_NETWORK_IO" waits="10" averageWaitTime="125" maxWaitTime="118" />

        <wait waitType="PAGEIOLATCH_EX" waits="8" averageWaitTime="14" maxWaitTime="6" />

        <wait waitType="SLEEP_DBSTARTUP" waits="8" averageWaitTime="779" maxWaitTime="175" />

        <wait waitType="CLR_AUTO_EVENT" waits="8" averageWaitTime="8527" maxWaitTime="4085" />

      </byCount>

 

We also have the non-preemptive waits by duration (and specifically – by the average wait time):

 

    <byDuration>

        <wait waitType="HADR_FILESTREAM_IOMGR_IOCOMPLETION" waits="1884" averageWaitTime="967892" maxWaitTime="800" />

        <wait waitType="CLR_AUTO_EVENT" waits="8" averageWaitTime="8527" maxWaitTime="4085" />

        <wait waitType="FT_IFTSHC_MUTEX" waits="6" averageWaitTime="4381" maxWaitTime="2977" />

        <wait waitType="SLEEP_MASTERDBREADY" waits="1" averageWaitTime="1836" maxWaitTime="1836" />

        <wait waitType="PWAIT_ALL_COMPONENTS_INITIALIZED" waits="3" averageWaitTime="1558" maxWaitTime="541" />

        <wait waitType="PAGEIOLATCH_SH" waits="534" averageWaitTime="1329" maxWaitTime="37" />

        <wait waitType="LCK_M_S" waits="5" averageWaitTime="1146" maxWaitTime="336" />

        <wait waitType="SLEEP_DBSTARTUP" waits="8" averageWaitTime="779" maxWaitTime="175" />

        <wait waitType="SLEEP_DCOMSTARTUP" waits="1" averageWaitTime="377" maxWaitTime="377" />

        <wait waitType="FFT_RECOVERY" waits="5" averageWaitTime="241" maxWaitTime="118" />

      </byDuration>

 

We have two more result sets for isolating out preemptive waits that are identical in format to the non-preemptive results, so I won’t repeat them here.  And while we can pull aggregated wait statistics from other sources – it is convenient to have this data at-the-ready.

 

Next we have the I/O component data which is far less verbose in my sample output:

 

<ioSubsystem ioLatchTimeouts="0" intervalLongIos="0" totalLongIos="0">

  <longestPendingRequests />

</ioSubsystem>

 

There we can see information on latch timeouts and long I/Os.  My test system didn’t have a value for the longest pending requests – and that is something I’ll look for later – but if you have a sample on your own Denali system, please post to the comments.   I’d also be curious to compare the value of this data against what gets exposed in sys.dm_io_pending_io_requests  and see which has more value from a troubleshooting perspective.

 

The last component to cover from sp_server_diagnostics data is the “events” output (here is the vanilla output without events for starters):

 

<events>

  <session startTime="2011-10-03T14:11:52.380" droppedEvents="0" largestDroppedEvent="0">

    <RingBufferTarget truncated="0" processingTime="0" totalEventsProcessed="0" eventCount="0" droppedCount="0" memoryUsed="0" />

  </session>

</events>

 

BOL states that this will expose events “of interest” recorded on the SQL Server instance – similar to what you could pull from the ring buffer.  So when testing a login failure by putting in a bad login identifier, I saw three events raised – one being “security_error_ring_buffer_recorded” and then the others being two “connectivity_ring_buffer_recorded” events.  The first connectivity event was for an event type of “Error” and another of “ConnClose”. 

The data exposed is pretty chatty – so you might wonder (as compared to the wait statistics) – whether it is worth sifting through.  One example is for security error ring buffer data – which in my testing did show a few bread crumbs of interesting troubleshooting data.  For example, useful data on the error number, state and remote host:

 

<data name="sni_consumer_error">

          <type name="uint32" package="package0" />

          <value>18456</value>

        </data>

<data name="state">

          <type name="uint32" package="package0" />

          <value>5</value>

        </data>

<data name="remote_host">

          <type name="ansi_string" package="package0" />

          <value>&lt;local machine&gt;</value>

        </data>

 

So from this data we know that an error 18456 and state 5 occurred.  We also know that state 5 represents an invalid user id.  Testing a bad password, I saw a state of 8, which does indeed represent a password mismatch (see the post Understanding "login failed" (Error 18456) error messages in SQL Server 2005 for more on this tangent).

So that’s an initial tour of the data exposed by sp_server_diagnostics.  As Denali starts rolling out more broadly, I anticipate several parsing scripts being generated by the SQL community.  There are several use-cases that I can imagine from this data, so I’ll keep watching this space and will report back the more valuable uses as I find them.

 

** Update – Aaron Bertrand wrote a very comprehensive 18456 reference in the following post, "Troubleshooting Error 18456".

3 thoughts on “Under the covers with sp_server_diagnostics data (Part II)

Comments are closed.

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.