Tracking Extended Events for a SQL Agent Job

This blog post is courtesy of a question I received from Chris Adkin (Twitter) by email.  Chris was trying to use Extended Events to track session level wait information, but he had a new twist that I’ve never been asked about before, he wanted to track all of the waits for a specific SQL Agent job anytime it executed.  I knew immediately when I read the email that this was definitely possible, but it required some additional information about SQL Agent. 

Initially it might seem like this would be a good scenario to filter on sqlserver.session_id, but that won’t work because the session_id is not guaranteed to be the same across executions.  I thought about using one job step to ALTER the event session, but that won’t be guaranteed to catch everything for the next step in the job.  Then I thought back to a few years ago and I wanted to track long running Agent jobs on my servers, and then I remembered that you can do this with the program_name or in Extended Events, the sqlserver.client_app_name global predicate field.  To demonstrate this, I created a job on my VM that calls a simple WAITFOR DELAY in the first job step and then I ran the job and queried sys.dm_exec_sessions to get the program name.

SQLAgent – TSQL JobStep (Job 0x6F4E4936767BEA4CBE088245356A5C54 : Step 1)

If we take this and compare it to the job_id in msdb.dbo.sysjobs for the job, we can see that the guid for the job_id has been transposed to create the program_name being used:

36494E6F7B764CEABE088245356A5C54

So with this information, we now define a predicate that will allow our Event Session to collect events specifically for our job execution as follows:

DECLARE @JobName NVARCHAR(128) = 'Test';

DECLARE @Predicate NVARCHAR(MAX)='';
DECLARE @sqlcmd NVARCHAR(MAX)='';

SET @sqlcmd ='IF EXISTS(SELECT * FROM sys.server_event_sessions WHERE name=''TrackResourceWaits'')
    DROP EVENT SESSION [TrackResourceWaits] ON SERVER;
CREATE EVENT SESSION [TrackResourceWaits] ON SERVER 
ADD EVENT  sqlos.wait_info
(   WHERE
        (opcode = 1 --End Events Only
            AND duration > 0 -- had to accumulate 100ms of time
            AND ({0})
        ))
ADD TARGET package0.asynchronous_bucketizer
(     SET filtering_event_name=''sqlos.wait_info'',
        source_type=0, -- specifies bucketing on column 
        source=''wait_type'' -- Bucket by wait_type
),
ADD TARGET package0.ring_buffer(SET max_memory=4096)';

SELECT @Predicate = @Predicate + 
    'sqlserver.client_app_name = ''' +
    'SQLAgent - TSQL JobStep (Job 0x'+
    SUBSTRING(job_id,7,2) +
    SUBSTRING(job_id,5,2) +
    SUBSTRING(job_id,3,2) +
    SUBSTRING(job_id,1,2) +
    SUBSTRING(job_id,12,2) +
    SUBSTRING(job_id,10,2) +
    SUBSTRING(job_id,17,2) +
    SUBSTRING(job_id,15,2) +
    SUBSTRING(job_id,20,4) +
    SUBSTRING(job_id,25,12) + 
    ' : Step ' + 
    CAST(step_id AS VARCHAR(3)) + 
    ')'' OR '
FROM( SELECT CAST(j.job_id AS VARCHAR(50)) AS job_id,
            js.step_id
        FROM msdb.dbo.sysjobs AS j
        INNER JOIN msdb.dbo.sysjobsteps AS js
            ON j.job_id = js.job_id
        WHERE j.name = @JobName) AS tab;

-- Append the predicate into the SQL command
SET @sqlcmd = REPLACE(@sqlcmd, '{0}', LEFT(@Predicate, LEN(@Predicate)-3));

-- Create the Event Session
EXEC(@sqlcmd);

Now as I pointed out in my original post about collecting session level wait stats, this doesn’t necessarily provide actionable data since it doesn’t tell us about the other activity on the server that may be causing specific waits to occur.  For example, if you run this and see that there are high I/O waits, but concurrently at the same time there is a DBCC CHECKDB running on the server that is saturating the I/O subsystem, all we know about for the data collected is that there were high I/O waits, it doesn’t tell us why they might exist.  For some wait types this might be very actionable, but I would use caution tracking session or task specific waits because the same waits will not occur on different hardware, or under different conditions on the same exact hardware. 

Identifying High Compile Time Statements from the Plan Cache

If you don’t already know it, I love query the plan cache in SQL Server to identify problems that might exist in a server.  I’ve blogged a number of ways to use plan cache data to identify and diagnose problems the last few years:

Finding what queries in the plan cache use a specific index
Tuning ‘cost threshold for parallelism’ from the Plan Cache
Finding Implicit Column Conversions in the Plan Cache
Digging into the SQL Plan Cache: Finding Missing Indexes

Last week on the forums a question was asked about how to use Extended Events to identify queries that took excessive amounts of time to compile.  What intrigued me was that the person asking the question claimed to have done this type of analysis with SQL Trace, so in 2012, that should mean that we can do it with Extended Events, right?  As Borat would say, not so much!

After seeing the events being used I took a look at the Event map, and while there is a loose mapping between the trace events that were being used to Extended Events, not all the same data is generated.  The primary event in trace that was being used for this is the Performance Statistics event, which is has a trace event id of 165.  If we lookup this event in the sys.trace_xe_event_map table in SQL Server 2012, we’ll find that it maps to three different events in Extended Events; query_cache_removal_statistics, query_pre_execution_showplan, and uncached_sql_batch_statistics. 

Immediately, I said to myself, “No, this isn’t possible in Extended Events.” simply because the overhead of turning on query_pre_execution_showplan, will kill performance of any production server (see http://connect.microsoft.com/SQLServer/feedback/details/732870/sqlserver-query-post-execution-showplan-performance-impact which is for the query_post_execution_showplan event, but all of the showplan events have the same impact).  Looking at the columns for the other two events, left me basically in the same mindset, you can’t do this with Extended Events.  The uncached_sql_batch_statistics event only captures the statement that wasn’t cached, and the query_cache_removal_statistics only gives you the execution statistics, not the parse and compile times.  Even if you were to turn on the query_pre_execution_showplan event, it doesn’t provide you with the CPU and duration columns for output, so it is not going to be useful here.

To be honest, I didn’t think that this was going to be possible, but then I thought about some of the fun we had in our Immersion Event week two in London looking at query compile issues, and I remembered that I wrote a parser between modules to pull compile time information from the showplans in the plan cache.  After searching through some scratch code files in my user profile I found a relevant one and started playing around a little with the XML parsing.  The result is the below query, which identifies the TOP 10 most expensive to compile statements in the cache currently, and includes the query stats for the statement, based on the query_hash, though it could be easily modified to use query_plan_hash to match on instead.

-- Find high compile resource plans in the plan cache
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED;
WITH XMLNAMESPACES 
(DEFAULT 'http://schemas.microsoft.com/sqlserver/2004/07/showplan')
SELECT TOP 10
CompileTime_ms,
CompileCPU_ms,
CompileMemory_KB,
qs.execution_count,
qs.total_elapsed_time/1000 AS duration_ms,
qs.total_worker_time/1000 as cputime_ms,
(qs.total_elapsed_time/qs.execution_count)/1000 AS avg_duration_ms,
(qs.total_worker_time/qs.execution_count)/1000 AS avg_cputime_ms,
qs.max_elapsed_time/1000 AS max_duration_ms,
qs.max_worker_time/1000 AS max_cputime_ms,
SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
(CASE qs.statement_end_offset
WHEN -1 THEN DATALENGTH(st.text)
ELSE qs.statement_end_offset
END - qs.statement_start_offset) / 2 + 1) AS StmtText,
query_hash,
query_plan_hash
FROM
(
SELECT 
c.value('xs:hexBinary(substring((@QueryHash)[1],3))', 'varbinary(max)') AS QueryHash,
c.value('xs:hexBinary(substring((@QueryPlanHash)[1],3))', 'varbinary(max)') AS QueryPlanHash,
c.value('(QueryPlan/@CompileTime)[1]', 'int') AS CompileTime_ms,
c.value('(QueryPlan/@CompileCPU)[1]', 'int') AS CompileCPU_ms,
c.value('(QueryPlan/@CompileMemory)[1]', 'int') AS CompileMemory_KB,
qp.query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(cp.plan_handle) AS qp
CROSS APPLY qp.query_plan.nodes('ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple') AS n(c)
) AS tab
JOIN sys.dm_exec_query_stats AS qs
ON tab.QueryHash = qs.query_hash
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY CompileTime_ms DESC
OPTION(RECOMPILE, MAXDOP 1);

Running this against a couple of test servers, I found a couple of queries that had 12 second compile times with 3 second execution durations.  Now the challenge becomes testing if the query can be simplified to reduce the time it takes the optimizer to compile it, without reducing execution performance.  The above code should work on any version of SQL Server from 2005 Service Pack 2 onwards, since the CompileTime, Compile CPU, and CompileMemory attributes were added in Service Pack 2.  I don’t have a SQL Server 2005 instance to test it on, but I did test this on SQL Server 2008 and 2012.

EDIT:

This only works on 2008 and higher because the query fingerprints in dm_exec_query_stats weren't in SQL Server 2005.  Using plan_handle will get you close, but that only tells you the plan had a high compile cost statement in it.  You will have to dig into the plan to see which specific statement, which I am trying to avoid with the code above.  I'll have to figure out another way to match these up based on the XML and blog the 2005 solution. 

New Article on Tracking Tempdb Contention in SQL Server 2012 with Extended Events

My latest article on Simple-Talk was published recently.  In this article I dig into how to track allocation bitmap contention in tempdb in SQL Server 2012 using Extended Events.  The method described in this article redefines how you troubleshoot this type of problem, leveraging the advanced diagnostic capabilities of Extended Events to track the problem.

http://www.simple-talk.com/sql/database-administration/optimizing-tempdb-configuration-with-sql-server-2012-extended-events/