Tracking expensive queries with extended events in SQL 2008

As part of the 2008 DBA class we’re teaching down here in Melbourne, I did a demo of using predicates and file targets with extended events, so I want to blog the script for people to play with.

For background info on extended events see:

In this scenario, I’d like to track queries that make heavy usage of the CPU on my system.

First off I’ll create a test database to play with.

USE MASTER;
GO
IF DATABASEPROPERTYEX ('production', 'Version') > 0 DROP DATABASE production;
GO

CREATE DATABASE production;
GO
USE production;
GO

CREATE TABLE t1 (c1 INT IDENTITY, c2 UNIQUEIDENTIFIER ROWGUIDCOL DEFAULT NEWID(), c3 CHAR (5000) DEFAULT 'a');
CREATE CLUSTERED INDEX t1_CL ON t1 (c1);
CREATE NONCLUSTERED INDEX t1_NCL ON t1 (c2);
GO

SET NOCOUNT ON;
INSERT INTO t1 DEFAULT VALUES;
GO 1000

-- Get the database ID to plug into the event session
SELECT DB_ID ('production');
GO

Notice that I grabbed the database ID of the new database I created, as I’ll need that when defining the extended event predicate.

IF EXISTS (SELECT * FROM sys.server_event_sessions WHERE name = 'EE_ExpensiveQueries')
DROP EVENT SESSION EE_ExpensiveQueries ON SERVER;
GO

CREATE EVENT SESSION EE_ExpensiveQueries ON SERVER
ADD EVENT sqlserver.sql_statement_completed
(ACTION (sqlserver.sql_text, sqlserver.plan_handle)
WHERE sqlserver.database_id = 18 /*DBID*/  AND cpu_time > 10000 /*total ms of CPU time*/)
ADD TARGET package0.asynchronous_file_target
(SET FILENAME = N'C:\SQLskills\EE_ExpensiveQueries.xel', METADATAFILE = N'C:\SQLskills\EE_ExpensiveQueries.xem')
WITH (max_dispatch_latency = 1 seconds);
GO

I’m monitoring the event that fires whenever a statement completes, and I’m filtering by database ID (make sure you plug in the correct database ID when you try this yourself) and by the number of milliseconds of CPU time the statement used. Note: when using predicates you should always make sure that you order the predicate tests such that the most restrictive predicates (those most likely to evaluate to false) are first in the list, so the predicate evaluation ‘short-circuits’ as quickly as possible (a standard programming practice with boolean logic). I’m also using a file target, just to show how it can be done, instead of the ring buffer which is commonly used for extended events demos.
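
Since the database ID has to be hard-coded as a literal in the event session's predicate, one way to avoid plugging in the wrong value is to build the DDL with dynamic SQL. Here's a minimal sketch (my addition, not part of the original demo) of an alternative way to create the same session, assuming the same session name, file paths, and CPU threshold as above:

-- Sketch only: build the CREATE EVENT SESSION statement dynamically so the
-- database ID doesn't have to be typed in by hand.
-- (If the session created above already exists, drop it first.)
DECLARE @dbid INT = DB_ID (N'production');
DECLARE @sql NVARCHAR (MAX) = N'
CREATE EVENT SESSION EE_ExpensiveQueries ON SERVER
ADD EVENT sqlserver.sql_statement_completed
(ACTION (sqlserver.sql_text, sqlserver.plan_handle)
WHERE sqlserver.database_id = ' + CONVERT (NVARCHAR (10), @dbid) + N' AND cpu_time > 10000)
ADD TARGET package0.asynchronous_file_target
(SET FILENAME = N''C:\SQLskills\EE_ExpensiveQueries.xel'', METADATAFILE = N''C:\SQLskills\EE_ExpensiveQueries.xem'')
WITH (max_dispatch_latency = 1 seconds);';
EXEC (@sql);
GO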

Now I’m going to turn on the event session and do some queries in the production database.

ALTER EVENT SESSION EE_ExpensiveQueries ON SERVER STATE = START;
GO

USE production;
GO

SELECT COUNT (*) FROM t1 WHERE c1 > 500;
GO

SELECT SUM (c1) FROM t1 WHERE c3 LIKE 'a';
GO

ALTER INDEX t1_CL ON t1 REORGANIZE;
GO

ALTER INDEX t1_CL ON t1 REBUILD;
GO

That should have generated some long-running queries. Now I'll switch into the context of master (so that the queries against the event session's output don't themselves get captured by the session) and see what I've captured.

USE master;
GO

SELECT COUNT (*) FROM sys.fn_xe_file_target_read_file
('C:\SQLskills\EE_ExpensiveQueries*.xel', 'C:\SQLskills\EE_ExpensiveQueries*.xem', NULL, NULL);
GO

In this case I’ve got 3 entries. I can pull these out using the following code:

SELECT data FROM
(SELECT CONVERT (XML, event_data) AS data FROM sys.fn_xe_file_target_read_file
('C:\SQLskills\EE_ExpensiveQueries*.xel', 'C:\SQLskills\EE_ExpensiveQueries*.xem', NULL, NULL)
) entries;
GO

But that just returns three raw XML blobs, which aren't easy to read.

What’s more useful is to pull everything out of the XML blob programmatically using the code below:

SELECT
    data.value (
        '(/event[@name="sql_statement_completed"]/@timestamp)[1]', 'DATETIME') AS [Time],
    data.value (
        '(/event/data[@name="cpu"]/value)[1]', 'INT') AS [CPU (ms)],
    CONVERT (FLOAT, data.value ('(/event/data[@name="duration"]/value)[1]', 'BIGINT')) / 1000000
        AS [Duration (s)],
    data.value (
        '(/event/action[@name="sql_text"]/value)[1]', 'VARCHAR(MAX)') AS [SQL Statement],
    '0x' + data.value ('(/event/action[@name="plan_handle"]/value)[1]', 'VARCHAR(100)') AS [Plan Handle]
FROM
    (SELECT CONVERT (XML, event_data) AS data FROM sys.fn_xe_file_target_read_file
        ('C:\SQLskills\EE_ExpensiveQueries*.xel', 'C:\SQLskills\EE_ExpensiveQueries*.xem', NULL, NULL)
    ) entries
ORDER BY [Time] DESC;
GO

Time                    CPU (ms) Duration (s) SQL Statement                              Plan Handle
----------------------- -------- ------------ ------------------------------------------ --------------------------------------------------
2009-10-16 17:59:29.623 30       1.214875     ALTER INDEX t1_CL ON t1 REBUILD;           0x06001000EB672A07B8C0C807000000000000000000000000
2009-10-16 17:59:28.407 20       0.024076     ALTER INDEX t1_CL ON t1 REORGANIZE;        0x0600100003594903B8C0C807000000000000000000000000
2009-10-16 17:59:28.343 51       0.045144     SELECT SUM (c1) FROM t1 WHERE c3 LIKE 'a'; 0x06001000FAF5B11EB820C307000000000000000000000000

Now I can plug one of the plan handles into sys.dm_exec_query_plan to get the graphical query plan:

SELECT [query_plan] FROM sys.dm_exec_query_plan (0x06001000FAF5B11EB820C307000000000000000000000000);
GO

And clicking the resulting XML link in SSMS opens the graphical query plan.

And now I can tweak the production workload to potentially behave better.
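
If you don't want to copy and paste plan handles by hand, here's a rough sketch (my addition, not part of the original demo) that joins the parsed plan handles straight to sys.dm_exec_query_plan. It uses CONVERT with style 1, available from SQL Server 2008 onwards, to turn the '0x...' string back into a VARBINARY plan handle, and it only returns plans that are still in cache:

-- Sketch only: pull the cached plans for all captured statements in one pass.
SELECT xe.[SQL Statement], qp.[query_plan]
FROM
    (SELECT
        data.value ('(/event/action[@name="sql_text"]/value)[1]', 'VARCHAR(MAX)') AS [SQL Statement],
        CONVERT (VARBINARY (64),
            '0x' + data.value ('(/event/action[@name="plan_handle"]/value)[1]', 'VARCHAR(100)'), 1) AS [Plan Handle]
    FROM
        (SELECT CONVERT (XML, event_data) AS data FROM sys.fn_xe_file_target_read_file
            ('C:\SQLskills\EE_ExpensiveQueries*.xel', 'C:\SQLskills\EE_ExpensiveQueries*.xem', NULL, NULL)
        ) entries
    ) xe
CROSS APPLY sys.dm_exec_query_plan (xe.[Plan Handle]) qp;
GO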

Just for kicks, I went back into the context of the production database and ran the XML parsing query again so that the event session captured it, then pulled out its query plan. Try it yourself; it's pretty gnarly! :-)

Now, if I were going to make this more useful, I'd use a ring buffer target with a polling mechanism every few seconds, to make sure I can capture the graphical query plan for expensive queries before the plan is pushed out of cache. Maybe I'll get around to doing that and publish the complete solution.
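
As a rough idea of what that might look like (a sketch of my own, not the complete solution mentioned above), the session would use the ring_buffer target instead of the file target, and a job or loop running every few seconds would read the captured events out of the target via the DMVs:

-- Sketch only: with the session defined to use the ring buffer target, e.g.
--   ADD TARGET package0.ring_buffer (SET max_memory = 4096)
-- the captured events can be polled from the target like this:
SELECT CONVERT (XML, st.target_data) AS [Captured Events]
FROM sys.dm_xe_sessions s
JOIN sys.dm_xe_session_targets st
    ON s.[address] = st.event_session_address
WHERE s.[name] = 'EE_ExpensiveQueries'
    AND st.target_name = 'ring_buffer';
GO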

In the meantime, here’s a zip file containing the entire script that you can download and play with: EE_ExpensiveQueries.zip (1.72 kb)

Enjoy!

SQL Server 2008 High Availability whitepaper published on MSDN

The 35-page whitepaper on high availability I wrote for the SQL team over the summer has been published on MSDN. It's a 200-300 level whitepaper that describes the various high-availability technologies in SQL Server 2008 and how they can be used to mitigate disasters. It's chock-full of links to other whitepapers, technical articles, and Books Online sections, and it also presents my methodology for planning a high-availability strategy.

You can get it at http://msdn.microsoft.com/en-us/library/ee523927.aspx.

Enjoy!

Here’s the table of contents:

  • Introduction
  • Causes of Downtime and Data Loss
    • Planned Downtime
    • Unplanned Downtime and Data Loss
  • Planning a High-Availability Strategy
    • Requirements
    • Limitations
    • Technology Evaluation
  • SQL Server 2008 High-Availability Technologies
    • Logging and Recovery
    • Backup, Restore, and Related Technologies
      • Partial Database Availability and Online Piecemeal Restore
      • Instant File Initialization
      • Mirrored Backups
      • Backup Checksums
      • Backup Compression
    • Other Single-Instance Technologies
      • Online Operations
      • Database Snapshots
      • Hot-Add Memory and CPU
      • Resource Governor
    • Multi-Instance Technologies
      • Log Shipping
      • Transactional Replication
      • Database Mirroring
      • Failover Clustering
      • Combining Multi-Instance Technologies
      • Virtualization
  • Mitigating the Causes of Downtime and Data Loss
  • High-Availability Features Supported by SQL Server 2008 Editions
  • Conclusion

DBCC CHECKFILEGROUP bug on SQL Server 2008

Here's an interesting bug that surfaced recently, first reported by Bryan Smith on the MSDN disaster recovery/HA forum three weeks ago.

One of the mechanisms I advise for running consistency checks on VLDBs with multiple filegroups is to run successive DBCC CHECKFILEGROUP operations on the filegroups holding the partitions (see CHECKDB From Every Angle: Consistency Checking Options for a VLDB). Prior to SQL Server 2005 SP3, DBCC CHECKFILEGROUP would skip checking a table or index entirely if it was partitioned over multiple filegroups. From SQL Server 2005 SP3 onwards, DBCC CHECKFILEGROUP validates only the partitions of tables and indexes that reside on the filegroup being checked, rather than skipping the whole table or index, which is a big improvement.

Now it seems that SQL Server 2008 has a bug where it has essentially regressed to the old behavior: DBCC CHECKFILEGROUP will skip a table or index if it's not wholly contained on the filegroup being checked.

Here's a script you can use to test this. It creates a partitioned table over multiple filegroups and then runs DBCC CHECKFILEGROUP on the first partition. I'll discuss the results after the script.

CREATE DATABASE DBMaint2008;
GO
USE DBMaint2008;
GO

ALTER DATABASE DBMaint2008 ADD FILEGROUP DataPartition1;
ALTER DATABASE DBMaint2008 ADD FILEGROUP DataPartition2;
ALTER DATABASE DBMaint2008 ADD FILEGROUP DataPartition3;
GO

ALTER DATABASE DBMaint2008 ADD FILE
   (NAME = N'DataPartition1', FILENAME = N'C:\SQLskills\DataPartition1.ndf',
   SIZE = 10, FILEGROWTH = 10)
TO FILEGROUP DataPartition1;
GO
ALTER DATABASE DBMaint2008 ADD FILE
   (NAME = N'DataPartition2', FILENAME = N'C:\SQLskills\DataPartition2.ndf',
   SIZE = 10, FILEGROWTH = 10)
TO FILEGROUP DataPartition2;
GO
ALTER DATABASE DBMaint2008 ADD FILE
   (NAME = N'DataPartition3', FILENAME = N'C:\SQLskills\DataPartition3.ndf',
   SIZE = 10, FILEGROWTH = 10)
TO FILEGROUP DataPartition3;
GO

CREATE PARTITION FUNCTION Partitions_PFN (INT)
AS RANGE RIGHT FOR VALUES (1000, 2000);
GO

CREATE PARTITION SCHEME [Partitions_PS]
AS PARTITION [Partitions_PFN] TO (DataPartition1, DataPartition2, DataPartition3);
GO

CREATE TABLE TestTable (c1 INT IDENTITY, c2 DATETIME DEFAULT GETDATE ());
CREATE UNIQUE CLUSTERED INDEX TestPK ON TestTable (c1) ON Partitions_PS (c1);
GO

SET NOCOUNT ON;
GO
INSERT INTO TestTable DEFAULT VALUES;
GO 3000

DBCC CHECKFILEGROUP (DataPartition1);
GO
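
As a quick side check (my addition, not in the original script), you can confirm how the 3,000 rows are spread across the three partitions:

-- Sketch only: row counts per partition of the clustered index. With the RANGE RIGHT
-- boundaries at 1000 and 2000, this should show 999, 1000, and 1001 rows.
SELECT [partition_number], [row_count]
FROM sys.dm_db_partition_stats
WHERE [object_id] = OBJECT_ID (N'TestTable')
    AND [index_id] = 1
ORDER BY [partition_number];
GO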

On SQL Server 2005 SP3, the output from the final batch contains:

DBCC results for 'TestTable'.
Cannot process rowset ID 72057594038452224 of object "TestTable" (ID 2073058421), index "TestPK" (ID 1), because it resides on filegroup "DataPartition2" (ID 3), which was not checked. 
Cannot process rowset ID 72057594038517760 of object "TestTable" (ID 2073058421), index "TestPK" (ID 1), because it resides on filegroup "DataPartition3" (ID 4), which was not checked. 
There are 999 rows in 3 pages for object "TestTable".

This shows that it processed the 999 rows in the first partition, but not the other two, which is what we'd expect.

On SQL Server 2008, the output for TestTable is limited to:

Cannot process rowset ID 72057594038910976 of object "TestTable" (ID 2105058535), index "TestPK" (ID 1), because it resides on filegroup "DataPartition2" (ID 3), which was not checked.

And that's it: nothing about processing any rows in partition 1. This shows that DBCC didn't process the first partition, even though it should have. The problem becomes even more apparent with very large amounts of data, where DBCC CHECKFILEGROUP completes almost instantly.

As Bryan says in his post, Microsoft has acknowledged this is a bug and it should hopefully be fixed in 2008 CU5. In the meantime, this is something you should be aware of, as your tables may not be getting checked properly.
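
Until the fix ships, one possible workaround (my sketch, not something from Bryan's post or from Microsoft) is to find the tables whose partitions span more than one filegroup and run DBCC CHECKTABLE against them explicitly, since CHECKTABLE checks all of a table's partitions regardless of which filegroup they're on:

-- Sketch only: list tables whose in-row or row-overflow allocations span more than
-- one filegroup, so they can be checked explicitly with DBCC CHECKTABLE. (This also
-- catches tables that simply have indexes on different filegroups, which is harmless
-- for this purpose.)
SELECT OBJECT_NAME (p.[object_id]) AS [Table Name]
FROM sys.partitions p
JOIN sys.allocation_units au
    ON au.container_id = p.[partition_id] AND au.[type] IN (1, 3)
GROUP BY p.[object_id]
HAVING COUNT (DISTINCT au.data_space_id) > 1;
GO

-- For example, for the test table above:
-- DBCC CHECKTABLE (N'TestTable');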