DBCC CHECKDB Execution Memory Grants – Not Quite What You Expect

I was recently working on a client system that has problems running DBCC CHECKDB normally, due to the volume of concurrent data changes occurring against their 2TB OLTP database.  The challenge with running DBCC CHECKDB on this system is that the database snapshot (which DBCC CHECKDB creates) quickly exceeds the NTFS sparse file limits, and the check fails with error 665.

To work around this, I wrote a custom procedure for them that breaks the DBCC CHECKDB work down into its individual checks and spreads the process over roughly two weeks, within a 2-3 hour window each morning when the system has the least amount of other activity.
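
The client's actual procedure is more involved and isn't shown in this post, but a minimal sketch of the idea might look like the code below; a complete replacement for DBCC CHECKDB would also schedule DBCC CHECKALLOC and DBCC CHECKCATALOG, which DBCC CHECKDB normally runs as part of its checks.

-- Sketch: run DBCC CHECKTABLE against each user table so the checks
-- can be spread across multiple maintenance windows.
-- Assumes object names contain no quote characters.
DECLARE @schema SYSNAME, @table SYSNAME, @sql NVARCHAR(600);

DECLARE table_cursor CURSOR LOCAL FAST_FORWARD FOR
    SELECT s.name, t.name
    FROM sys.tables AS t
    INNER JOIN sys.schemas AS s
        ON t.schema_id = s.schema_id;

OPEN table_cursor;
FETCH NEXT FROM table_cursor INTO @schema, @table;

WHILE @@FETCH_STATUS = 0
BEGIN
    SET @sql = N'DBCC CHECKTABLE (N''' + @schema + N'.' + @table
             + N''') WITH ALL_ERRORMSGS, NO_INFOMSGS;';
    EXEC (@sql);
    FETCH NEXT FROM table_cursor INTO @schema, @table;
END

CLOSE table_cursor;
DEALLOCATE table_cursor;
GO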

While I was working on changes to this custom script, I noticed that DBCC CHECKTABLE was taking a ~96GB execution memory grant.  The server is a Dell R720 with 2 x Intel E5-2690 processors (2.9GHz, 8 cores each, and 16 logical cores per socket with Hyper-Threading enabled) and 512GB of RAM.  I mentioned this to Paul and it became something that we wanted to investigate further.

We have a similar R720 with multiple Fusion-IO PCIe SSDs installed in it as a test server, but it only has 64GB of RAM.  I had already set up a 500GB test database on SQL Server 2012 SP1 with CU3 for Paul’s DBCC CHECKDB performance tests, using AdventureWorks and my scripts to enlarge it by adding new objects and creating a number of enlarged versions of the SalesOrder* tables.  When Paul finished his tests, I decided to take a deeper look at how DBCC CHECKDB uses execution memory in SQL Server.

The SQL instance was configured with ‘max server memory’ set to 54000, which leaves roughly 4GB of available memory on the server at all times.  I then wrote a test harness that ran DBCC CHECKDB with ‘max degree of parallelism’ set at 32, 16, 8, and 4, and logged the start and end time, the tempdb usage, and the deltas of the wait, latch, and spinlock stats for each test.  Under the default configuration, DBCC CHECKDB acquired a 10GB memory grant from the instance.  I ran the test harness through four iterations of the tests and had the following averages:

Default Configuration Results
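
The full harness isn’t included here, but the core of a single iteration looked something like the following sketch; the database name is a placeholder, and the real harness captured the latch, spinlock, and tempdb usage deltas the same way the wait stats deltas are captured below.

-- Set the instance-level MAXDOP for this round of tests
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 16;
RECONFIGURE;
GO

-- Snapshot the wait stats before the test
SELECT wait_type, waiting_tasks_count, wait_time_ms
INTO #waits_before
FROM sys.dm_os_wait_stats;

DECLARE @start DATETIME2 = SYSDATETIME();

-- AdventureWorks_Enlarged is a placeholder for the 500GB test database
DBCC CHECKDB (N'AdventureWorks_Enlarged') WITH ALL_ERRORMSGS, NO_INFOMSGS;

-- Log the elapsed time for this test
SELECT DATEDIFF(SECOND, @start, SYSDATETIME()) AS elapsed_seconds;

-- Compute the wait stats deltas for this test
SELECT ws.wait_type,
       ws.waiting_tasks_count - wb.waiting_tasks_count AS waiting_tasks_delta,
       ws.wait_time_ms - wb.wait_time_ms AS wait_time_ms_delta
FROM sys.dm_os_wait_stats AS ws
INNER JOIN #waits_before AS wb
    ON ws.wait_type = wb.wait_type
WHERE ws.wait_time_ms > wb.wait_time_ms
ORDER BY wait_time_ms_delta DESC;

DROP TABLE #waits_before;
GO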

I then configured Resource Governor on the instance and created a Resource Pool that had MAX_MEMORY set at 10%, and a Workload Group in the pool with REQUEST_MAX_MEMORY_GRANT_PERCENT set at 25%, which yields a ~1GB maximum execution grant size for sessions assigned to the group.

CREATE RESOURCE POOL [Maint]
WITH (min_cpu_percent=0,
      max_cpu_percent=100,
      min_memory_percent=0,
      max_memory_percent=10,
      cap_cpu_percent=100,
      AFFINITY SCHEDULER = AUTO);
GO
CREATE WORKLOAD GROUP [wg_Maint]
WITH (group_max_requests=0,
      importance=Medium,
      request_max_cpu_time_sec=0,
      request_max_memory_grant_percent=25,
      request_memory_grant_timeout_sec=0,
      max_dop=0)
USING [Maint];
GO

Next I created a dbcc_user login on the server, specifically for performing DBCC CHECKDB, added it to the sysadmin server role, and created a classifier function to place connections from this login into the wg_Maint group for testing the effects of reducing the memory grant for DBCC CHECKDB.

USE [master];
GO
CREATE LOGIN [dbcc_user] WITH PASSWORD=N'R3@lly$tr0ngP@$$w0rd!', DEFAULT_DATABASE=[master], CHECK_EXPIRATION=OFF, CHECK_POLICY=OFF;
GO
ALTER SERVER ROLE [sysadmin] ADD MEMBER [dbcc_user];
GO

-- Create the classifier function
CREATE FUNCTION fnRGClassifier()
RETURNS SYSNAME
WITH SCHEMABINDING
AS
BEGIN
    DECLARE @group SYSNAME;
    IF (SUSER_NAME() = 'dbcc_user')
    BEGIN
        SET @group = N'wg_Maint';
    END
    -- Use the default workload group if there is no match on the lookup.
    ELSE
    BEGIN
        SET @group = N'default';
    END
    RETURN @group;
END
GO

-- Reconfigure the Resource Governor to use the new function
ALTER RESOURCE GOVERNOR WITH (CLASSIFIER_FUNCTION = dbo.fnRGClassifier);
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO
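
Before relying on the classifier, it’s worth confirming that sessions from the login actually land in the wg_Maint group; a quick check like the following (not part of the original script) shows the workload group for each dbcc_user session.

-- Confirm that dbcc_user sessions are classified into wg_Maint
SELECT s.session_id,
       s.login_name,
       wg.name AS workload_group
FROM sys.dm_exec_sessions AS s
INNER JOIN sys.dm_resource_governor_workload_groups AS wg
    ON s.group_id = wg.group_id
WHERE s.login_name = N'dbcc_user';
GO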

Next I changed the test harness to log in as the dbcc_user and reran the series of tests an additional four times, with the following averages for the results of each test:

1GB Memory Grant Results

Reviewing the data, you will see that the elapsed time decreased for all of the tests, while the tempdb usage did not increase significantly as a result of running with only 1GB of execution memory granted.  This reduction in execution memory allowed the remaining 9GB to be used by the data cache for database pages during the tests, which was confirmed by capturing the Memory Manager and Buffer Manager performance counters for the instance during the testing.
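
Those counters are also exposed through sys.dm_os_performance_counters, so a query along the following lines can be used to watch the trade-off during a test run (a sketch, not the exact capture script I used).

-- Workspace memory counters are reported in KB;
-- Database pages is a count of 8KB pages in the buffer pool
SELECT object_name, counter_name, cntr_value
FROM sys.dm_os_performance_counters
WHERE (object_name LIKE '%Memory Manager%'
       AND counter_name IN (N'Granted Workspace Memory (KB)',
                            N'Maximum Workspace Memory (KB)'))
   OR (object_name LIKE '%Buffer Manager%'
       AND counter_name = N'Database pages');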

I then decided to see what effect other grant sizes had, and ran the tests at 2GB, 520MB, and 240MB by changing the Workload Group REQUEST_MAX_MEMORY_GRANT_PERCENT to values of 50%, 13%, and 6% respectively, running the same series of tests four additional times per configuration (the change required for each configuration is shown after the results).  The following averages were observed after all of the tests were completed:

Additional Lower Memory Grant Results
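
Each of these configurations only requires a small change to the workload group followed by a reconfigure; for example, the 520MB (13%) configuration would be set up like this:

ALTER WORKLOAD GROUP [wg_Maint]
WITH (REQUEST_MAX_MEMORY_GRANT_PERCENT = 13)
USING [Maint];
ALTER RESOURCE GOVERNOR RECONFIGURE;
GO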

From the results, the only time that tempdb usage increased was for the 240MB execution grant size, and then only for the DOP 32 and DOP 16 tests.  The total increase in tempdb usage was roughly 300MB in the worst case, which is considerably smaller than the data cache space regained by reducing the execution memory grant size for DBCC CHECKDB.  The fastest average execution time for this specific database was obtained by using 520MB for all of the levels of DOP that were tested.

After completing this testing, Paul and I spent an hour on a call reviewing the spinlock stats, the wait stats, and the latch stats for each of the tests to try to identify why performance improved when the execution memory grant was reduced with Resource Governor like this.  The only thing that pointed to an explanation for the improvement was the larger amount of memory available for database pages as a result of reducing the execution memory grant.

Additionally, using Extended Events, I attempted to collect the query_post_execution_showplan events for the executions under the default configuration and under the constrained Resource Governor configuration.  I was only able to capture parts of the actual execution plan information, but the plans returned were identical with the exception of the memory grant information for RequestedMemory, which matched the limitations that were in place for the Resource Governor tests.
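
The session definition isn’t reproduced in this post, but one way to capture the event for just the dbcc_user sessions looks like the following sketch; the session name, file path, and filter are illustrative, and query_post_execution_showplan is expensive enough that it shouldn’t be left running on a production system.

-- Capture actual showplans for the dbcc_user sessions only
CREATE EVENT SESSION [SQLskills_CollectShowplan]
ON SERVER
ADD EVENT sqlserver.query_post_execution_showplan
(
    ACTION (sqlserver.session_id, sqlserver.sql_text)
    WHERE (sqlserver.username = N'dbcc_user')
)
ADD TARGET package0.event_file
(SET filename = N'C:\Temp\CollectShowplan.xel'); -- path is a placeholder
GO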

The most interesting item from the execution plan, however, was the value for the desired memory, which was 45,182,976,776 KB, or roughly 43TB.  I don’t yet have an answer for why this occurs, but I hope to figure it out at some point.  It appears that the costing model for DBCC CHECKDB has limitations that weren’t apparent years ago, when servers didn’t have the amount of memory available in today’s servers.  Even running DBCC CHECKDB against the master database, which is 5MB in size on this instance, requests a ~2GB memory grant for execution, so it really seems to point to an excessive costing algorithm for DBCC CHECKDB inside of SQL Server.
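
If you want to see these numbers on your own system, the grant for a running check is visible from another session through sys.dm_exec_query_memory_grants, where ideal_memory_kb reflects the desired memory from the costing model.

-- Run from a second session while DBCC CHECKDB is executing
SELECT session_id,
       requested_memory_kb,
       granted_memory_kb,
       ideal_memory_kb,    -- the "desired memory" from the costing model
       max_used_memory_kb
FROM sys.dm_exec_query_memory_grants;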

(From Paul: the way that costing is performed for DBCC CHECKDB is based on the expected number of bits of information DBCC CHECKDB will create in the rowset it gives to the query processor to sort, hash, and give back to DBCC again. This cardinality estimation method has been the same since SQL Server 2000.)

In summary, I’ll be reviewing our clients’ systems and implementing the Resource Governor method I described above on systems where it will make a difference to performance by not taking so much of the buffer pool. I suggest you consider doing this too.

Tracking SQL Server Database Usage

One of the challenges with inheriting an existing set of SQL Servers and databases when you start a new job can be determining which of the databases are actively being used, especially on development and testing systems where there may be multiple copies of the same database that were used by different projects over time. This can also be a challenge for multi-tenant software as a service (SaaS) providers that create a new database for each client they provide service to. An easy way to track whether a database is being used is with Extended Events and the lock_acquired event, filtering for the shared transaction workspace (SharedXactWorkspace) lock that is acquired any time a user connects to the database.

To start off, we first need to look up the columns returned by the lock_acquired event, and also look up the map values associated with any of the columns so that we know the correct values to use in our event session definition.

-- Look up the lock_acquired event columns
SELECT 
	name,
	column_id,
	type_name
FROM sys.dm_xe_object_columns
WHERE object_name = N'lock_acquired' AND
	column_type = N'data';

-- Look up the values for the Lock Resource Type and the Lock Owner Type
SELECT 
	name,
	map_key,
	map_value
FROM sys.dm_xe_map_values
WHERE name IN (N'lock_resource_type',
               N'lock_owner_type');

From this, we get map_key = 2 for the DATABASE lock_resource_type and map_key = 4 for the SharedXactWorkspace lock_owner_type. With these values, we can define our event session to track how frequently this lock occurs by database_id, and leverage the bucketizer/histogram target to bucket the data automatically. Since the target name and output changed slightly in SQL Server 2012, two version-specific examples of the event session and the event parsing code are presented below:

SQL Server 2008 Event Session

-- If the Event Session Exists, drop it first
IF EXISTS (SELECT 1 
			FROM sys.server_event_sessions 
			WHERE name = 'SQLskills_DatabaseUsage')
	DROP EVENT SESSION [SQLskills_DatabaseUsage] 
	ON SERVER;

-- Create the Event Session
CREATE EVENT SESSION [SQLskills_DatabaseUsage] 
ON SERVER 
ADD EVENT sqlserver.lock_acquired( 
	WHERE owner_type = 4 -- SharedXactWorkspace
	  AND resource_type = 2 -- Database level lock
	  AND database_id > 4 -- non system database
	  AND sqlserver.is_system = 0 -- must be a user process
) 
ADD TARGET package0.asynchronous_bucketizer
( SET slots = 32, -- Adjust based on number of databases in instance
	  filtering_event_name='sqlserver.lock_acquired', -- aggregate on the lock_acquired event
	  source_type=0, -- event data and not action data
	  source='database_id' -- aggregate by the database_id
)
WITH (MAX_DISPATCH_LATENCY = 1 SECONDS); -- dispatch immediately and don't wait for full buffers
GO

-- Start the Event Session
ALTER EVENT SESSION [SQLskills_DatabaseUsage] 
ON SERVER 
STATE = START;
GO

-- Parse the session data to determine the databases being used.
SELECT  slot.value('./@count', 'int') AS [Count] ,
        DB_NAME(slot.query('./value').value('.', 'int')) AS [Database]
FROM
(
	SELECT CAST(target_data AS XML) AS target_data
	FROM sys.dm_xe_session_targets AS t
    INNER JOIN sys.dm_xe_sessions AS s 
		ON t.event_session_address = s.address
	WHERE   s.name = 'SQLskills_DatabaseUsage'
	  AND t.target_name = 'asynchronous_bucketizer') AS tgt(target_data)
CROSS APPLY target_data.nodes('/BucketizerTarget/Slot') AS bucket(slot)
ORDER BY slot.value('./@count', 'int') DESC

GO

SQL Server 2012 Event Session

-- If the Event Session Exists, drop it first
IF EXISTS (SELECT 1 
			FROM sys.server_event_sessions 
			WHERE name = 'SQLskills_DatabaseUsage')
	DROP EVENT SESSION [SQLskills_DatabaseUsage] 
	ON SERVER;

-- Create the Event Session
CREATE EVENT SESSION [SQLskills_DatabaseUsage] 
ON SERVER 
ADD EVENT sqlserver.lock_acquired( 
	WHERE owner_type = 4 -- SharedXactWorkspace
	  AND resource_type = 2 -- Database level lock
	  AND database_id > 4 -- non system database
	  AND sqlserver.is_system = 0 -- must be a user process
) 
ADD TARGET package0.histogram
( SET slots = 32, -- Adjust based on number of databases in instance
	  filtering_event_name='sqlserver.lock_acquired', -- aggregate on the lock_acquired event
	  source_type=0, -- event data and not action data
	  source='database_id' -- aggregate by the database_id
)
WITH (MAX_DISPATCH_LATENCY = 1 SECONDS); -- dispatch immediately and don't wait for full buffers
GO

-- Start the Event Session
ALTER EVENT SESSION [SQLskills_DatabaseUsage] 
ON SERVER 
STATE = START;
GO

-- Parse the session data to determine the databases being used.
SELECT  slot.value('./@count', 'int') AS [Count] ,
        DB_NAME(slot.query('./value').value('.', 'int')) AS [Database]
FROM
(
	SELECT CAST(target_data AS XML) AS target_data
	FROM sys.dm_xe_session_targets AS t
    INNER JOIN sys.dm_xe_sessions AS s 
		ON t.event_session_address = s.address
	WHERE   s.name = 'SQLskills_DatabaseUsage'
	  AND t.target_name = 'histogram') AS tgt(target_data)
CROSS APPLY target_data.nodes('/HistogramTarget/Slot') AS bucket(slot)
ORDER BY slot.value('./@count', 'int') DESC

GO

One thing to keep in mind with this event session is that while an end user might not actually use a database, other tasks like maintenance, backups, DBCC CHECKDB, or even using IntelliSense in SQL Server Management Studio will. It is therefore expected that databases not being used by end users will still show up inside of the histogram target, but their frequency of usage will be significantly lower than that of the databases actively being used by end users or applications.

SQL Saturday #192 Content Uploaded

I know that it is somewhat late, but I uploaded my slide decks and demos from SQL Saturday #192 in Tampa, FL to the SQL Saturday website earlier today. For simplicity in finding the information, the links to each of the sessions are below:

Performance Tuning by Digging into the Plan Cache

Successfully Virtualizing SQL Server in Production

Free Tools for the DBA

For those who attended the sessions, once again thank you for your time, and I hope you gained something from the content that was presented.  Next up is SQL Saturday #202 in Edinburgh, UK, where I’ll be co-presenting a precon with Joe Sack on SQL Server 2012 New Features.  Hope to see you there if you are local.