SQLskills SQL101: Query plans based on what’s in memory

As Kimberly blogged about recently, SQLskills is embarking on a new initiative to blog about basic topics, which we’re calling SQL101. We’ll all be blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.

One of the topics I discussed in class today is why the query optimizer doesn’t know (or care) what’s in the buffer pool. (The query optimizer is the part of the query processor that’s responsible for compiling an efficient query plan, and the buffer pool is the cache of database data file pages that are in memory.)
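As an aside, you can see exactly which pages are in the buffer pool at any moment using the sys.dm_os_buffer_descriptors DMV. Here’s a minimal sketch that counts buffer pool pages per index in the current database (in-row and row-overflow data only, to keep things simple):

SELECT
    OBJECT_NAME([p].[object_id]) AS [TableName],
    [i].[name] AS [IndexName],
    COUNT(*) AS [PagesInBufferPool]
FROM sys.dm_os_buffer_descriptors AS [bd]
JOIN sys.allocation_units AS [au]
    ON [au].[allocation_unit_id] = [bd].[allocation_unit_id]
JOIN sys.partitions AS [p]
    ON [p].[hobt_id] = [au].[container_id]
    AND [au].[type] IN (1, 3) -- in-row and row-overflow data only
JOIN sys.indexes AS [i]
    ON [i].[object_id] = [p].[object_id]
    AND [i].[index_id] = [p].[index_id]
WHERE [bd].[database_id] = DB_ID()
GROUP BY [p].[object_id], [i].[name]
ORDER BY [PagesInBufferPool] DESC;
GO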

Let’s investigate…

Scenario

Here’s a scenario:

  • Table T has two nonclustered indexes, A and B, that both cover query Q (a simple SELECT query)
  • Query Q will require a complete index scan of either index
  • Index A has 10,000 pages at its leaf level
  • Index B has 50,000 pages at its leaf level

Which index will the query optimizer use when compiling the query plan?

Cost-based…

SQL Server uses a cost-based optimizer, which uses various metrics and statistics to determine the most efficient query plan for the query (given the time limits imposed on its search of the space of all possible query plans). The ‘cost’ in ‘cost-based’ means that it considers the CPU cost and I/O cost of the various operators in the query plan, with the I/O cost essentially being relative to the number of physical reads required. And it assumes that nothing is in memory.

In the scenario above, the optimizer will choose a query plan using index A, as the most efficient plan will be the one involving the fewest physical reads, and with such a large difference between the page counts of indexes A and B, index A will be chosen for sure.
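If you want to see this for yourself, you can force each index in turn with an index hint and compare the estimated costs in the two query plans. A minimal sketch, with hypothetical table, column, and index names:

SELECT [Col1], [Col2]
FROM [T] WITH (INDEX ([A])); -- 10,000 leaf-level pages
GO

SELECT [Col1], [Col2]
FROM [T] WITH (INDEX ([B])); -- 50,000 leaf-level pages
GO

The estimated I/O cost of the second plan will be roughly five times higher, because the optimizer costs the scan as if every page has to be physically read.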

Hypothetical memory-based…

Now let’s allow a hypothetical optimizer to base its plan choice on what’s in the buffer pool.

If index A is mostly not in the buffer pool and index B is mostly in the buffer pool, it would be more efficient to compile the query plan to use index B for a query running at that instant. Even though index B is larger, and would need more CPU cycles to scan through, physical reads are waaaay more expensive (in terms of elapsed time) than CPU cycles, so the more efficient query plan is the one that minimizes the number of physical reads.

This argument only holds, and a ‘use index B’ query plan is only more efficient than a ‘use index A’ query plan, if index B remains mostly in memory, and index A remains mostly not in memory. As soon as the relative proportions of indexes A and B that are in memory become such that the ‘use index A’ query plan would be more efficient, the ‘use index B’ query plan is the wrong choice.

The situations when the compiled ‘use index B’ plan is less efficient than the cost-based ‘use index A’ plan are (generalizing):

  • Indexes A and B are both memory resident: the compiled plan will use roughly 5 times more CPU than the optimal plan, as there are 5 times more pages to scan.
  • Neither index is memory resident: the compiled plan will do 5 times the number of physical reads AND use roughly 5 times more CPU.
  • Index A is memory resident and index B isn’t: all physical reads performed by the plan are extraneous, AND it will use roughly 5 times more CPU.

This means that the ‘use index B’ plan is really only the optimal plan at the time the query was compiled.

So although a hypothetical optimizer could make use of buffer pool contents knowledge to compile a query plan that is the most efficient at a single instant, it would be a very dangerous way to drive plan compilation: the buffer pool contents are potentially so volatile that the future efficiency of the cached compiled plan would be highly unreliable.

And I also haven’t mentioned the extra cost of maintaining buffer pool contents knowledge in real time, and then potentially having to recompile queries that are now deemed to be inefficient because buffer pool contents have changed.

Summary

Although it doesn’t always get it right, the optimizer strives to produce the most efficient plan, assuming nothing is in the buffer pool. Understanding how the query optimizer comes to plan choice decisions is extremely useful for understanding query plans themselves and relating them to the code driving the plan.

Hope you found this helpful!

Reconciling set-based operations with row-by-row iterative processing

Yesterday in class we had a discussion around the conceptual problem of reconciling the fact that SQL Server does set-based operations, but that it does them in query plans that pass single rows around between operators. In other words, it uses iterative processing to implement set-based operators.

The crux of the discussion is: if SQL Server is passing single rows around, how is that a set-based operation?

I explained it in two different ways…

SQL Server Example

This explanation compares two ways of doing the following logical operation using SQL Server: update all the rows in the Products table where ProductType = 1 and set the Price field to be 10% higher.

The cursor based way (row-by-agonizing-row, or RBAR) would be something like the following:

DECLARE @ProductID   INT;
DECLARE @Price       FLOAT;

DECLARE [MyUpdate] CURSOR FAST_FORWARD FOR
SELECT [ProductID], [Price]
FROM [Products]
WHERE [ProductType] = 1;

OPEN [MyUpdate];

FETCH NEXT FROM [MyUpdate] INTO @ProductID, @Price;

WHILE @@FETCH_STATUS = 0
BEGIN
    UPDATE [Products]
    SET [Price] = @Price * 1.1
    WHERE [ProductID] = @ProductID;

    FETCH NEXT FROM [MyUpdate] INTO @ProductID, @Price;
END

CLOSE [MyUpdate];
DEALLOCATE [MyUpdate];

This method has to set up a scan over the Products table based on the ProductType, and then runs a separate UPDATE transaction for each row returned from the scan, incurring all the overhead of setting up the UPDATE query, starting the transaction, seeking to the correct row based on the ProductID, updating it, and tearing down the transaction and query framework again each time.

The set-based way of doing it would be:

UPDATE [Products]
SET [Price] = [Price] * 1.1
WHERE [ProductType] = 1;

This will have a single scan based on the ProductType, which will update the rows matching the ProductType, but the query, transaction, and scan are only set up once, and then all the rows are processed, one at a time, inside SQL Server.

The difference is that in the set-based way, all the iteration is done inside SQL Server, in the most efficient way it can, rather than manually iterating outside of SQL Server using the cursor.
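If you want to measure the difference yourself, a simple approach (a minimal sketch using the standard statistics options) is to wrap each version like this and compare the output in the Messages tab:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;
GO

-- Run the cursor version here, then the set-based version,
-- and compare the logical reads, CPU time, and elapsed time reported

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;
GO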

Non-Technical Example

This explanation involves a similar problem but not involving SQL Server. Imagine you need to acquire twelve 4′ x 8′ plywood sheets from your local home improvement store.

You could drive to and from the store twelve times; each time you’d need to go into the store, purchase the sheet, and wait for a staff member to become available to load the sheet into your pickup truck, then drive home and unload it.

Or you could drive to the store once and purchase all twelve sheets in one go, with maybe four staff members making three trips each out to your pickup, carrying one sheet each time. Or even just one staff member making twelve trips out to your pickup.

Which method is more efficient? Multiple trips to the store or one trip to the store, no matter how many staff members are available to carry the sheets out?

Summary

No-one in their right mind is going to make twelve trips to the home improvement store when one will suffice. Just like no developer should be writing cursor/RBAR code to perform an operation that SQL Server can do in a set-based manner (when possible).

Set-based operations don’t mean that SQL Server processes the whole set at once – that’s clearly not possible, as most sets have more rows than your server has processors (so all the rows in the set simply *can’t* be processed at the same time, even if all processors were running the same code at the same time). Instead, it means that SQL Server can process the set very, very efficiently by only constructing the processing framework (i.e. the query plan with operators, scans, etc.) for the operation once, and then iterating over the set of rows inside this framework.

PS Check out the technical comment from Conor Cunningham below (Architect on the SQL Server team, and my counterpart on the Query Optimizer when I was a Dev Lead in the Storage Engine for SQL Server 2005).

DBCC CHECKDB performance and computed-column indexes

[Edit 2016: The team ‘fixed’ the problem in SQL Server 2016 by skipping consistency checking these indexes unless WITH EXTENDED_LOGICAL_CHECKS is used.]

It’s no secret that DBCC CHECKDB has some performance quirks based on the schema of the database being checked and various kinds of corruptions. I was recently doing some scalability testing of DBCC CHECKDB for a blog post and discovered quite a nasty performance issue that exists in all versions of SQL Server back to SQL Server 2005. This isn’t a bug, it’s just the way things work.

The issue occurs when a nonclustered index exists that has a computed column as part of the index key or as one of the INCLUDEd columns, and it affects DBCC CHECKDB, DBCC CHECKFILEGROUP, and DBCC CHECKTABLE.

As a bit of background, here’s an excerpt from the SQL Server 2012 Internals book where I describe one of the ways nonclustered indexes are checked for consistency:

The nonclustered index cross-checks verify that:

  • Every record in a nonclustered index (whether filtered or nonfiltered) must map to a valid record in the base table (that is, the heap or clustered index).
  • Every record in the base table must map to exactly one record in each nonfiltered, nonclustered index, and one record in each filtered index, where the filter allows.

The mechanism to carry out these checks efficiently has changed in every release since SQL Server 7.0—becoming progressively more and more efficient. In SQL Server 2012, two hash tables are created for each partition of each nonclustered index—one hash table is for the actual records in that partition of the nonclustered index, and the other is for the records that should exist in that partition of the nonclustered index (as calculated from the existing data records in the table).

When a nonclustered index record is processed, all columns in the record are hashed together into a BIGINT value. This includes:

  • The physical or logical link back to the base table (known as the base table RID)
  • All included columns (even LOB and FILESTREAM values)

The resulting value is added to the master hash value for actual records for the nonclustered index partition of which the record is part.

DBCC CHECKDB knows which nonclustered indexes exist for the table and what the complete nonclustered index record composition should be for each. When a data record is processed, the following algorithm is run for each matching nonclustered index record that should exist for the data record (taking into account any filter predicates for filtered nonclustered indexes):

  1. Create the nonclustered index record in memory (again, including the base table RID, plus included columns).
  2. Hash all columns in the index record together into a BIGINT value.
  3. Add the resulting value to the “should exist” master hash value for the relevant nonclustered index partition of which the index record is part.

The premise that this algorithm works on is that if no corruptions exist, the master hash values for the actual records and “should exist” records for each nonclustered index partition should match exactly at the end of the DBCC CHECKDB batch.
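As a rough analogy only (this is emphatically not how DBCC CHECKDB is implemented internally, and the table and index names below are hypothetical), you can picture the comparison as two order-independent aggregate checksums that must agree:

-- Toy analogy: aggregate checksum of the actual nonclustered index rows...
SELECT CHECKSUM_AGG(BINARY_CHECKSUM([ProductType], [ProductID])) AS [ActualRows]
FROM [Products] WITH (INDEX ([IX_Products_ProductType]));
GO

-- ...versus the aggregate checksum of the rows that *should* exist,
-- computed from a scan of the base table
SELECT CHECKSUM_AGG(BINARY_CHECKSUM([ProductType], [ProductID])) AS [ShouldExistRows]
FROM [Products] WITH (INDEX (0)); -- force a base table scan
GO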

When a nonclustered index uses a computed column, the value of the computed column has to be computed based on the column definition. To do that, an internal mechanism called an ‘expression evaluator’ is created. The expression evaluator is provided by the Query Processor code and its behavior is entirely outside the control of DBCC CHECKDB. The drawback of the expression evaluator is that every time it is used, an exclusive latch must be held by the thread using it. This creates an incredible bottleneck and drastically affects performance.
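For example, an index like this one (a hypothetical schema, just to illustrate the pattern) is enough to force the expression evaluator to run for every row processed during the consistency checks:

CREATE TABLE [OrderItems] (
    [OrderItemID] INT IDENTITY NOT NULL PRIMARY KEY,
    [Quantity]    INT NOT NULL,
    [UnitPrice]   MONEY NOT NULL,
    -- Deterministic, precise computed column, so it can be indexed
    [TotalPrice]  AS ([Quantity] * [UnitPrice])
);
GO

-- Having the computed column as a key (or INCLUDEd) column is what
-- triggers the expression evaluator during DBCC CHECKDB
CREATE NONCLUSTERED INDEX [IX_OrderItems_TotalPrice]
ON [OrderItems] ([TotalPrice]);
GO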

My test system is the Dell R720 we bought last year, with 64GB of memory and two E5-2670 CPUs (8 physical cores each, with hyperthreading enabled). It’s running SQL Server 2012 SP1 CU3.

The test database is AdventureWorks that Jonathan expanded to 500GB for me using his cool scripts. The database is split over 8 data files stored on two 320GB Fusion-io drives, with tempdb and its log placed on two more 320GB Fusion-io drives. I set things up this way to remove I/O waits from the test.

The first few runs of my testing, with unlimited parallelism, produced DBCC CHECKDB elapsed times of around 340 minutes. This seemed incredibly slow to me, so I ran another test and looked at the output of sys.dm_os_waiting_tasks. I discovered that half the threads at any time were waiting for exclusive access to the DBCC_OBJECT_METADATA latch. Adding in some diagnostics to capture waits and latches gave the data below:

WaitType        Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
--------------- ---------- ----------- --------- ---------- ----------- ---------- --------- ---------
CXPACKET        684482.45  682667.13   1815.32   60294236   34.90       0.0114     0.0113    0.0000
OLEDB           659325.08  659325.08   0.00      207661462  33.62       0.0032     0.0032    0.0000
LATCH_EX        615168.39  605357.28   9811.11   798224634  31.37       0.0008     0.0008    0.0000

LatchClass               Wait_S     WaitCount  Percentage  AvgWait_S
------------------------ ---------- ---------- ----------- ----------
DBCC_OBJECT_METADATA     611768.00  764845636  99.16       0.0008
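For reference, the raw numbers behind these tables come from the sys.dm_os_wait_stats and sys.dm_os_latch_stats DMVs. A minimal sketch of the kind of queries involved (without the clearing, snapshotting, and aggregation I wrapped around the test runs):

SELECT [wait_type], [waiting_tasks_count], [wait_time_ms], [signal_wait_time_ms]
FROM sys.dm_os_wait_stats
ORDER BY [wait_time_ms] DESC;
GO

SELECT [latch_class], [waiting_requests_count], [wait_time_ms]
FROM sys.dm_os_latch_stats
ORDER BY [wait_time_ms] DESC;
GO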

This didn’t make any sense to me as I couldn’t believe that such an incredible bottleneck existed – almost 1ms per latch wait and 800 million latch waits! – surely we’d have heard about it by now? I pinged my friend Ryan Stonecipher who inherited the DBCC CHECKDB code from me and he reminded me about the expression evaluator behavior.

I went back to my test system and looked for nonclustered indexes that use computed columns, using the code below:

SELECT
    [s].[name],
    [o].[name],
    [i].[name],
    [co].[name],
    [ic].*
FROM sys.columns [co]
JOIN sys.index_columns [ic]
    ON [ic].[object_id] = [co].[object_id]
    AND [ic].[column_id] = [co].[column_id]
JOIN sys.indexes [i]
    ON [i].[object_id] = [ic].[object_id]
    AND [i].[index_id] = [ic].[index_id]
JOIN sys.objects [o]
    ON [i].[object_id] = [o].[object_id]
JOIN sys.schemas [s]
    ON [o].[schema_id] = [s].[schema_id]
WHERE [co].[is_computed] = 1;
GO

And I found 6 nonclustered indexes with computed columns, on some of the largest tables in the database. I then disabled those nonclustered indexes and re-ran the DBCC CHECKDB tests.

17-18 minutes per run.

Wow! The bottleneck from the expression evaluator was causing DBCC CHECKDB to run about 20 times slower.

The wait and latch details for the test runs with the indexes disabled are as below, which is what I expected to see from running DBCC CHECKDB:

WaitType        Wait_S     Resource_S  Signal_S  WaitCount  Percentage  AvgWait_S  AvgRes_S  AvgSig_S
--------------- ---------- ----------- --------- ---------- ----------- ---------- --------- ---------
CXPACKET        33064.56   29331.74    3732.83   42034533   48.14       0.0008     0.0007    0.0001
OLEDB           28883.78   28883.78    0.00      173940154  42.05       0.0002     0.0002    0.0000
LATCH_EX        5021.20    4605.50     415.70    30340659   7.31        0.0002     0.0002    0.0000

LatchClass               Wait_S     WaitCount  Percentage  AvgWait_S
------------------------ ---------- ---------- ----------- ----------
DBCC_CHECK_AGGREGATE     5039.36    30267451   98.82       0.0002

There’s nothing we can do about this bottleneck except to disable nonclustered indexes on computed columns while DBCC CHECKDB is running and then rebuild them afterwards, which is not a very palatable solution. I wanted to blog about this just to make the issue known and help any of you who are beating your head against a wall trying to figure out poor DBCC CHECKDB performance.
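If you do decide to go that route, the pattern is straightforward (a minimal sketch, with hypothetical database, table, and index names):

-- Disable the offending nonclustered index (the metadata stays, the data is dropped)
ALTER INDEX [IX_BigTable_Computed] ON [BigTable] DISABLE;
GO

DBCC CHECKDB (N'ProductionDB') WITH NO_INFOMSGS, ALL_ERRORMSGS;
GO

-- Rebuilding re-enables the index and repopulates it
ALTER INDEX [IX_BigTable_Computed] ON [BigTable] REBUILD;
GO

Just be aware that while the index is disabled, queries can’t use it, and the rebuild afterwards is a size-of-index operation.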

[Edit 12/2/2014: Here’s another set of performance tests using a kCura database schema of various sizes.]