As Kimberly blogged about recently, SQLskills is embarking on a new initiative to blog about basic topics, which we’re calling SQL101. We’ll all be blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.
One of the topics that I discuss in class today is why the query optimizer doesn’t know (or care) what’s in the buffer pool. (The query optimizer is the part of the query processor that’s responsible for compiling an efficient query plan, and the buffer pool is the cache of database data file pages that are in memory.)
Here’s a scenario:
- Table T has two nonclustered indexes, A and B, that both cover query Q (a simple SELECT query)
- Query Q will require a complete index scan of either index
- Index A has 10,000 pages at its leaf level
- Index B has 50,000 pages at its leaf level
Which index will the query optimizer use when compiling the query plan?
SQL Server uses a cost-based optimizer, which uses various metrics and statistics to determine the most efficient query plan for the query (given the time limits imposed on its search of the space of all possible query plans). The ‘cost’ in ‘cost-based’ means that it considers the CPU cost and I/O cost of the various operators in the query plan, with the I/O cost essentially being relative to the number of physical reads required. And it assumes that nothing is in memory.
In the scenario above, the optimizer will choose a query plan using index A, as the most efficient plan will be the one involving the fewest phsyical reads and with such a large difference between the page counts of indexes A and B, index A will be chosen for sure.
Now let’s allow a hypothetical optimizer to base its plan choice on what’s in the buffer pool.
If index A is mostly not in the buffer pool and index B is mostly in the buffer pool, it would be more efficient to compile the query plan to use index B, for a query running at that instant. Even though index B is larger, and would need more CPU cycles to scan through, physical reads are waaaay more expensive (in terms of elapsed time) than CPU cycles so a more efficient query plan is the one that minimizes the number of physical reads.
This argument only holds, and a ‘use index B’ query plan is only more efficient than a ‘use index A’ query plan, if index B remains mostly in memory, and index A remains mostly not in memory. As soon as the relative proportions of indexes A and B that are in memory become such that the ‘use index A’ query plan would be more efficient, the ‘use index B’ query plan is the wrong choice.
The situations when the compiled ‘use index B’ plan is less efficient than the cost-based ‘use index A’ plan are (generalizing):
- Indexes A and B are both memory resident: the compiled plan will use roughly 5 times more CPU than the optimal plan, as there are 5 times more pages to scan.
- Neither index is memory resident: the compiled plan will do 5 times the number of physical reads AND use roughly 5 times more CPU.
- Index A is memory resident and index B isn’t: all physical reads performed by the plan are extraneous, AND it will use roughly 5 times more CPU.
This means that the ‘use index B’ plan is really only the optimal plan at the time the query was compiled.
So although a hypothetical optimizer could make use of buffer pool contents knowledge to compile a query that is the most efficient at a single instant, it would be a very dangerous way to drive plan compilation because of the potential volatility of the buffer pool contents, making the future efficiency of the cached compiled plan highly unreliable.
And I also haven’t mentioned the extra cost of maintaining buffer pool contents knowledge in real time, and then potentially having to recompile queries that are now deemed to be inefficient because buffer pool contents have changed.
Although it doesn’t always get it right, the optimizer strives to produce the most efficient plan, assuming nothing is in the buffer pool. Understanding how the query optimizer comes to plan choice decisions is extremely useful for understanding query plans themselves and relating them to the code driving the plan.
Hope you found this helpful!