Common Table Expressions (CTEs), Window Functions, and Views

Common Table Expressions (CTEs), window functions, and views are all staples of modern SQL Server development. When CTEs and window functions were introduced in SQL Server 2005, many developers jumped in and began using them to solve problems and simplify code. While these tools can be a great benefit in SQL Server, they can also cause significant performance problems in certain scenarios. I recently engaged in some performance tuning with a client after a code change was deployed to their production server that resulted in CPU usage of nearly 100% across all 32 logical processors. The resulting performance problems in their application made this a critical situation for the client. The fix required a few hours of rewriting a couple of the modified stored procedures to bring CPU usage back down to the normal range of 10-20%.

[Figure: CPU usage initially]
[Figure: CPU usage after the rewrites]

The root cause of the problem was something I consider a coding anti-pattern: using a CTE with a window function inside of a view. As we’ll see in this post, mixing these three together can result in significant performance problems, because the intermediate results of the view must be fully materialized before any filtering can be applied.

Setting Up the Problem

To demonstrate the problem, I’ll first create a new table in AdventureWorks2012 so I can show the impact on performance as the number of rows in the test table increases. The table is based on the HumanResources.EmployeePayHistory table and can be generated with the script attached to this post (a minimal sketch of that script follows the list below).

  • To create the table with 316 rows, comment out the second INSERT statement entirely
  • To create the table with 11,916 rows, change the GO for the second INSERT to be GO 40
  • To create the table with 116,316 rows, change the GO for the second INSERT to be GO 400
  • To create the table with 1,160,316 rows, change the GO for the second INSERT to be GO 4000
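
For reference, here is a minimal sketch of what that setup script does. This is my reconstruction under stated assumptions, not the attached script itself: the random RateChangeDate offset and the placeholder Rate and PayFrequency values are assumptions. What matters for the tests is only the row count; each execution of the second INSERT adds one row per employee (290 rows), which is consistent with the totals above.

USE AdventureWorks2012;
GO

/* First INSERT (sketch): seed the enlarged table with the 316 base rows. */
SELECT *
INTO HumanResources.EmployeePayHistoryEnlarged
FROM HumanResources.EmployeePayHistory;
GO

/* Second INSERT (sketch): each execution adds one history row per employee.
   Change GO 40 to GO 400 or GO 4000 to scale the table as described above. */
INSERT INTO HumanResources.EmployeePayHistoryEnlarged
    (BusinessEntityID, RateChangeDate, Rate, PayFrequency, ModifiedDate)
SELECT emp.BusinessEntityID,
    DATEADD(MINUTE, -(ABS(CHECKSUM(NEWID())) % 5256000), GETDATE()), /* random date in the past ~10 years (assumption) */
    10.00, /* placeholder rate (assumption) */
    1,     /* placeholder pay frequency (assumption) */
    GETDATE()
FROM HumanResources.Employee AS emp;
GO 40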

With my table created, I’ll create a view to return the current pay value for the employees as follows:

CREATE VIEW dbo.EmployeeCurrentPayData
AS

WITH PayChanges AS(
SELECT
    per.BusinessEntityID,
    per.FirstName,
    per.LastName,
    per.MiddleName,
    emp.JobTitle,
    pay.Rate,
    pay.PayFrequency,
    ROW_NUMBER() OVER (PARTITION BY per.BusinessEntityID, JobTitle, PayFrequency ORDER BY pay.RateChangeDate DESC) AS RevisionID
FROM Person.Person AS per
INNER JOIN HumanResources.Employee AS emp
    ON per.BusinessEntityID = emp.BusinessEntityID
LEFT JOIN HumanResources.EmployeePayHistoryEnlarged AS pay
    ON emp.BusinessEntityID = pay.BusinessEntityID)
SELECT
    BusinessEntityID,
    FirstName,
    LastName,
    MiddleName,
    JobTitle,
    Rate,
    PayFrequency
FROM PayChanges
WHERE RevisionID = 1;

GO

Then I’ll create a stored procedure that the application can use to get the current pay value for a specific employee by BusinessEntityID.

CREATE PROCEDURE dbo.GetEmployeeCurrentPay_View (@BusinessEntityID INT)
AS
BEGIN

SELECT
    BusinessEntityID,
    FirstName,
    LastName,
    MiddleName,
    JobTitle,
    Rate,
    PayFrequency
FROM dbo.EmployeeCurrentPayData
WHERE BusinessEntityID = @BusinessEntityID;

END
GO

If I run the stored procedure for any employee’s BusinessEntityID, the result will always be a single row.

DECLARE @BusinessEntityID INT = 250;
EXECUTE dbo.GetEmployeeCurrentPay_View @BusinessEntityID;

[Figure: single-row result set]

For a table with 316 rows of pay history, the execution plan will have the following shape:

[Figure: execution plan shape for 316 rows]

The execution plan shows that the tables were all scanned and the CTE was computed before the filter predicate on BusinessEntityID was applied. I can see this in the tool-tip for the Filter operator and by the number of input rows versus output rows for the Filter.

[Figure: Filter operator tool-tip showing input versus output rows]

With a small number of rows, this doesn’t show up as a significant problem, as shown by the STATISTICS IO and STATISTICS TIME output for the execution.
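
For completeness, these statistics were captured with the standard session options shown below; the parameter value matches the earlier example:

SET STATISTICS IO ON;
SET STATISTICS TIME ON;

DECLARE @BusinessEntityID INT = 250;
EXECUTE dbo.GetEmployeeCurrentPay_View @BusinessEntityID;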

SQL Server parse and compile time:
   CPU time = 14 ms, elapsed time = 14 ms.

Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Person'. Scan count 0, logical reads 897, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Employee'. Scan count 1, logical reads 9, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'EmployeePayHistoryEnlarged'. Scan count 1, logical reads 4, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 4 ms.

The total execution time was just 4ms with 316 rows. If I scale the test table up to 11,916 rows, the plan changes to use a Merge Join with the EmployeePayHistoryEnlarged table, but I can still see that the CTE was fully computed before the filtering on BusinessEntityID was applied.

[Figure: execution plan for 11,916 rows, using a Merge Join]

Performance-wise, the stored procedure executes in 85ms and uses only 31ms of CPU time, as shown by the execution statistics.

SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 9 ms.

Table 'EmployeePayHistoryEnlarged'. Scan count 1, logical reads 86, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Person'. Scan count 0, logical reads 897, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Employee'. Scan count 1, logical reads 9, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 31 ms,  elapsed time = 85 ms.
 

The real problems start to occur when the number of rows is increased to 116,316. The increased cost of computing the entire CTE results in an execution plan that uses parallelism.

[Figure: parallel execution plan for 116,316 rows]

Testing this stored procedure on its own wouldn’t make you think that a performance problem was lying in wait for a production system. Looking at the execution statistics, the procedure executed in 240ms, but the important item to note is the shift in the proportion of CPU time. Since the query used parallelism, CPU time is now significantly higher than elapsed time, by a factor of roughly three on the 4 vCPU VM I tested this on, because multiple threads are processing the execution.

SQL Server parse and compile time:
   CPU time = 16 ms, elapsed time = 16 ms.

Table 'Employee'. Scan count 25, logical reads 22, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Person'. Scan count 25, logical reads 319, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'EmployeePayHistoryEnlarged'. Scan count 25, logical reads 949, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 643 ms,  elapsed time = 240 ms.

The problem with this specific procedure is that it doesn’t just get executed once; it gets executed thousands of times every couple of seconds, so the cumulative CPU time required to fully compute the CTE becomes significant enough to run a 32-core server at nearly 100% CPU utilization, as shown in the images at the top of the post.
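
As an aside, this kind of cumulative CPU consumer is easy to spot with sys.dm_exec_procedure_stats. This is a triage sketch of my own, not part of the original troubleshooting narrative:

/* Top procedures by cumulative CPU (worker time) since they were cached */
SELECT TOP (10)
    OBJECT_NAME(ps.object_id, ps.database_id) AS procedure_name,
    ps.execution_count,
    ps.total_worker_time,
    ps.total_worker_time / ps.execution_count AS avg_worker_time
FROM sys.dm_exec_procedure_stats AS ps
ORDER BY ps.total_worker_time DESC;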

The problem only gets worse as the number of rows in the table increases, as shown by the execution plan and execution statistics for 1,160,316 rows.

Note: Depending on the hardware the tests are run on, the plan might differ. For example, the first plan below was generated with 4 vCPUs, while the second plan, using the Hash Join and Bitmap optimization, was generated on the same server but with the -Pn startup parameter to start SQLOS with 24 logical schedulers, as shown in my post SQL Server and Soft NUMA. In either case parallelism is used, and the CPU time is roughly three times the elapsed time.

[Figure: execution plan for 1,160,316 rows (4 vCPUs)]

SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 0 ms.

Table 'EmployeePayHistoryEnlarged'. Scan count 290, logical reads 7416, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Person'. Scan count 0, logical reads 923, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Employee'. Scan count 5, logical reads 22, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 4336 ms,  elapsed time = 1788 ms.

[Figure: execution plan for 1,160,316 rows (24 logical schedulers, Hash Join with Bitmap)]

SQL Server parse and compile time:
   CPU time = 23 ms, elapsed time = 23 ms.

Table 'Employee'. Scan count 25, logical reads 22, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Person'. Scan count 25, logical reads 319, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'EmployeePayHistoryEnlarged'. Scan count 290, logical reads 7656, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Worktable'. Scan count 0, logical reads 0, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 6209 ms,  elapsed time = 2235 ms.

How do you fix this?

Fixing this problem is actually very simple: don’t use the view, and move the WHERE clause on the BusinessEntityID column up into the CTE so that it can filter the tables before the computation occurs.

CREATE PROCEDURE dbo.GetEmployeeCurrentPay_NoView (@BusinessEntityID INT)
AS
BEGIN

WITH PayChanges AS(
SELECT
    per.BusinessEntityID,
    per.FirstName,
    per.LastName,
    per.MiddleName,
    emp.JobTitle,
    pay.Rate,
    pay.PayFrequency,
    ROW_NUMBER() OVER (PARTITION BY per.BusinessEntityID, JobTitle, PayFrequency ORDER BY pay.RateChangeDate DESC) AS RevisionID
FROM Person.Person AS per
INNER JOIN HumanResources.Employee AS emp
    ON per.BusinessEntityID = emp.BusinessEntityID
LEFT JOIN HumanResources.EmployeePayHistoryEnlarged AS pay
    ON emp.BusinessEntityID = pay.BusinessEntityID
WHERE per.BusinessEntityID = @BusinessEntityID)
SELECT
    BusinessEntityID,
    FirstName,
    LastName,
    MiddleName,
    JobTitle,
    Rate,
    PayFrequency
FROM PayChanges
WHERE RevisionID = 1;

END
GO

By pushing the WHERE clause up into the CTE, with 316 rows we get an execution plan showing that only a couple of rows are returned from the tables during execution.

[Figure: execution plan with the filter pushed into the CTE]

If I look at the tool-tip for the Clustered Index Seek on the EmployeePayHistoryEnlarged table, I’ll see the Seek Predicate being applied to the BusinessEntityID column to filter the results.

[Figure: Clustered Index Seek tool-tip showing the Seek Predicate]

The execution statistics for the rewritten procedure also show the reduction in logical reads and execution time.

SQL Server parse and compile time:
   CPU time = 4 ms, elapsed time = 4 ms.

Table 'EmployeePayHistoryEnlarged'. Scan count 1, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Person'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Employee'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 4 ms.

I also get plan stability with this rewrite, regardless of the number of rows in the EmployeePayHistoryEnlarged table. At 1,160,316 rows the execution plan is still the same, except that more rows are returned from EmployeePayHistoryEnlarged to identify the most recent pay rate.

[Figure: execution plan at 1,160,316 rows with the rewrite]

The execution statistics also show the reduction in logical reads, and the lower CPU time and execution time for the procedure.

SQL Server parse and compile time:
   CPU time = 0 ms, elapsed time = 0 ms.

Table 'EmployeePayHistoryEnlarged'. Scan count 1, logical reads 25, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Person'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Employee'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 16 ms,  elapsed time = 152 ms.

What Was Actually the Problem?

Since there are three different coding constructs at play here, you are probably wondering what exactly the root problem was. If you weren’t wondering that, you aren’t alone; I wasn’t either, but after reviewing the article my colleague Joe Sack pointed out that it was unclear where the problem actually was. I have to agree with Joe; so far I haven’t really shown definitively where the problem is. Is it the fact that I have a CTE within a view? Let’s find out. First I’ll create a new view that contains only a CTE, but still provides the same results as the original test query, using a sub-query to determine the MAX(RateChangeDate) from the EmployeePayHistoryEnlarged table for each BusinessEntityID and PayFrequency.

/* Drop the view before creating it */
IF OBJECT_ID(N'EmployeeCurrentPayData_NoWF') IS NOT NULL
BEGIN
    DROP VIEW dbo.EmployeeCurrentPayData_NoWF;
END
GO

/* Create the problem view for the tests */
CREATE VIEW dbo.EmployeeCurrentPayData_NoWF
AS
WITH PayChanges AS(
SELECT
    per.BusinessEntityID,
    per.FirstName,
    per.LastName,
    per.MiddleName,
    emp.JobTitle,
    pay.Rate,
    pay.PayFrequency
FROM Person.Person AS per
INNER JOIN HumanResources.Employee AS emp
    ON per.BusinessEntityID = emp.BusinessEntityID
LEFT JOIN HumanResources.EmployeePayHistoryEnlarged AS pay
    ON emp.BusinessEntityID = pay.BusinessEntityID
WHERE pay.RateChangeDate = (SELECT MAX(pay2.RateChangeDate)
                            FROM HumanResources.EmployeePayHistoryEnlarged AS pay2
                            WHERE pay.BusinessEntityID = pay2.BusinessEntityID
                              AND pay.PayFrequency = pay2.PayFrequency)
)
SELECT
    BusinessEntityID,
    FirstName,
    LastName,
    MiddleName,
    JobTitle,
    Rate,
    PayFrequency
FROM PayChanges;
GO

/* Drop the stored procedure using the view before creating it */
IF OBJECT_ID(N'GetEmployeeCurrentPay_ViewNoWF') IS NOT NULL
BEGIN
    DROP PROCEDURE dbo.GetEmployeeCurrentPay_ViewNoWF;
END
GO

/* Create the stored procedure that uses the view to test the problem */
CREATE PROCEDURE dbo.GetEmployeeCurrentPay_ViewNoWF (@BusinessEntityID INT)
AS
BEGIN
    SELECT
        BusinessEntityID,
        FirstName,
        LastName,
        MiddleName,
        JobTitle,
        Rate,
        PayFrequency
    FROM dbo.EmployeeCurrentPayData_NoWF
    WHERE BusinessEntityID = @BusinessEntityID;
END
GO

Now for a simple test: running the new procedure and reviewing the execution plan and statistics for 1,160,316 rows:

[Figure: execution plan for the CTE-only view, using index seeks]

SQL Server parse and compile time:
   CPU time = 12 ms, elapsed time = 12 ms.

Table 'Employee'. Scan count 0, logical reads 2, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'EmployeePayHistoryEnlarged'. Scan count 1, logical reads 6, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.
Table 'Person'. Scan count 0, logical reads 3, physical reads 0, read-ahead reads 0, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

 SQL Server Execution Times:
   CPU time = 0 ms,  elapsed time = 102 ms.

I think it’s pretty apparent that using a CTE in a view isn’t the problem. Since I now know the issue isn’t the CTE within the view, how about the window function inside of the CTE? To test this, I can take the view definition, add a filter on BusinessEntityID to the query against the CTE, and look at the resulting plan and performance:

DECLARE @BusinessEntityID INT = 250;

WITH PayChanges AS(
SELECT
    per.BusinessEntityID,
    per.FirstName,
    per.LastName,
    per.MiddleName,
    emp.JobTitle,
    pay.Rate,
    pay.PayFrequency,
    ROW_NUMBER() OVER (PARTITION BY per.BusinessEntityID, JobTitle, PayFrequency ORDER BY pay.RateChangeDate DESC) AS RevisionID
FROM Person.Person AS per
INNER JOIN HumanResources.Employee AS emp
    ON per.BusinessEntityID = emp.BusinessEntityID
LEFT JOIN HumanResources.EmployeePayHistoryEnlarged AS pay
    ON emp.BusinessEntityID = pay.BusinessEntityID)
SELECT
    BusinessEntityID,
    FirstName,
    LastName,
    MiddleName,
    JobTitle,
    Rate,
    PayFrequency
FROM PayChanges
WHERE RevisionID = 1
  AND BusinessEntityID = @BusinessEntityID;

[Figure: execution plan with the filter applied outside the CTE, matching the original problem plan]

Since I get the same plan for 1,160,316 rows as in my original problem, it’s clear the issue is that the CTE has to compute the entire result set before it can filter on BusinessEntityID. Knowing this, why would I even mention views in this post? It’s quite simple: the reason I generally see problems like this is that the developer or DBA wanted to simplify the code across multiple objects rather than repeat the same CTE, fully expanded, in each of the objects that needed to access the data this way. This is the same reason that user-defined functions get used as much as they do, despite being slower than set-based solutions that require more code in each of the objects that would otherwise have used the function for simplicity.

I could also test a window function inside a view, but I’ve already shown above that the problem is actually the window function inside the CTE and where the predicate evaluation is applied.
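
If the underlying goal is code reuse without the filtering penalty, one alternative worth considering is an inline table-valued function. To be clear, this is my sketch and not something tested in this post; it parameterizes the same query so the BusinessEntityID predicate is part of the definition the optimizer expands inline:

CREATE FUNCTION dbo.GetEmployeeCurrentPay_TVF (@BusinessEntityID INT)
RETURNS TABLE
AS
RETURN
WITH PayChanges AS(
SELECT
    per.BusinessEntityID,
    per.FirstName,
    per.LastName,
    per.MiddleName,
    emp.JobTitle,
    pay.Rate,
    pay.PayFrequency,
    ROW_NUMBER() OVER (PARTITION BY per.BusinessEntityID, emp.JobTitle, pay.PayFrequency ORDER BY pay.RateChangeDate DESC) AS RevisionID
FROM Person.Person AS per
INNER JOIN HumanResources.Employee AS emp
    ON per.BusinessEntityID = emp.BusinessEntityID
LEFT JOIN HumanResources.EmployeePayHistoryEnlarged AS pay
    ON emp.BusinessEntityID = pay.BusinessEntityID
WHERE per.BusinessEntityID = @BusinessEntityID)
SELECT
    BusinessEntityID,
    FirstName,
    LastName,
    MiddleName,
    JobTitle,
    Rate,
    PayFrequency
FROM PayChanges
WHERE RevisionID = 1;
GO

-- Example usage:
-- SELECT * FROM dbo.GetEmployeeCurrentPay_TVF(250);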

Summary

Using a CTE inside of a view isn’t the actual performance problem; the cause is the window function inside the CTE, with the filtering applied only after the window function has been computed over the full result set. The desire to use a view to simplify code is what typically leads to this coding pattern in the first place. By removing the view and pushing the WHERE clause up into the CTE definition, only the required rows are read and computed, reducing the I/O and CPU load on the server and making the code much more scalable.

This is the first in a series of posts showing some of the problems I come across while solving performance issues for our clients.

Performance Tuning SQL Server on Windows Server 2012

Properly configuring Windows and SQL Server to get the best performance from your server hardware is an important task for database administrators. There is a lot of information available online with different recommendations about how to configure your servers for the best performance. The challenge is knowing which recommendations are correct and which advice to follow when setting up a new server, or when tuning the configuration of an existing one.

Last week I received an email from a member of the community who had attended a user group meeting where the presenter was talking about configuration options for SQL Server on Windows Server 2012. The source of the information being presented was a Microsoft whitepaper titled Performance Tuning Guidelines for Windows Server 2012, which has a section titled “Performance Tuning for OLTP Workloads”. While the document mentions the TPC-E benchmark from the Transaction Processing Performance Council, what it does not make clear is that the recommendations in this section DO NOT apply to the majority of SQL Server installations in the world; they are strictly for servers trying to achieve a high score on the benchmark. The same section existed in the whitepaper Performance Tuning Guidelines for Windows Server 2008 R2, but under the title “Performance Tuning for TPC-E Workload”, which is a more accurate indication of what it applies to.

If you want to know why this paper doesn’t apply to most workloads, read on.  If you want to know how to configure a SQL Server with general best practices, jump down to the “How should I really configure SQL?” section at the end of this post.

Tuner Specials

The “SQL Server Tunings for OLTP Workloads” section in the whitepaper contains a number of items that I refer to as tuner specials, because they exist for benchmarks like TPC-C and TPC-E and are documented only because the benchmarks require full disclosure of the configuration before results can be published. Most of the tuner specials can be found in KB article 920093. The startup configuration options mentioned in the whitepaper include:

  • -x: Disable SQL Server perfmon counters
  • -T661: Disable the ghost record removal process
  • -T834: Use Microsoft Windows large-page allocations for the buffer pool
  • -T652: Disable page-prefetching scans
  • -T8744: Disallow prefetch for ranges

Of these, the only one that applies to general systems running SQL Server is -T834, which uses Windows large-page allocations for the buffer pool on servers with large amounts of RAM installed. Even this trace flag has some important considerations around its usage, since the buffer pool has to allocate its total size at startup from contiguous memory. If a contiguous allocation is not possible, the instance tries progressively smaller allocations until it finds a contiguous memory region to allocate from. This can significantly increase the instance startup time, and it is explained in further detail by Bob Ward in his blog post SQL Server and Large Pages Explained.
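
As a quick verification sketch (my addition, not from the whitepaper): DBCC TRACESTATUS lists the trace flags active on an instance, and sys.dm_os_process_memory exposes large_page_allocations_kb to confirm whether large-page allocations are actually in use.

/* List all trace flags currently enabled on the instance */
DBCC TRACESTATUS(-1);

/* Check whether any memory was allocated with large pages */
SELECT large_page_allocations_kb
FROM sys.dm_os_process_memory;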

While there are some workloads where using all these startup options can be beneficial, they don’t apply to general SQL Server workloads.

Priority Boost?

The most interesting thing I found in the whitepaper was the recommendation to set the ‘priority boost’ value to 1 with sp_configure. This goes against Microsoft’s own recommendations for this configuration option. When I am doing a health check of a server for a client, this configuration option is a big, red, flashing light: when I see it set to 1, it usually means I’m going to find a number of other incorrectly configured options. Additionally, this option has been deprecated since SQL Server 2008 R2 and will be removed in a future release of the product. It is used for TPC-E to gain a slight advantage on a server where all unnecessary background services have been disabled, SQL Server is the only thing running, and the goal is strictly to achieve the best benchmark result.
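
If you want to check for this setting during your own health checks, here is a minimal sketch using standard sp_configure calls; ‘priority boost’ is an advanced option, and 0 is the documented default:

EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'priority boost';    /* inspect the current run_value */
EXEC sp_configure 'priority boost', 0; /* revert to the default */
RECONFIGURE;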

How should I really configure SQL?

My colleague Glenn Berry (Blog | Twitter) knows a lot about performance tuning SQL Server hardware configurations and about how to provision and configure a new SQL Server instance, which he shared in his three-part series on SimpleTalk (Part 1 | Part 2 | Part 3) last year, and in a new Pluralsight online course this year. If you’re looking for a guide to setting up a new SQL Server, I’d start there. I published my own checklist for new server configuration back in 2010, but upon reviewing some of those recommendations I plan to publish a revised checklist in a future post (for example, today I’d leave hyper-threading turned on and set ‘max degree of parallelism’ based on the NUMA configuration and workload).
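
For the ‘max degree of parallelism’ point, here is a sketch of how I might inspect the NUMA layout before choosing a value; the right setting is workload-dependent, and the 8 below is only an illustrative example:

/* Count the visible online schedulers per NUMA node */
SELECT parent_node_id, COUNT(*) AS schedulers_per_node
FROM sys.dm_os_schedulers
WHERE status = N'VISIBLE ONLINE'
GROUP BY parent_node_id;

/* 'max degree of parallelism' is an advanced option */
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'max degree of parallelism', 8; /* example value only */
RECONFIGURE;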

We’ve been helping a lot of clients with upgrading to SQL Server 2012 on Windows Server 2012 – if you need help, drop us a line.