Window clause, range specification, and missing values

In the previous post, I insinuated that allowing LAG/LEAD to be sensitive to value RANGEs might help with series of data with missing values. It won't do that unless there's some logic that allows the offset that LAG/LEAD uses to be calculated on a row-by-row basis (i.e. LAG/LEAD uses an expression based on the window values). That's a little too much to expect; notice I've updated the original post to back off a bit on that one. I'll need my densified data for that.

However, a common scenario that would be helped by an expansion of use of RANGE in the window specification is rolling totals/rolling balances. Here's an example.

Suppose I wanted to calculate a 3-month rolling average (or rolling total) of sales on a per-employee basis. Looking at an employee who doesn't make at least one sale every month, I get a skewed total/average. The totals don't reflect the last three calendar months, but instead they reflect the last three months in which the employee had a nonzero sales total. Skews the figures upward a little….

SELECT  Employee,
        [Year] ,
        [Month] ,
        EmployeeTotal AS SalesThisMonth,
        COUNT(*) OVER (PARTITION BY Employee ORDER BY [Year], [Month] ROWS 2 PRECEDING) AS NumberOfMonths,
        SUM(EmployeeTotal) OVER (PARTITION BY Employee ORDER BY [Year], [Month] ROWS 2 PRECEDING) AS ThreeMonthTotal,
        AVG(EmployeeTotal) OVER (PARTITION BY Employee ORDER BY [Year], [Month] ROWS 2 PRECEDING) AS ThreeMonthAverage
FROM EmployeeSalesByMonth
WHERE Employee = 272
ORDER BY Employee, [Year], [Month];

                        SalesThisMonth Number   ThreeMonthTotal ThreeMonthAverage
272     2005    8       20544.7015      1       20544.7015      20544.7015
272     2005    9       2039.994          2       22584.6955      11292.3477  
272     2005    11     6341.551          3       28926.2465      9642.0821
272     2006    2       61206.4782      3       69588.0232      23196.0077
272     2006    3       18307.746        3       85855.7752      28618.5917
272     2006    4       33406.7043      3       112920.9285     37640.3095

Suppose I had a way to specify a time interval and use RANGE instead of ROWS. Something like: "RANGE BETWEEN '2 MONTHS' PRECEEDING AND CURRENT ROW" where '2 MONTHS' represents a time-interval data type. Then I'd get the "non-skewed" answer. Currently SQL Server Denali doesn't even allow that form the RANGE specification (RANGE BETWEEN <unsigned value specification> PRECEDING) with any data type currently. Instead, I can use the densified data (see previous post) with the rows window and get that "non-skewed" answer.

WITH SalesByMonth AS
(
SELECT Employee,
       DATEPART(yyyy,dt) as Year,
    DATEPART(MONTH,dt) as Month,
    ISNULL(EmployeeTotal, 0) as EmployeeTotal
FROM #allMonths m
OUTER APPLY (
SELECT  EmployeeTotal
FROM    EmployeeSalesByMonth e
WHERE m.Employee = e.Employee  
 AND    DATEPART(yyyy,dt) = [Year]
 AND    DATEPART(mm,dt) = [Month]
) AS t
)
SELECT  Employee,
        [Year] ,
        [Month] ,
        EmployeeTotal AS SalesThisMonth,
        COUNT(*) OVER (PARTITION BY Employee ORDER BY [Year], [Month] ROWS 2 PRECEDING) AS NumberOfMonths,
 SUM(EmployeeTotal) OVER (PARTITION BY Employee ORDER BY [Year], [Month] ROWS 2 PRECEDING) AS ThreeMonthTotal,
        AVG(EmployeeTotal) OVER (PARTITION BY Employee ORDER BY [Year], [Month] ROWS 2 PRECEDING) AS ThreeMonthAverage
FROM SalesByMonth
WHERE Employee = 272
ORDER BY Employee,
        [Year],
        [Month];
GO

                        SalesThisMonth Number   ThreeMonthTotal ThreeMonthAverage  
272     2005    7       0.00                 1       0.00                 0.00
272     2005    8       20544.7015      2       20544.7015      10272.3507
272     2005    9       2039.994          3       22584.6955      7528.2318
272     2005    10     0.00                 3       22584.6955      7528.2318
272     2005    11     6341.551          3       8381.545          2793.8483

Notice, however, that there's another subtle change in the answer. In choosing to start the time series exactly on 2005-07 (rather than the month when each employee had his/her first sale), the averages for the first couple of months for this employee are a little low. We can account for this, if we want, by basing the #allmonths table on each employee's hire date, for example. Yet another instance of "always pay attention to what you're actually measuring".

@bobbeauch
 

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.