In the previous post, I insinuated that allowing LAG/LEAD to be sensitive to value RANGEs might help with series of data with missing values. It won't do that unless there's some logic that allows the offset that LAG/LEAD uses to be calculated on a row-by-row basis (i.e. LAG/LEAD uses an expression based on the window values). That's a little too much to expect; notice I've updated the original post to back off a bit on that one. I'll need my densified data for that.
However, a common scenario that would be helped by an expansion of use of RANGE in the window specification is rolling totals/rolling balances. Here's an example.
Suppose I wanted to calculate a 3-month rolling average (or rolling total) of sales on a per-employee basis. Looking at an employee who doesn't make at least one sale every month, I get a skewed total/average. The totals don't reflect the last three calendar months, but instead they reflect the last three months in which the employee had a nonzero sales total. Skews the figures upward a little….
SELECT Employee,
[Year] ,
[Month] ,
EmployeeTotal AS SalesThisMonth,
COUNT(*) OVER (PARTITION BY Employee ORDER BY [Year], [Month] ROWS 2 PRECEDING) AS NumberOfMonths,
SUM(EmployeeTotal) OVER (PARTITION BY Employee ORDER BY [Year], [Month] ROWS 2 PRECEDING) AS ThreeMonthTotal,
AVG(EmployeeTotal) OVER (PARTITION BY Employee ORDER BY [Year], [Month] ROWS 2 PRECEDING) AS ThreeMonthAverage
FROM EmployeeSalesByMonth
WHERE Employee = 272
ORDER BY Employee, [Year], [Month];
SalesThisMonth Number ThreeMonthTotal ThreeMonthAverage
272 2005 8 20544.7015 1 20544.7015 20544.7015
272 2005 9 2039.994 2 22584.6955 11292.3477
272 2005 11 6341.551 3 28926.2465 9642.0821
272 2006 2 61206.4782 3 69588.0232 23196.0077
272 2006 3 18307.746 3 85855.7752 28618.5917
272 2006 4 33406.7043 3 112920.9285 37640.3095
Suppose I had a way to specify a time interval and use RANGE instead of ROWS. Something like: "RANGE BETWEEN '2 MONTHS' PRECEEDING AND CURRENT ROW" where '2 MONTHS' represents a time-interval data type. Then I'd get the "non-skewed" answer. Currently SQL Server Denali doesn't even allow that form the RANGE specification (RANGE BETWEEN <unsigned value specification> PRECEDING) with any data type currently. Instead, I can use the densified data (see previous post) with the rows window and get that "non-skewed" answer.
WITH SalesByMonth AS
(
SELECT Employee,
DATEPART(yyyy,dt) as Year,
DATEPART(MONTH,dt) as Month,
ISNULL(EmployeeTotal, 0) as EmployeeTotal
FROM #allMonths m
OUTER APPLY (
SELECT EmployeeTotal
FROM EmployeeSalesByMonth e
WHERE m.Employee = e.Employee
AND DATEPART(yyyy,dt) = [Year]
AND DATEPART(mm,dt) = [Month]
) AS t
)
SELECT Employee,
[Year] ,
[Month] ,
EmployeeTotal AS SalesThisMonth,
COUNT(*) OVER (PARTITION BY Employee ORDER BY [Year], [Month] ROWS 2 PRECEDING) AS NumberOfMonths,
SUM(EmployeeTotal) OVER (PARTITION BY Employee ORDER BY [Year], [Month] ROWS 2 PRECEDING) AS ThreeMonthTotal,
AVG(EmployeeTotal) OVER (PARTITION BY Employee ORDER BY [Year], [Month] ROWS 2 PRECEDING) AS ThreeMonthAverage
FROM SalesByMonth
WHERE Employee = 272
ORDER BY Employee,
[Year],
[Month];
GO
SalesThisMonth Number ThreeMonthTotal ThreeMonthAverage
272 2005 7 0.00 1 0.00 0.00
272 2005 8 20544.7015 2 20544.7015 10272.3507
272 2005 9 2039.994 3 22584.6955 7528.2318
272 2005 10 0.00 3 22584.6955 7528.2318
272 2005 11 6341.551 3 8381.545 2793.8483
Notice, however, that there's another subtle change in the answer. In choosing to start the time series exactly on 2005-07 (rather than the month when each employee had his/her first sale), the averages for the first couple of months for this employee are a little low. We can account for this, if we want, by basing the #allmonths table on each employee's hire date, for example. Yet another instance of "always pay attention to what you're actually measuring".
@bobbeauch
Sorry, comments are closed for this post.