Denali window clause followup question

So, quickly on the heels of the first window clause and last_value() question, came a followup:

OK smartie, why does last_value work fine here? I didn't have to change from the default window. What gives?

select SalesPersonID, SalesOrderID,
 first_value(SalesOrderID) over(partition by SalesPersonID order by SalesPersonID) as FirstOrderForSalesPerson,
 last_value(SalesOrderID) over(partition by SalesPersonID order by SalesPersonID) as LastOrderForSalesPerson
from Sales.SalesOrderHeader
where SalesPersonID IS NOT NULL
order by SalesPersonID, SalesOrderID

It's close to the last example, except that we're ordering by SalesPersonID instead of SalesOrderID. And, partitioning by (works like group by except that we keep all the detail rows in each group) SalesPersonID. So why DOES it produce the "right" answer for each SalesPersonID?

Remember that the query above is equivalent to:

select SalesPersonID, SalesOrderID,
 first_value(SalesOrderID) over(partition by SalesPersonID order by SalesPersonID
                                range between unbounded preceding and current row) as FirstOrderForSalesPerson,
 last_value(SalesOrderID) over(partition by SalesPersonID order by SalesPersonID
                               range between unbounded preceding and current row) as LastOrderForSalesPerson
from Sales.SalesOrderHeader
where SalesPersonID IS NOT NULL
order by SalesPersonID, SalesOrderID

And it gets the intuitive "right" answer because of the way the "range" windowing spec handles ties. Since the order by clause specifies "SalesPersonID", all the rows for the same SalesPersonID are ties. The window frame with range includes all rows with the same value (all ties) as the part of the frame. So, in the query above, SalesPersonID X has multiple (tied) rows in the window, range considers all of them.

Contrast to this, using the "rows" windowing spec, which *doesn't* consider multiple tied rows as part of the window, we're back to our "weird" answer.

select SalesPersonID, SalesOrderID,
 first_value(SalesOrderID) over(partition by SalesPersonID order by SalesPersonID
                                rows between unbounded preceding and current row) as FirstOrderForSalesPerson,
 last_value(SalesOrderID) over(partition by SalesPersonID order by SalesPersonID
                               rows between unbounded preceding and current row) as LastOrderForSalesPerson
from Sales.SalesOrderHeader
where SalesPersonID IS NOT NULL
order by SalesPersonID, SalesOrderID

A different way to prove to yourself that RANGE considers ties part of the frame and ROWS doesn't is to use the use the SUM aggregate instead of FIRST/LAST_VALUE().

— all of SumOfTotal for a specific SalesPersonID are equal with RANGE
— SumOfTotal is equal for each row, each SalesPersonID
select SalesPersonID, SalesOrderID, TotalDue,
 sum(TotalDue) over(partition by SalesPersonID order by SalesPersonID) as SumOfTotal
from Sales.SalesOrderHeader
where SalesPersonID IS NOT NULL

— you get a "running total" when using ROWS
select SalesPersonID, SalesOrderID, TotalDue,
 sum(TotalDue) over(partition by SalesPersonID order by SalesPersonID
                    rows between unbounded preceding and current row) as SumOfTotal
from Sales.SalesOrderHeader
where SalesPersonID IS NOT NULL

However, notice that the running total isn't necessarily by SalesOrderID becuase without an "order by" in the OVER clause, the spec doesn't guarentee ordering by both columns. Note for example, that in the query above, for SalesPersonID 281, SalesOrderID 48327 comes after 48370 (at least it does on my machine, without additional indexes on SalesOrderHeader table).

If you truly want running total by SalesOrderID, in SalesOrderID order, add SalesOrderID to the "ORDER BY" in the OVER clause. When you add SalesOrderID to ORDER BY, now there are no ties in ordering. So either ROWS or RANGE will do the trick.

— same answer as next query
select SalesPersonID, SalesOrderID, TotalDue,
 sum(TotalDue) over(partition by SalesPersonID order by SalesPersonID, SalesOrderID
                    rows between unbounded preceding and current row) as SumOfTotal
from Sales.SalesOrderHeader
where SalesPersonID IS NOT NULL

— same answer as previous query, uses RANGE by default
select SalesPersonID, SalesOrderID, TotalDue,
 sum(TotalDue) over(partition by SalesPersonID order by SalesPersonID, SalesOrderID) as SumOfTotal
from Sales.SalesOrderHeader
where SalesPersonID IS NOT NULL

Hope this helps, Bob

@bobbeauch

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.