I'm always leery when I hear people say "statistics show that…" followed by whatever their opinion is. Scientists do it. And your users probably do it too. I worked with a product called SAS once, on statistics for response time. Got some lovely reports and statistics, *from which other folks draw conclusions*. It's important to step back every once in a while and make sure you're measuring what you think you're measuring, and that it really does add up. I've made that into kind of a koan: "I don't want to hear your conclusion, I want to see your raw data." And, if I'm really interested, I'll do the math "by hand". Amazingly (well, maybe not so amazingly), the conclusion doesn't always jibe with the raw data and/or with what folks *think* they're measuring.
End of rant.
So SQL Server Denali CTP3 added some new analytical functions: PERCENT_RANK, CUME_DIST, PERCENTILE_DISC, and PERCENTILE_CONT. The first three I have no problem with. The Denali BOL describes how they work, and I can confirm the descriptions using sample data. Good. The PERCENTILE_CONT function, though…that's a little different.
BTW, there is a slight typo in BOL for PERCENT_RANK. It says "The range of values returned by PERCENT_RANK is greater than 0 and less than or equal to 1." Actually, PERCENT_RANK can be, and in their own example sometimes is, *equal to 0*. It's CUME_DIST that is greater than 0 and less than or equal to 1. Fine.
For PERCENTILE_CONT, the SQL Server BOL says: "Calculates a percentile based on a continuous distribution of the column value… The result is interpolated and might not be equal to any of the specific values in the column." That's sufficiently vague; interpolated how? And "might not be equal to any of the specific values in the column"? That's as opposed to PERCENTILE_DISC, which is always equal to one of the specific values of the column, by definition.
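To make that concrete: PERCENTILE_DISC(P) simply returns the first value, in sort order, whose cumulative distribution (CUME_DIST) is greater than or equal to P. A minimal sketch against a throwaway temp table (the table and values here are invented purely for illustration):

```sql
-- Hypothetical data, just to show the "discrete" behavior.
CREATE TABLE #rates (rate money);
INSERT #rates VALUES (10.00), (20.00), (30.00), (40.00);

-- CUME_DIST here is 0.25, 0.5, 0.75, 1. The first value whose CUME_DIST
-- is >= 0.5 is 20.00, so PERCENTILE_DISC(0.5) returns 20.00,
-- an actual value from the column.
SELECT DISTINCT
    PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY rate) OVER () AS MedianDisc
FROM #rates;

DROP TABLE #rates;
```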
I've also read, in two or three places on the web, that "PERCENTILE_CONT(X) examines the percent_rank of values in a group until it finds one greater than or equal to X." Nope. Not true. That doesn't jibe with "might not be equal to any of the specific values in the column".
So, I'm intrigued. And, last night, I tried to figure it out by running a variation of the BOL sample query.
USE AdventureWorks2008R2;
SELECT Name AS DepartmentName
    ,ph.Rate
    ,PERCENT_RANK() OVER (PARTITION BY Name ORDER BY ph.Rate) AS Percent_Rank
    ,PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ph.Rate)
        OVER (PARTITION BY Name) AS MedianCont
    ,CUME_DIST() OVER (PARTITION BY Name ORDER BY ph.Rate) AS Cume_Dist
    ,PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY ph.Rate)
        OVER (PARTITION BY Name) AS MedianDisc
FROM HumanResources.Department AS d
INNER JOIN HumanResources.EmployeeDepartmentHistory AS dh
    ON dh.DepartmentID = d.DepartmentID
INNER JOIN HumanResources.EmployeePayHistory AS ph
    ON ph.BusinessEntityID = dh.BusinessEntityID
WHERE dh.EndDate IS NULL;
Let's look at a few groups in the raw data (the columns that matter most here are Rate, Percent_Rank, and MedianCont):

DepartmentName  Rate      Percent_Rank        MedianCont  Cume_Dist           MedianDisc
--------------  --------  ------------------  ----------  ------------------  ----------
Executive       39.06     0                   54.32695    0.25                48.5577
Executive       48.5577   0.333333333333333   54.32695    0.5                 48.5577
Executive       60.0962   0.666666666666667   54.32695    0.75                48.5577
Executive       125.50    1                   54.32695    1                   48.5577
Tool Design     8.62      0                   25          0.166666666666667   25.00
Tool Design     23.72     0.2                 25          0.333333333333333   25.00
Tool Design     25.00     0.4                 25          0.666666666666667   25.00
Tool Design     25.00     0.4                 25          0.666666666666667   25.00
Tool Design     28.8462   0.8                 25          0.833333333333333   25.00
Tool Design     29.8462   1                   25          1                   25.00
For "Executive", the value is interpolated, as the BOL says; it's not a value in the set. For "Tool Design", the MedianCont (PERCENTILE_CONT(0.5)) is 25, a value in the set, but not a value where Percent_Rank is > 0.5 (it's 0.4). Oh. So I ripped out the rows that produced the resultset and played with them for a while, adding new rows to each set and watching how the results changed. No joy. After yet another web search (I DID use web search, I swear, but missed this one the first 10 times), I came across this in the Oracle Database SQL Reference:
"The result of PERCENTILE_CONT is computed by linear interpolation between values after ordering them. Using the percentile value (P) and the number of rows (N) in the aggregation group, we compute the row number we are interested in after ordering the rows with respect to the sort specification. This row number (RN) is computed according to the formula RN = 1 + (P*(N-1)). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).
The final result will be:
If (CRN = FRN = RN) then the result is
(value of expression from row at RN)
Otherwise the result is
(CRN - RN) * (value of expression for row at FRN) +
(RN - FRN) * (value of expression for row at CRN)"
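Oracle's recipe translates fairly directly back into T-SQL, which makes a nice sanity check. Here's a sketch (the CTE and column names are mine, not anything official), which folds Oracle's two cases into the equivalent single expression v(FRN) + (RN - FRN) * (v(CRN) - v(FRN)) — when RN is a whole number, FRN = CRN = RN and the second term drops out. It recomputes the 0.5 percentile per department over the same joins as the sample query:

```sql
-- Sketch: recompute PERCENTILE_CONT(0.5) per department from the
-- RN/CRN/FRN formula, using ROW_NUMBER and a windowed COUNT.
DECLARE @p float = 0.5;

WITH ordered AS (
    SELECT d.Name,
           ph.Rate,
           ROW_NUMBER() OVER (PARTITION BY d.Name ORDER BY ph.Rate) AS rowno,
           -- RN = 1 + (P * (N - 1)) per partition
           1 + @p * (COUNT(*) OVER (PARTITION BY d.Name) - 1) AS rn
    FROM HumanResources.Department AS d
    INNER JOIN HumanResources.EmployeeDepartmentHistory AS dh
        ON dh.DepartmentID = d.DepartmentID
    INNER JOIN HumanResources.EmployeePayHistory AS ph
        ON ph.BusinessEntityID = dh.BusinessEntityID
    WHERE dh.EndDate IS NULL
)
SELECT f.Name,
       -- v(FRN) + (RN - FRN) * (v(CRN) - v(FRN))
       f.Rate + (f.rn - FLOOR(f.rn)) * (c.Rate - f.Rate) AS MyMedianCont
FROM ordered AS f
INNER JOIN ordered AS c
    ON c.Name = f.Name
   AND c.rowno = CEILING(f.rn)
WHERE f.rowno = FLOOR(f.rn);
```

You can eyeball MyMedianCont against the built-in MedianCont from the sample query; they should line up for every department.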
SQL Server's numbers do agree with Oracle's for the "demo" (that's scott/tiger) database examples. I tried them (yes, I really have a "demo" database on Denali; it's lamer than even pubs, but Oracle has since replaced it with more robust sample DBs). So let's work these two out.
For "Tool Design":
DECLARE @p float = 0.5;
DECLARE @n int = 6;
SELECT 1 + (@p*(@n-1));           -- 3.5
SELECT CEILING(1 + (@p*(@n-1)));  -- 4
SELECT FLOOR(1 + (@p*(@n-1)));    -- 3
-- value at row 3 = 25, value at row 4 = 25
SELECT .5*25 + .5*25;             -- 25
25 is the right answer.
For "Executive":
DECLARE @p float = 0.5;
DECLARE @n int = 4;
SELECT 1 + (@p*(@n-1));           -- 2.5
SELECT CEILING(1 + (@p*(@n-1)));  -- 3
SELECT FLOOR(1 + (@p*(@n-1)));    -- 2
-- value at row 2 = 48.5577, value at row 3 = 60.0962
SELECT .5*48.5577 + .5*60.0962;   -- 54.32695
54.32695 is the right answer.
Got it? If you think you have it, run the sample query and work out the "Finance" department. Then (if you're up late like I was; I worked out most, if not all, of the groups in the sample query) go to sleep. You've earned it. And…you may not ever use PERCENTILE_CONT in your day-to-day work.
[Rant on] To reiterate: when looking at reports that contain statistical calculations, make sure you know what each statistic means. And check the report to make sure it DOES mean that, especially if you think the numbers look "weird" (i.e., don't jibe with the description). We aren't all rocket scientists. And rocket scientists may not have a background in statistics anyhow.
But I'll bet they (like everyone) have opinions. And are hankerin' to "prove" them.
@bobbeauch