I'm always leery when I hear people say "statistics show that…" followed by whatever their opinion is. Scientists do it. And your users probably do it too. I worked with a product called SAS once, on statistics for response time. Got some lovely reports and statistics, ***from which other folks drew conclusions***. It's important to step back every once in a while and make sure you're measuring what you think you're measuring, and that it really does add up. I've made that into kind of a koan: "I don't want to hear your conclusion, I want to see your raw data." And, if I'm really interested, I'll do the math "by hand". Amazingly (well, maybe not so amazingly), the conclusion doesn't always jibe with the raw data and/or with what folks ***think*** they're measuring.

End of rant.

So SQL Server Denali CTP3 added some new analytic functions: PERCENT_RANK, CUME_DIST, PERCENTILE_DISC, and PERCENTILE_CONT. The first three I have no problem with. The Denali BOL describes how each works, and I can confirm it using sample data. Good. The PERCENTILE_CONT function, though…that's a little different.

BTW, there is a slight error in BOL for PERCENT_RANK. It says "The range of values returned by PERCENT_RANK is greater than 0 and less than or equal to 1." Actually, PERCENT_RANK can be, and in their example sometimes is, *equal to 0*. It's CUME_DIST that is greater than 0 and less than or equal to 1. Fine.
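The difference is easy to see in code. Here's a minimal Python sketch (my own illustration of the two definitions, not anything from BOL): PERCENT_RANK for a row is (rank - 1)/(N - 1), which is 0 for the first row of every group, while CUME_DIST is (rows at or below the current row)/N, whose smallest possible value is 1/N, always greater than 0.

```python
# A minimal sketch (my own, not from BOL) of PERCENT_RANK and CUME_DIST
# for a single group. Ranks are 1-based; tied values share a rank, as in SQL.

def percent_rank(values):
    n = len(values)
    s = sorted(values)
    # (rank - 1) / (N - 1); ties take the rank of the first equal row
    return [s.index(v) / (n - 1) for v in s]

def cume_dist(values):
    n = len(values)
    s = sorted(values)
    # (number of rows <= current value) / N; smallest possible value is 1/N
    return [sum(1 for x in s if x <= v) / n for v in s]

rates = [39.06, 48.5577, 60.0962, 125.50]  # the "Executive" group's rates
print(percent_rank(rates))  # first value is 0.0 -- PERCENT_RANK can equal 0
print(cume_dist(rates))     # first value is 0.25 -- CUME_DIST is always > 0
```

Run it against any of the groups in the sample query below and the two columns match.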

For PERCENTILE_CONT, the SQL Server BOL says: "Calculates a percentile based on a continuous distribution of the column value… The result is interpolated and might not be equal to any of the specific values in the column." That's sufficiently vague. Interpolated how? And "might not be equal to any of the specific values in the column"? That's as opposed to PERCENTILE_DISC, which is always equal to one of the specific values of the column, by definition.

I've also read, in two or three places on the web, that "PERCENTILE_CONT(X) examines the percent_rank of values in a group until it finds one greater than or equal to X." Nope. Not true. Doesn't jibe with "might not be equal to any of the specific values in the column".

So I was intrigued. And, last night, I tried to figure it out by running a variation of the BOL sample query.

USE AdventureWorks2008R2;

SELECT Name AS DepartmentName
      ,ph.Rate
      ,PERCENT_RANK() OVER(PARTITION BY Name ORDER BY ph.Rate) AS Percent_Rank
      ,PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY ph.Rate)
          OVER (PARTITION BY Name) AS MedianCont
      ,CUME_DIST() OVER(PARTITION BY Name ORDER BY ph.Rate) AS Cume_Dist
      ,PERCENTILE_DISC(0.5) WITHIN GROUP (ORDER BY ph.Rate)
          OVER (PARTITION BY Name) AS MedianDisc
FROM HumanResources.Department AS d
INNER JOIN HumanResources.EmployeeDepartmentHistory AS dh
    ON dh.DepartmentID = d.DepartmentID
INNER JOIN HumanResources.EmployeePayHistory AS ph
    ON ph.BusinessEntityID = dh.BusinessEntityID
WHERE dh.EndDate IS NULL;

Let's look at a few groups in the raw data (the columns are DepartmentName, Rate, Percent_Rank, MedianCont, Cume_Dist, and MedianDisc):

DepartmentName  Rate      Percent_Rank       MedianCont  Cume_Dist          MedianDisc
--------------------------------------------------------------------------------------
Executive       39.06     0                  54.32695    0.25               48.5577
Executive       48.5577   0.333333333333333  54.32695    0.5                48.5577
Executive       60.0962   0.666666666666667  54.32695    0.75               48.5577
Executive       125.50    1                  54.32695    1                  48.5577
Tool Design     8.62      0                  25          0.166666666666667  25.00
Tool Design     23.72     0.2                25          0.333333333333333  25.00
Tool Design     25.00     0.4                25          0.666666666666667  25.00
Tool Design     25.00     0.4                25          0.666666666666667  25.00
Tool Design     28.8462   0.8                25          0.833333333333333  25.00
Tool Design     29.8462   1                  25          1                  25.00

For "Executive", the value is interpolated, like the BOL says; it's not a value in the set. For "Tool Design", the MedianCont (PERCENTILE_CONT(0.5)) is 25, a value in the set, but not a value where the percent rank is >= 0.5 (it's 0.4). Oh. So I ripped out the rows that produced the result set and played with these for a while, adding new rows to each set and watching how the results changed. No joy. After yet another web search (I DID use web search, I swear, but missed this one the first 10 times) I came across this in the Oracle Database SQL Reference:

"The result of PERCENTILE_CONT is computed by linear interpolation between values after ordering them. Using the percentile value (P) and the number of rows (N) in the aggregation group, we compute the row number we are interested in after ordering the rows with respect to the sort specification. This row number (RN) is computed according to the formula RN = 1 + (P*(N-1)). The final result of the aggregate function is computed by linear interpolation between the values from rows at row numbers CRN = CEILING(RN) and FRN = FLOOR(RN).

The final result will be:

If (CRN = FRN = RN) then the result is
(value of expression from row at RN)
Otherwise the result is
(CRN - RN) * (value of expression for row at FRN) +
(RN - FRN) * (value of expression for row at CRN)".
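That formula is easier to check against the query output when it's runnable. Here's a small Python sketch of the Oracle algorithm as I read it (my own translation, not Microsoft's or Oracle's code):

```python
import math

def percentile_cont(values, p):
    """Linear-interpolation percentile per the Oracle SQL Reference formula."""
    s = sorted(values)
    n = len(s)
    rn = 1 + p * (n - 1)              # target (possibly fractional) 1-based row number
    frn, crn = math.floor(rn), math.ceil(rn)
    if frn == crn:                    # RN lands exactly on a row
        return s[int(rn) - 1]
    # otherwise, interpolate between the rows at FRN and CRN
    return (crn - rn) * s[frn - 1] + (rn - frn) * s[crn - 1]

print(percentile_cont([39.06, 48.5577, 60.0962, 125.50], 0.5))  # Executive: ~54.32695
print(percentile_cont([8.62, 23.72, 25.00, 25.00, 28.8462, 29.8462], 0.5))  # Tool Design: 25.0
```

It reproduces both MedianCont values from the result set above.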

SQL Server's numbers do agree with Oracle's for the "demo" (that's scott/tiger) database examples. I tried them (yes, I really have a "demo" database on Denali; it's lamer than even pubs, but they've replaced it now with more robust sample DBs). So let's work these two out.

For "Tool Design":

declare @p float = 0.5
declare @n int = 6
select 1 + (@p*(@n-1))           -- 3.5
select ceiling(1 + (@p*(@n-1)))  -- 4
select floor(1 + (@p*(@n-1)))    -- 3
-- value at row 3 = 25, value at row 4 = 25
select .5*25 + .5*25             -- 25

25 is the right answer.

For "Executive":

declare @p float = 0.5
declare @n int = 4
select 1 + (@p*(@n-1))           -- 2.5
select ceiling(1 + (@p*(@n-1)))  -- 3
select floor(1 + (@p*(@n-1)))    -- 2
-- value at row 2 = 48.5577, value at row 3 = 60.0962
select .5*48.5577 + .5*60.0962   -- 54.32695

54.32695 is the right answer.

Got it? If you think you have it, run the sample query and work out the "Finance" department. Then (if you're up late like I was; I worked out most/all of the groups in the sample query) go to sleep. You've earned it. And…you may never use PERCENTILE_CONT in your day-to-day work.

[Rant on] To reiterate: when looking at reports that contain statistical calculations, make sure you know what the statistic means. And check to make sure it DOES mean that, especially if you think the numbers look "weird" (i.e., don't jibe with the description). We aren't all rocket scientists. And rocket scientists may not have a background in statistics anyhow.

But I'll bet they (like everyone) have opinions. And are hankerin' to "prove" them.

@bobbeauch