So there’s all sorts of legacy behaviors in SQL Server that will keep me employable for years and years because it didn’t make sense when Microsoft started shipping SQL Server and there’s too many people to change it now ;).
RAND() is one of those things that behaves differently in the various commercial database vendors, and it takes some time to grok what’s happening.
Let’s say that I want to get a random set of rows from a table – that seems like a nice thing to do, right?
So I go write:
SELECT * FROM MyTable WHERE RAND() > 0.5
This seems like a great first attempt to get about half the rows back, assuming a good pseudo-random number implementation. However, you’ll notice that if you run this, the behavior isn’t quite what might be expected (at least the way I’ve described it in this example).
So when one executes this query, you’ll notice that sometimes you get all rows from your table and other times you’ll get none. But wait – functions are run per-row, right? Well, no, at least not in all cases. The complete details are beyond a single blog posting, but for largely historical reasons the RAND() function is actually executed per query, not per row. That means that, logically, you get the query from the plan cache, get it ready to run, and then run RAND() once, caching the value. Then, each and every row uses that cached value when evaluating any scalar logic.
So, for this example, either that initial call to RAND() returned a value greater than .5 or it didn’t. As such, you get all or no rows only.
SQL Server added TABLESAMPLE as a mechanism to address part of this need – retreiving a sample from a table. However, it isn’t arbitrarily composable, and the semantics used in the default system implementation in SQL 2005 are actually intended to be used for statistics generation – a slightly different goal than quality sampling.
But wait – a lot of functions DO execute per row. How can I tell which is which? Well, I am not sure that there is a publicly exposed way, other than trial and error, to determine this. I need to go do some more homework, as my memory is, well, random.
Even worse, some function computations are moved around in query trees, and so the actual number of executions may not be what you think as “per row”. For example, if you execute a filter before/after the scalar logic, then the results may be different. Put another way, if you use an index seek vs. a table scan, how many times do you expect any old function to be executed?
It’s enough to make your head spin!
The basic rules are that, as long as the optimizer knows about the semantic issues, it will try very hard to not move around functions who have different semantics when executed different numbers of times.
So, be careful with your functions, built-in, user-defined, or whatever exotic form gets dreamed up next. There’s a lot going on under the covers, and knowing a few details can help you more deeply understand what a query is doing.