So there’s all sorts of legacy behaviors in SQL Server that will keep me employable for years and years because it didn’t make sense when Microsoft started shipping SQL Server and there’s too many people to change it now
.
RAND() is one of those things that behaves differently in the various commercial database vendors, and it takes some time to grok what’s happening.
Let’s say that I want to get a random set of rows from a table – that seems like a nice thing to do, right?
So I go write:
SELECT * FROM MyTable WHERE RAND() > 0.5
This seems like a great first attempt to get about half the rows back, assuming a good pseudo-random number implementation. However, you’ll notice that if you run this, the behavior isn’t quite what might be expected (at least the way I’ve described it in this example).
So when one executes this query, you’ll notice that sometimes you get all rows from your table and other times you’ll get none. But wait – functions are run per-row, right? Well, no, at least not in all cases. The complete details are beyond a single blog posting, but for largely historical reasons the RAND() function is actually executed per query, not per row. That means that, logically, you get the query from the plan cache, get it ready to run, and then run RAND() once, caching the value. Then, each and every row uses that cached value when evaluating any scalar logic.
So, for this example, either that initial call to RAND() returned a value greater than .5 or it didn’t. As such, you get all or no rows only.
SQL Server added TABLESAMPLE as a mechanism to address part of this need – retreiving a sample from a table. However, it isn’t arbitrarily composable, and the semantics used in the default system implementation in SQL 2005 are actually intended to be used for statistics generation – a slightly different goal than quality sampling.
But wait – a lot of functions DO execute per row. How can I tell which is which? Well, I am not sure that there is a publicly exposed way, other than trial and error, to determine this. I need to go do some more homework, as my memory is, well, random.
Even worse, some function computations are moved around in query trees, and so the actual number of executions may not be what you think as “per row”. For example, if you execute a filter before/after the scalar logic, then the results may be different. Put another way, if you use an index seek vs. a table scan, how many times do you expect any old function to be executed?
It’s enough to make your head spin!
The basic rules are that, as long as the optimizer knows about the semantic issues, it will try very hard to not move around functions who have different semantics when executed different numbers of times.
So, be careful with your functions, built-in, user-defined, or whatever exotic form gets dreamed up next. There’s a lot going on under the covers, and knowing a few details can help you more deeply understand what a query is doing.
Thanks,
Conor Cunningham
3 Responses to Wrapping my head around RAND() in SQL Server
Conor,
Interesting topic. I tried unsuccessfully with two approaches to get different RAND() values for each row:
- Cross product join between the base rows and a separate "table" of (Select Rand() As RandomValue) As b – see the following attempted example using the AdventureWorks DB:
Select a.*, b.RandomValue
From Person.ContactType a
Cross Join (Select Rand() RandomValue) As b
The query has the same results as you reported – same Rand() value for every row (evaluated once per query instead of once per row as desired).
I also tried it reversing the order of the join, with similar results:
Select a.*, b.RandomValue
From (Select Rand() RandomValue) As b
Cross Join Person.ContactType As a
- Use the new Cross Apply feature in SQL 2005 – see the following attempted examples using the AdventureWorks DB and both join / apply sequences:
Select a.*, b.RandomValue
From Person.ContactType a
Cross Apply (Select Rand() As RandomValue) as b
Select a.*, b.RandomValue
From (Select Rand() As RandomValue) as b
Cross Apply Person.ContactType a
These queries had the same results as you reported – same Rand() value for every row (evaluated once per query instead of once per row as desired) – similar results as the following query:
Select *, Rand() As RandomValue
From Person.ContactType
The Rand function is a non-deterministic function. I wonder if this could be the root cause for evaluating at the query level versus the row level. As an experiment, I tried the following query using the GetDate() function (also non-deterministic):
Select *, GetDate() As RandomValue
From Person.ContactType
The results were similar to the Rand results – evaluated per query versus per row.
See the following Books Online link regarding deterministic and non-deterministic functions in SQL Server:
http://msdn2.microsoft.com/en-us/library/ms178091.aspx
Let us hear if you come up with a creative solution to this problem.
Scott R.
non-determinism is actually not _quite_ the limitation that causes this behavior, but conceptually you are close. I think you’d find that not all non-deterministic functions are executed once per query.
However, some functions are marked to be executed once per query, and this is one of them.
In terms of "working around this problem", TABLESAMPLE can address part of the original scenario, and you can always wrap a call in an user-defined function, as many of these are not reasoned about deeply by the query processor. So, build a user-defined function that calls the system’s random() routine and you can likely get a function that returns random values per row. I don’t believe that there are exact guarantees about movement of the UDFs, however, especially if you mark them as deterministic when you create them.
So, while it may work for a specific query and a specific plan, that wouldn’t work in general.
Conor
While working on a similar problem, I found that NEWID() worked for randomization… perhaps you could create a function that converts this into a random number? I threw together a quick example (so it may have issues).
References: http://www.sqlteam.com/forums/topic.asp?TOPIC_ID=81849
IF OBJECT_ID(‘vwGuid’) IS NOT NULL
DROP VIEW dbo.vwGuid
GO
CREATE VIEW dbo.vwGuid AS
SELECT Id = NEWID()
GO
IF OBJECT_ID(‘myRand’) IS NOT NULL
DROP FUNCTION dbo.myRand
GO
CREATE FUNCTION dbo.myRand(@Min INT, @Max INT) RETURNS INT AS
BEGIN
DECLARE @BinaryFloat BINARY(8)
SELECT @BinaryFloat = CAST(Id AS BINARY) FROM vwGuid
DECLARE @Part TINYINT,
@PartValue TINYINT,
@Mask TINYINT,
@Mantissa FLOAT,
@Exponent SMALLINT,
@Bit TINYINT,
@Ln2 FLOAT,
@BigValue BIGINT,
@RandomNumber FLOAT
SELECT @Part = 1,
@Mantissa = 1,
@Bit = 1,
@Ln2 = LOG(2),
@BigValue = CAST(@BinaryFloat AS BIGINT),
@Exponent = (@BigValue & 0x7ff0000000000000) / EXP(52 * @Ln2)
WHILE @Part <= 8
BEGIN
SELECT @Part = @Part + 1,
@PartValue = CAST(SUBSTRING(@BinaryFloat, @Part, 1) AS TINYINT),
@Mask = CASE WHEN @Part = 2 THEN 8 ELSE 128 END
WHILE @Mask > 0
BEGIN
IF @PartValue & @Mask > 0
SET @Mantissa = @Mantissa + EXP(-@Bit * @Ln2)
SELECT @Bit = @Bit + 1,
@Mask = @Mask / 2
END
END
SET @RandomNumber = CASE @Exponent WHEN 0 THEN 0 ELSE CAST(@Exponent AS FLOAT) / 2047 END
RETURN CAST((@RandomNumber * (@Max – @Min)) + @Min AS INT)
END
GO
–from your example
Select *, .dbo.myRand(1, 10) As RandomValue
From Person.ContactType