(I got CTP 6 running on another machine, so I'm working except for parallel plans now since it's only a single-proc.  You guys will have to wait on that operator of the day article for now :)

Kudos to my former teammembers for getting filtered indexes into CTP6.  It's pretty darn neat, and I'll show you a few tidbits that are interesting.

So, filtered indexes are not really a new "feature" in some sense.  You can do everything in a filtered index with an indexed view.  Of course, indexed view matching is only supported in the Enterprise edition of the software, so perhaps not everyone has seen those benefits. 

Indexed views are a tricky feature - generalized tree pattern matching is hard (read: CPU expensive), and if you've looked in the BOL the list of restrictions is so long that reminds me of filling out tax forms.  However, the other side effect that occurs when you have indexed views is that the optimizer has to go through additional phases of its search in order to apply them.  The optimizer has a couple of buckets of rules it can run, and most plans find a nice plan in the first or second round of rules.  Generalized indexed view matching is restricted to the last set of buckets, which usually means that there are a lot more rules that have to run before that view can be matched.  The bottom line is that the compilation cost isn't bad for a single query (usually), but it's the sort of thing that can make the difference between an application that can be used in UI response-time requirements or not.

Enter the filtered index.  This is a recognition that there are a lot of indexed views that don't need tons and tons of fancy equations, joins, etc.  These are single-tabled views that have simple predicates.  Once you get into this ballpark, you can bolt this on to the super-efficient index matching code and you can enable a whole new class of application from what you could build previously.  This is why I am excited about this feature.

I haven't looked to see where this feature will fall into the SKU matrix yet, and I'm sure that they're still pondering that very question.  However, you guys should play with this on CTP6 - it's nifty!

So, first things first.  Let's build one of these guys and see if we can match it:

create database t1
use t1
drop table t1
create table t1(col1 int, col2 nchar(2000), col3 time)

create index i1 on t1(col1) where col1 > 5 and col1 < 20
declare @p int 
set @p =0
while @p < 20000
begin
insert into t1(col1) values (rand()*10000)
set @p=@p+1
end
ok so I've filled this table up with a lot of useless data and created a nice little filtered index.
select * from t1  where col1 > 5 and col1 < 20
  |--Nested Loops(Inner Join, OUTER REFERENCES:([Bmk1000], [Expr1007]) OPTIMIZED WITH UNORDERED PREFETCH)
       |--Compute Scalar(DEFINE:([Expr1006]=BmkToPage([Bmk1000])))
       |    |--Index Scan(OBJECT:([t1].[dbo].[t1].[i1]))
       |--RID Lookup(OBJECT:([t1].[dbo].[t1]), SEEK:([Bmk1000]=[Bmk1000]) LOOKUP ORDERED FORWARD)

select * from t1  where col1 > 5 and col1 < 10
  |--Nested Loops(Inner Join, OUTER REFERENCES:([Bmk1000]) OPTIMIZED)
       |--Compute Scalar(DEFINE:([Expr1006]=BmkToPage([Bmk1000])))
       |    |--Index Seek(OBJECT:([t1].[dbo].[t1].[i1]), SEEK:([t1].[dbo].[t1].[col1] < (10)) ORDERED FORWARD)
       |--RID Lookup(OBJECT:([t1].[dbo].[t1]), SEEK:([Bmk1000]=[Bmk1000]) LOOKUP ORDERED FORWARD)

so in the first example I've created a query against the table that directly matches the index condition.  It matches and even generates an index scan (slightly faster since it doesn't have to navigate down the b-tree to start returning rows).

The second example is interesting because is a proper subset of the index.  The indexed view matching code can do subsumption as well.  It generates a seek in this case and returns only part of the filtered index.

Both cases do RID lookups back to the heap since I did SELECT * and the index is not "covering".

The other data point I'll leave you with this evening is that the showplan_xml has something interesting in it:

StmtSimple StatementText="select * from t1 where col1 > 5 and col1 < 10" StatementId="1" StatementCompId="1" StatementType="SELECT" StatementSubTreeCost="0.0474183" StatementEstRows="7" StatementOptmLevel="FULL" StatementOptmEarlyAbortReason="GoodEnoughPlanFound" ParameterizedText="(@1 tinyint,@2 tinyint)SELECT * FROM [t1] WHERE [col1]>@1 AND [col1]<@2">


Well, there are two interesting things.  First, StatementOptmLevel=FULL means that we didn't get a trivial plan.  There is a parameterized text field, but I don't think that is being used here.  these two queries are showing up as different plans in the plan cache (and obviously they have different plans). So, absent forced autoparam, I don't think that trivial plan is working on these guys.  This makes sense since there is a cost-based plan choice.

That's it for me tonight guys.  Happy Querying!

Conor

PS: Thanks for all of the comments and posts trying to help me get back to a working state.  I've tried many of the suggestions without luck, but I'm going to try playing with the orca tool, perhaps tomorrow or so, to see if I can get my main machine back up and working.

Categories:

oh well.  I've spent a fair amount of my day attempting to manually hack out a lot of the various registry keys associated with SQL Server 2008 in the hopes of upgrading from the November 2007 CTP to the Feb 2008 CTP on my main machine (Vista x64 box).  Since I received questions about whether I've gotten this to work, I figure I'd blog about where I am so you can compare with your own experiences.

Bottom line - No luck.  Here's the error I'm at now:

D:\temp\sqlctp6\servers>setup
The following error occurred:
MsiGetProductInfo failed to retrieve ProductVersion for package with Product Code = '{00E75F61-A126-4CE1-90B8-42295052F1AC}'. Error code: 1605.
Error result: 0xEEBAE95A
Result facility code: 0x6BA
Result error code: 0xE95A

I've learned a number of things - specifically that the non-SQL Server core Engine pieces have TONs of registry keys.  It seems that they took a shotgun and reserved a fraction fo the registry - SQLDMO and the DTS engine components seemed to be big offenders.

Either way, I'm getting blocked by my old friend the Windows Installer aka MSI.  Funnily enough, I've killed every key like this in the registry.  I'm also watching (via procmon.exe) and it's not asking for a registry key like this.  So, I guess MSI has stored something away in another data store and I haven't been able to find that to clean it up.  I've never had much patience for that thing.

I guess I'm in a bit of a reflective mood right now, as I am basically unable to install a product I had some hand in helping to write.  Certainly it's a beta, and I'm not upset that it has bugs in the installer per se - I'm willing to accept that.  What I'm less willing to accept is that the system can get into a state that is so obviously *&%$$%ed up and from which there is no real recovery path.

Let's look at where things failed here:
1. MSI storing information outside of the registry that impacts whether it runs or not (ie whether I can install or not).  No documentation on what to do when things get away from the good case.
2. GUIDs - customers don't care about them.. I don't care about them.  I can't remember numbers that long.  Why am I dealing with this?  I'm happy if you give me a number that I can pass to some other program, but this one seems not so useful.
3. The error message is not "actionable".  What do I do to fix this?  Where should I read additional information?
4. The "Error result" in this example.  Now, most people would have given up and called support... Instead I said "I should read up on this API and see what it says".  So I google for it (because loading help takes SO darn long and it asks me all sorts of privacy policy and other nonsense before I can get what I want AND it's not installed, of course ;) and I find the MSDN page on it - great.  It says nothing about this.
5. On the MSDN page, the errors are only listed using the C header literals - no hex codes on the page. grrr.  The installer didn't translate it for me back to the constant in the help page.
6. So I fire up Visual Studio 2008 and go to the error lookup tool, and I enter this code.  No good - it's not listed anywhere.  That seems bad.
7. I go back to the MSDN page, figure out that I should look in MSI.H.  Luckily, I have the Vista Platform SDK installed, so I go find that file.  Then I determine it's not even listed there either.
8. google on the error code by itself yields nothing of interest

So I'm left asking "Why is all of this so hard?".  The bottom line is that it SHOULDN'T be. 

Application design is an important aspect of any system.  As systems get more complex, it obviously becomes more challenging to make everything work.  It's important to determine the various use cases for your program, whatever it is, and then determine how you are going to manage the customer experience through all the steps of that process.  Here, the installation process failed on multiple levels and through multiple components.  The experience should be "your installation failed.  Here's how you can clean it up".  Even if that requires manual steps (it *is* a beta), there should be a path and at the end of that path should be a happy customer with a working computer installation.

I'm obviously left with options like "reinstall my OS", "install a VM" (which I won't do because I play games on this machine), or "find another machine for CTP6", which may be what I end up doing.

Conor finds himself longing for the days of .ini files all of a sudden.  he also apparently finds himself speaking in the third person about himself ;).  the old days of INI files would have been pretty easy to fix a failed install - "delete this directory and reinstall".

I hope this helps the others having trouble with CTP6 installation.  I'll see what I can do tomorrow to get a machine set up with SQL 2008 CTP 6.

Conor Cunningham

Categories:

Well, my dev box is in a bad state now - can't install the new CTP and I had to start manually hacking on the old one.

I may have to install SQL Server 2008 CTP 8 on a VM or on another machine.

So, I apologize for delays in posting up examples to answer your questions.

The lesson of the day is that one should just not trust installers anymore ;).

Conor

Categories:

One of the earlier comments I received asked about costing of T-SQL and CLR functions.  More specifically, they want to know how they are costed and why it is so low.

Let's create a motivating example.  I'll create a table with some size to it (I add a binary(2000) column to make each page take at least 2000 bytes, and this means that about 4 rows fit on a page.  So, if I add 100,000 rows to a table, we're talking about enough IOs to show up in the costing functions).
use t1

create table t3(col1 int, col2 binary(2000))
insert into t3(col1, col2) values (2, 0x1234)
insert into t3(col1, col2) values (3, 0x1234)
declare @p int
set @p=0
while @p < 100000
begin
insert into t3(col1, col2) values (rand()*1000, 0x1234)
set @p=@p+1
end


create function f3(@p1 int) RETURNS int
BEGIN
  return (select count(col1) from t3 where col1=@p1)
END

select dbo.f3(2)
(I apologize for my lack of real-world examples.. After you've done this a few thousand times you resort to the shortest names you can type ;).

ok, so we have a table and we create a T-SQL scalar function that just runs an aggregate to find the number of occurances of a particular value in the table.   Since there are no indees on t3 this will be a table scan and it will read lots of pages into the buffer pool.

So let's look at the costing output for each of the elements of this query.





Above is the plan for the base query (select dbo.f3(2)).  It emits a dummy row from a constant table and then does a "compute scalar" operation to run the function and generate a 1 row, 1 column result.

That cost doesn't look so big...

Now let's look at the plan for the scalar subquery (well, the top of it since we're looking for the total cost)






Well, the laws of physics appear to be violated. 

You can try whatever cost query you want in the function - unfortunately, these aren't really costed in the SQL Server QP (at least through 2005 and what I have seen so far on the 2008 CTPs).  Back when SQL Server 7 was being created, scalars were generally always cheap and they really didn't need costs.  There's a small amount of CPU cost added in a compute scalar and that generally worked for OLTP plans.  As SQL Server started being used in larger and larger scenarios, this issue started to manifest.

As SQL Server has added more features, some of them can be arbitrarily expensive and this issue can show up in plan selection as a problem.  Areas where you are likely to see this are T-SQL functions, CLR functions, string functions on varchar(max) and friends, and I'm sure that there are others.  However, it's worth noting that this won't be a problem in a lot of cases, and often workarounds are possible that may even be better design choices than doing what I've done in this example.  I don't recommend that you "hide" subqueries in scalar functions like this precisely because the optimizer can't see and cost them properly, even if they are as simple as this one.

There are a few tricky problems in "fixing" this (costing the scalar functions in line with their runtime).  First, this isn't exactly the sort of thing that people are going to beat down the gates at Microsoft to get fixed.  However, it is an area where the model doesn't work right, and when a customer does hit this they are often very far into a deployment or POC and can't exactly redesign their application easily.  Second, there are actually lots of really complex things that happen with the placement of computescalars and how they are evaluated in the optimizer and execution engine.  Finally, the SQL Server QP is set up to do relational transformations (A join B is equivalent to B join A, etc.), and the computation fo scalars isn't really a true relational operation in this algebra.  This makes true costing of compute scalars always something that is different than costing a join or a filter.  Third, changing the costing of such a basic operator will probably impact the plan selection for basically EVERY query, as their costs will change slightly - that's a big risk to fix a "little" bug. The last tricky bit in fixing this has to do with the way that T-SQL functions like this get exposed into the optimizer.  Not all queries really have known costs a priori.  The example function that I've given you does have a fixed cost, under some definition, but if I start putting procedural logic into the function, then the actual executed path depends on the runtime data, and that means that the cost is based on something you haven't interpreted yet.  So, there are some at least reasonable explanations as to why this issue persists to this day.

So my advice is that if you have cases where your scalar functions are undercosted and you think it impacts plan selection, then you need to contact Microsoft and let them know.  I have seen some cases of this, but I think that it's perhaps not the most common problem and that might have some impact on whether it gets addressed in the future.  I *guarantee* you that the people who work there are highly skilled and passionate about solving customer issues, as I worked with them, so I suspect that if there's enough squeaky wheels then this is something that may be addressed.

Your other option is to send me money, I guess ;).

I hope that gives you a bit of background on the issues you need to know when examining your query plans when you have expensive scalars in them.

Conor Cunningham




Categories:

ok so I haven't written anything about statistics yet on the blog, and Paul fielded a question from someone asking about details of the statistics implementation in SQL Server.  Luckily, I know a little bit about that.

So, this post will cover the published background material that I think everyone should read about SQL Server's statistics implementation, and then we can have a more detailed conversation about the nitty gritty fun details of them.

So this was written for SQL 2000 but the basic details apply to SQL Server 2005 and 2008.  I'd read this first.
http://msdn2.microsoft.com/en-us/library/aa902688(SQL.80).aspx

There was an additional white paper written for SQL Server 2005 that covers the new features that were added.  Lots of extensions where we refined the features from 2000 and made them work more effectively.
http://www.microsoft.com/technet/prodtechnol/sql/2005/qrystats.mspx


If you have questions about statistics, feel free to post them up.  If you wish to question statistics, then well I guess you should start with DBCC SHOW_STATISTICS. 

Conor Cunningham

Categories:

So I like kung-fu movies.  I don't care how bad the acting is, as long as there is some master/relative/friend who has been killed who needs avenged.  Often good movies are made this way.

In software, a good idea often comes out of getting tired of seeing bugs on a sore subject, often not even your fault.  So, in a fit of rage, I locked myself in my office for a few weeks and came up with the streaming table-valued function, since used as the proverbial duct-tape within the QP for all sorts of unforseen features.  However, it has a "walking on rice paper without leaving footprints" vibe, so I will tell you a bit about this guy ;).

(I did the QO part for this operator and a chunk of the architecture for it - there were others who worked on it too, however).

So a query operator takes rows in the bottom and spits rows out the top (yes, none of the exposed query plan graphs are drawn this way, but that's how people talk about it ;).  Some operators take more than one input - most of the ones with 2 inputs are called "joins".  Some can take an arbitrary number of inputs (usually called UNIONs).  Some take parameters (Compute Scalar).  most don't take parameters, however.

After you've done the 10 or so common operators in a query processor, there's a lot of specialty problems that don't really fit well into those operators.  So people end up trying all sorts of crazy paths to get the data they need out to customers.  Exuberence can often cause a few bugs, however, so eventually something needed to be done.

So, the "Streaming Table Valued Function" (STVF) is a relational operator (takes in rows and passes out rows).  In this case, it takes in some number of scalar parameters (like any good function) and spits out zero or more rows.  The beauty of this operator is that anyone can program anything to get returned, as long as they return back rows - it's like a user-defined operator for internal folks to write features for customers.  Amazingly, there are a lot of these, and you can see them in Dynamic Management Views (amongst others).

So, when you are looking at the plans in the plan cache or some other DMV, you are using this nifty operator:

SELECT usecounts, cacheobjtype, objtype, text 
FROM sys.dm_exec_cached_plans 
CROSS APPLY sys.dm_exec_sql_text(plan_handle) 
WHERE usecounts > 1 
ORDER BY usecounts DESC;
(actually you are using it _twice_ in this case, as there are two different sys.dm* references).

  |--Nested Loops(Inner Join, OUTER REFERENCES:(SYSDMEXECCACHEDPLANS.[plan_handle]))
       |--Sort(ORDER BY:(SYSDMEXECCACHEDPLANS.[usecounts] DESC))
       |    |--Filter(WHERE:(SYSDMEXECCACHEDPLANS.[usecounts]>(1)))
       |         |--Table-valued function
       |--Table-valued function

In this plan, you can see that rows are pulled from one TVF, are filtered, then sorted, then passed into the second TVF.

The "OUTER REFERENCE" is the key that the loops join is passing something from the inner side to the outer side (remember in the crazy syntax used here, the first row below the Nested Loops is the "inner" and the second line is the "outer".  So, the Sort/Filter/TVF is the "inner" in this case.

These things differ from the traditional TVFs that you could create in SQL Server as a user.  Those were actually backed by temporary tables in many cases, and this made it a pretty heavy-weight solution if you just want to pass a few rows into the server.  _Streaming_ sounds a lot better than that, you must admit.

Many operators in a QP are streaming - filter is, as is a table scan.  However a sort operator is called a "stop-and-go" operator because it generally consumes all of its input rows before producing the first output row.  Streaming TVFs would fall into the "not stop-and-go" category.

This operator is used heavily starting in SQL 2005 in lots and lots of places.  It is not really something you can directly reference (there is no exposed syntax for this operator), and as such it requires a bit of hunting to track down the various places it gets used.

I'll leave it as an exercise for you guys to find a few non-DMV places that use STVFs.  Any TVF that takes arguments is likely an STVF.  So go look at those query plans, folks!  First one who tells me a non-DMV/DMF STVF feature gets a star next to their name! 

I also know that there is at least one new feature in SQL Server 2008 that uses the STVF.

Happy querying!

Conor Cunningham

Categories:

The blog was down a bit today - sorry for all who got to see CLR stack traces pushed through HTTP (I *will* do a post on that sometime ;)...


A reader sent in a question about my previous RAND() post (here) and wanted to know if other functions were pre-evaluated.  The specific example was about DATEADD(), but my answer will apply to most functions.

The short answer is "DATEADD() is not pre-evaluated once per query in the same way as RAND() is".

Most functions in SQL Server are evaluated once per row, just as you'd expect.  Even most non-deterministic functions are executed once per row.  The key reason for this is that often these functions are run over column data from a row in a table, and the value can change for each row.

So, this dateadd function is executed once per row:
select DATEADD("dd", 5, col1) from t1
Here's part of the showplan_xml output for this query
            <ColumnReference Column="Expr1004" />
                    <ScalarOperator ScalarString="dateadd(day,(5),[t1].[dbo].[t1].[col1])">
                      <Intrinsic FunctionName="dateadd">
                        <ScalarOperator>
                          <Const ConstValue="(4)" />
                        </ScalarOperator>
                        <ScalarOperator>
                          <Const ConstValue="(5)" />
                        </ScalarOperator>
                        <ScalarOperator>
                          <Identifier>
                            <ColumnReference Database="[t1]" Schema="[dbo]" Table="[t1]" Column="col1" />
                          </Identifier>
                        </ScalarOperator>
                      </Intrinsic>
                    </ScalarOperator>

And here is what you see in the showplan_xml for RAND():
select RAND() from t1
                    <ScalarOperator ScalarString="rand()">
                      <Identifier>
                        <ColumnReference Column="ConstExpr1005">
                          <ScalarOperator>
                            <Intrinsic FunctionName="rand">
                              <ScalarOperator>
                                <Const ConstValue="" />
                              </ScalarOperator>
                            </Intrinsic>
                          </ScalarOperator>
                        </ColumnReference>
                      </Identifier>
                    </ScalarOperator>

Now, those of you paying close attention will notice that the _name_ of the column is "ConstExprXXXX" instead of "ExprXXXX".

I don't think that I'd rely on the name meaning much, and I don't even recall if there are other things that use ConstExprXXXX names.  It might be a good indicator that you are getting warmer if you are looking for which functions are executed once at the start of the query.

If you use the query patterns that I posted and that others posted in comments, you can determine what is happening for each function you wish to investigate.

You can probably look through the list of public functions (or get fancy and run strings.exe on the sql server binary) to look for function names that you can evaluate to see if they are showing up as ConstExpr or non-ConstExpr.  One that comes to mind is getdate() - it's non-deterministic and you likely don't want the results to change during a long-running query.  I'll leave you guys to go find any others - I don't believe that there lots of these types of functions.

As a side note for all you readers out there, I won't be talking about internal details of the system beyond what is published or what is discussed publicly.  So, you won't be hearing from me about SQL Server's secret plans for including bullhorns in every box of SQL Server or how the logging feature is really made from Area 51 technologies.  However, I will tell you guys the basics of database and QP design and show you the *external* places where you can look to see what information Microsoft makes available to you.  Most of the time I plan to point you to some experiment you can run that will tell you, with a little context, enough to answer whatever question you have.

Have a great weekend all - I'm back to painting a room in my house.

Conor Cunningham

Categories:

I've been playing with the new grouping sets feature now that it's in a CTP, partially because I had a small hand in them before I left Microsoft, but partially because I'm trying to re-examine features from a more external perspective.  So, I figure this is a good conversation starter.

So, as all programmers are lazy, I started with my former collegue Craig's blog post on the subject (http://blogs.msdn.com/craigfr/archive/2007/10/11/grouping-sets-in-sql-server-2008.aspx).  Craig's a query execution guy by training (and a darn good one), so many of Craig's blog posts are pretty complimentary to my background.  Anyways, his post doesn't talk about the various plans.  Part of the reason is that GROUPING SETS contains a fair amount of syntactic sugar.  It's a feature that lets you write CUBEs and ROLLUPs in a different way. It's slightly more general and powerful, but I'm still trying to get my head around why Oracle/IBM pushed to get this added to the ANSI spec.  You can do most of this with CUBE + filters (and a good query optimizer).

So let's talk about cube and rollup plans from an outsider's perspective, as I haven't seen anyone write about this in the depth that I would like.

If I take a the rollup query from Craig's example in SQL Server 2008's November CTP, I can see the following plan:



ok, two aggregates - what's up with that?  Well, if you look at the properties for the second aggregate, you can see that there is an extra bit of information that tells you that it's actually not a classical aggregate operator:



ok, so this guy does some special magic but it is represented in the external query plan as an aggregate function evaluator.  It just adds the extra rows that you see in a ROLLUP query. 

The key part of ROLLUPs is that they can be computed in one linear pass of the data (sorted appropriately).

CUBE plans are a little trickier:


So the QP conatenates a number of different branches of operations, and it actually ends up being a way to compute the matrix of different dimensions - if you have more columns in your cube query, you'll get more branches.  If you look at the second "stream aggregate" in a branch with 2, you'll see the rollup information just like in the rollup example. 

So CUBE is a fancier form of ROLLUP.  Obviously it has one sort per dimension, which is EXPENSIVE on big tables.  (You could have indexes for each branch, which could make things faster, assuming you have the disk space for the indexes). 

In SQL 2005, the plan is slightly different for Craig's example:



This plan is great if the size of the data going through the CUBEing operation is much smaller than the size of the table.  In this case, it writes the data to a temporary spool, sorts the output of that, and then re-reads that spool in the second branch.  I don't recall if SQL Server 2005 generated the SQL 2008 plan or not, but the SQL 2008 plan is potentially more parallelizable since each branch is independent.  Note that reading the base table twice could open you up to more interesting locking, but this can happen in any query with multiple scans of the same table if your isolation level is too low.  I haven't tried to create a parllel CUBE plan on SQL 2008, but I'd guess it is possible...

ok, so now we can add in GROUPING SETS.  These are essentially CUBEs with the ability to prune branches you don't need.  This can get more interesting when you have lots of columns on which you are CUBEing, but the example from Craig that I'm using here doesn't do much beyond CUBE.

Technically, one could try to start filtering NULLs from the result of a CUBE and perhaps the query optimizer could work backwards to deduct that it didn't need to compute various branches of a CUBE like GROUPING SETs can do.  Obviously it's easier when the customer just tells the QP what it wants directly.

I haven't tried to exhaustively go through the various kinds of plans likely in SQL Server.  There aren't really that many with ROLLUP/CUBE, at least not from what I've seen.  However, I hope this gives you an intro into how they work and how you can look at the query execution plan to see what is happening.

Now's the part where I get to ask for your input so I can learn something ;).  I want to hear about scenarios where you use CUBE/ROLLUP today - please send me a mail at conor@sqlskills.com!  I'll post up a summary of what I learn.  If you plan to use GROUPING SETS, let me know that too.  I'm very curious about what scenarios people do relational computations like this (as opposed to using Analysis services, for example).

Thanks for reading!

Conor Cunningham

Categories:

So there's all sorts of legacy behaviors in SQL Server that will keep me employable for years and years because it didn't make sense when Microsoft started shipping SQL Server and there's too many people to change it now ;).

RAND() is one of those things that behaves differently in the various commercial database vendors, and it takes some time to grok what's happening.

Let's say that I want to get a random set of rows from a table - that seems like a nice thing to do, right?

So I go write:

SELECT * FROM MyTable WHERE RAND() > 0.5

This seems like a great first attempt to get about half the rows back, assuming a good pseudo-random number implementation.  However, you'll notice that if you run this, the behavior isn't quite what might be expected (at least the way I've described it in this example).

So when one executes this query, you'll notice that sometimes you get all rows from your table and other times you'll get none.  But wait - functions are run per-row, right?  Well, no, at least not in all cases.  The complete details are beyond a single blog posting, but for largely historical reasons the RAND() function is actually executed per query, not per row.  That means that, logically, you get the query from the plan cache, get it ready to run, and then run RAND() once, caching the value.  Then, each and every row uses that cached value when evaluating any scalar logic. 

So, for this example, either that initial call to RAND() returned a value greater than .5 or it didn't.  As such, you get all or no rows only.

SQL Server added TABLESAMPLE as a mechanism to address part of this need - retreiving a sample from a table.  However, it isn't arbitrarily composable, and the semantics used in the default system implementation in SQL 2005 are actually intended to be used for statistics generation - a slightly different goal than quality sampling.

But wait - a lot of functions DO execute per row.  How can I tell which is which?  Well, I am not sure that there is a publicly exposed way, other than trial and error, to determine this.  I need to go do some more homework, as my memory is, well, random.

Even worse, some function computations are moved around in query trees, and so the actual number of executions may not be what you think as "per row".  For example, if you execute a filter before/after the scalar logic, then the results may be different.  Put another way, if you use an index seek vs. a table scan, how many times do you expect any old function to be executed?

It's enough to make your head spin! 

The basic rules are that, as long as the optimizer knows about the semantic issues, it will try very hard to not move around functions who have different semantics when executed different numbers of times.

So, be careful with your functions, built-in, user-defined, or whatever exotic form gets dreamed up next.  There's a lot going on under the covers, and knowing a few details can help you more deeply understand what a query is doing.


Thanks,

Conor Cunningham



Categories:

I previously blogged about the sequence project operator, which I helped architect/develop while working on the SQL Server QP (here).  In that same post, I promised to tell you about the Segment operator in more detail as well.

Segment is largely a "helper" operator in SQL Server.  It does very little and there isn't a direct way to create one.  They are often used to help facilitate more complex operations.

Every operator in SQL Server's QP takes rows of input and produces rows of output.  Some have multiple children (joins), while others are just unary.  Segment is a unary operator that takes rows of input and determines if some set of columns have changed from the previous row.  Usually, the input on those columns are ordered or grouped to make this meaningful.  An extra column is created, like in a compute scalar operation, which records if the value has changed.  That computed value is then used to drive other physical operators.

You can see what columns are considered in each segment by looking at showplan_xml (from the ranking query in my previous post):

                <RelOp NodeId="1" PhysicalOp="Segment" LogicalOp="Segment" EstimateRows="1000" EstimateIO="0" EstimateCPU="2e-005" AvgRowSize="23" EstimatedTotalSubtreeCost="0.0328114" Parallel="0" EstimateRebinds="0" EstimateRewinds="0">
                  <OutputList>
                    <ColumnReference Database="[t1]" Schema="[dbo]" Table="[z]" Column="col1" />
                    <ColumnReference Database="[t1]" Schema="[dbo]" Table="[z]" Column="col2" />
                    <ColumnReference Column="Segment1005" />
                    <ColumnReference Column="Segment1006" />
                  </OutputList>
                  <Segment>
                    <GroupBy>
                      <ColumnReference Database="[t1]" Schema="[dbo]" Table="[z]" Column="col2" />
                    </GroupBy>
                    <SegmentColumn>
                      <ColumnReference Column="Segment1006" />
                    </SegmentColumn>


The grouping columns are listed, and you can see these "Segment100x" columns that get fabricated along the way.  These are not surfaced outside the query, so you can't see the values of this directly.  However, you can see if the grouping columns have changed.

So, in this plan we have two segment columns on different grouping column sets, both of these passed into the sequence project (you can only actually see that they stop being propogated in the showplan_xml, but the QP is usually smart enough to not pass things along unnecessarily, so that usually means that the buck stops where it stops for a reason :).

Remember, all of the operators can change from release to release.  I don't believe any of this changed substantally from 2005 to 2008 based on what I can see in the plans.

That concludes our daily episode of the operator of the day.  You guys have a good weekend.  I am enjoying warmth and sunshine in Austin in February, which I can tell you is a substantial change from my life while in Seattle.


Categories:

Theme design by Nukeation based on Jelle Druyts