Sunday, January 27, 2008
For a number of years, the marketing for SQL Server's competitors made a large number of claims that SQL Server couldn't handle "Enterprise"-scale deployments.  Over time, those claims became much quieter as Microsoft used SQL Server extensively to run their whole business, through multi-terrabyte reference cases, and the hugely successful TerraServer project where Jim Gray led a team trying to build a huge database of satellite images using SQL Server.

This doesn't mean that there aren't challenges for all of the major vendors in adding features to help their products scale.  Each of them have some failings in some of their product offerings, so please don't interpret this as a "Microsoft SQL Server can't do the job" post.  On the contrary, I think that there is little that it can't do, but I know where the limitations get hit and I have some expertise in working around the current limitations.  Each Data Warehouse problem is a bit different, so the kinds of problems I'll discuss here may or may not happen for any given deployment.

In my previous post, I discussed how subqueries are approached by the query optimizer (as joins), and this can lead to problems when the cardinality or cost estimates are incorrect.  Often when an estimate is incorrect, "unrolling" the subquery by hand and creating a temporary table for a portion of the query will correct the mistake.  Splitting a larger query like this makes the operation more procedural and makes each piece smaller.  It also happens to give the optimizer better cardinality estimates for the portion of the query that has been unrolled - these better estimates may help the rest of the query perform more to your expectations.  This all assumes that:
1. You have the skills and desire to unroll a subquery
2. You know how to unroll it to give the optimizer information to help avoid cardinality estimate errors
3. That the query is known a priori.  Dynamically-generated queries are harder to hint

The best way to track down estimation errors is to find some time when you can run a problematic query with "set statistics profile on" enabled.  While this does make the query run more slowly, it captures useful information to help identify areas of a query where the estimated and actual cardinality are materially incorrect.  I've blogged about this previously (in previous blogs, not this one), so I'll do a post linking some of the useful backgrounders on this area.

Data Warehouses are usually the largest of the large in terms of SQL Server deployments, and usually they have a dedicated DBA to help keep it working.  This is partially because any business problem worth hundreds of thousands of dollars in hardware is also probably worth the salary of a DBA to keep it working.  However, the other side of this coin is that there aren't enough data warehouses in the world to make it truly a commodity software purchase.  As such, lots of little things can and do go wrong in data warehouse deployments that often require expertise to fix.

Let's talk about some of the reasons why the query optimizer may make incorrect guesses on data warehouse tables, starting with some of the easier ones

1. missing statistics.  I've seen a bunch of customers disable auto-create statistics because "someone told them to do so" or "I don't trust those things".  Unless you are going to hand-tune every query, you should likely leave auto-create enabled.

2. out-of-date statistics.  This is a bit trickier.  Column statistics are updated when a certain number of changes are made, and sometimes that hasn't been triggered.  You can manually update statistics in those cases.  For customers who disable auto-update statistics (<2%, but this happens a lot on the higher-end deployments), you need to have a schedule to update those stats so that the optimizer has a fighting chance.

3. a statistics sample that does not reflect the whole data set.  There are a number of assumptions made during statistics creation that are more pragmatic than mathematically proven.  To make auto-stats possible, only a small percentage of the pages are actually used to create statistics for non-indexed columns.  If those pages don't cover the whole data range well, queries over the missing ranges could get poor estimates.

4. Fundamental limits in the statistics object.  Currently there can be up to 200 steps in the histogram that is created.  If your table's data has less than 200 interesting spikes in the distribution curve, then you are probably fine.  However, your fact table may benefit from more.  Unfortunately, this can be a problem even with full stats enabled in SQL 2005 and, from what I have seen so far, in the CTPs of SQL 2008.

5. model limits in the optimizer.  Optimizers make assumptions so that they can compile a query in a reasonable amount of time.  For example, an optimizer may assume that columns are, in general, statistically independent of each other for the purposes of cardinality estimation because it's the only computationally feasible assumption to make.  This assumption is very often not true, and it can introduce errors in cardinality estimation that can lead to poor plan selection.  As databases get larger, these "out of model" errors can get larger and require more hand-holding to make sure that a given query gets a good plan from the optimizer.


I hope this gives some insight into how cardinality estimation impacts join optimization and thus subquery optimization in SQL Server 200x.

Thanks,

Conor

Sunday, January 27, 2008 2:12:04 PM (Central Standard Time, UTC-06:00)  #    Comments [0]  | 
I've been trying to think of a good way to introduce how SQL Server does subqueries based on a question from a reader.  One of the tricky pieces to query optimization is that the "language" used internally in the optimizer is actually somewhat more expressive than SQL.  The second part of the problem is that, for people who have used other products at one time or another, it's a challenge to get them to re-learn something to grok how SQL Server does it (and even to figure out what they need to re-learn).

So I'm going to do a series of posts on subqueries and the related pieces behind subqueries to put them in context.

Let's start at the beginning.  A subquery is a fancy filter condition where you can specify an query to run for each row of the outer query.  This doesn't mean that's how the optimizer will actually execute it, but you can start thinking about it like that:

SELECT * FROM MyTable WHERE productkey IN (SELECT * FROM Orders where MyTable.productkey=Orders.productkey)


One way to execute this query is to read each row from MyTable, get productkey, start up another query to look through orders, and determine if we want to keep the row from MyTable based on the result of that second query.  This is how some of the other QP's on which I have worked attack subqueries and is perfectly valid.

In the internal lanaguage of the optimizer (and you can read this in output plans), you will typically see that this is executed physically in SQL Server using a "semi-join", which you should imagine as a join operator that does what we've described within one single query instead of running an extra query per row.  A semi-join will return rows from only one side of the join (left or right semi join).  These semijoins are nice because you can actually re-use your physical join implementation to execute them.  They are also nice because you can also leverage your tree rewrite logic and cardinality estimation logic to help you think about these like joins as well.  So, you can reorder and rewrite the query into alternative, equivalent forms that might run more quickly.

The nice part about this approach is that you get the benefit of a lot of work and experience estimating cardinality for joins, which helps the SQL Server Optimizer make decisions about these rewrites.  The bad part about this is that in queries with many joins or queries with lots and lots of data, your odds of getting a mistake in join cardinality/cost estimation go up and up, and your opportunity to get a plan that is far from optimal also go up.

My next post will cover some of these things in more detail - off to get my kid up.

Conor
Sunday, January 27, 2008 7:59:43 AM (Central Standard Time, UTC-06:00)  #    Comments [3]  | 
Tuesday, January 22, 2008
This post builds upon my previous post on some of the various kinds of plans for the IN operator.  I will cover some of the query plan cache behavior associated with IN plans.

For this post we'll be using the following query which will let you look at the text of plans in the cache and also see how many plans are in the cache.  (One can use "dbcc freeproccache" to clean out the cache whenever you are testing this - please don't do this on a production system unless you know what you are doing)

(Hey, I figured out how to post code properly in the blog.  and there was much rejoicing! yeah!)
SELECT text, plan_handle, * FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(sql_handle)
CROSS APPLY sys.dm_exec_text_query_plan(qs.plan_handle, qs.statement_start_offset, qs.statement_end_offset) AS qp

OK, the two things of interest are the number of rows returned and the text of each plan.  I will note that the data in these sys.dm_exec* functions is "mostly up to date", but it doesn't follow an exact transactional semantic - so, you may need to wait a second or two to get the "proper" state of the plan cache after each step.

If you run dbcc freeproccache, then this query, you'll see one entry in the result (for the query you ran to get the status).  If we run query 1 from our previous post (select * from t1 where col2 in (1)) and look at the result, we get something interesting:
(@1 tinyint)SELECT * FROM [t1] WHERE [col2]=@1

Hey, what's this?  Well, this is called "auto-parameterization" in SQL Server, and it happens on Trivial Plans (see previous post).  The idea is that SQL Server has a fancy query processor, but fancy means powerful for big queries and perhaps not the fastest on the small stuff.  In general, we're not saying that it's slow, but if you are compiling thousands of these things right in a row, you may notice a difference.  So, this autoparameterization feature is an attempt to reduce compilation time by finding plans that have no cost-based decisions and then templatize-ing them so that all other queries that look the same can re-use the plan.  This has a huge advantage for customers with query-generation components, as they often will generate many, many queries that only differ by the constants.  So, SQL Server uses this on trivial plans.

If you run the query "select * from t1 where col2 in (2)" (so, the constant is changed but nothing else), then run our big fancy cross apply query, then you see 2 query plans instead of 3.  Both of our new queries use the following plan:

(@1 tinyint)SELECT * FROM [t1] WHERE [col2]=@1

Now, running queries with more than one element in the IN list generates a plan for each of them.  (I don't recall exactly why off-hand, but I'll think about it some more)

So, you can see the plan entries for a range of variations.

select  * from t1 where col2 in (2,1)
select * from t1 where col2 in (1,2,3,4,5,6,7,8,9,10)
(@1 tinyint)SELECT * FROM [t1] WHERE [col2]=@1
select * from t1 where col2 in (2,1)
select * from t1 where col2 in (1,2)
select * from t1 where col2 =1 or col2=2

I'll point out that, except for trivial plans that are auto-parameterized, the plan text is used as the "key" into this cache.  It does not logically compare the output plans or even a normalized form of the input.  So, you'll notice that "
select  * from t1 where col2 in (2,1)" is in the cache twice because I put a space in the second time I ran that query.  So, you need to consider this when we get to the original question about plan cache pollution.  If your query generation engine (LINQ or whatever) adds comments into the query text sent to SQL Server, that could very well lead to more plans in the cache.

The more complex queries from my last post also generate unique plan entries in the same way, so I'll skip those here.

The next interesting factor to consider is forced parameterization.  This was added in SQL Server 2005, and it is both powerful and dangerous.  This feature is a database level feature, most literals are converted to paramters before query optimization in most query plans.  This is covered under the "Forced Parameterization" topic in SQL Server books online, so I'll suggest you read that if you want all the details.  I did a post or two on plan sensitivity to different parameter values, so beware that in some cases you may end up with query plans that are suboptimal and are stuck in the cache.  So, you should experiment with this setting to make sure that it really does benefit your application before you turn it on.

As I like to live dangerously, I will ignore all of these warnings and proceed with great vigor in the name of Science!

Let's try three queries that yielded separate plans without forced parameterization enabled:
select * from t1 where col2 in (2,1)
select * from t1 where col2 in (1,2)
select  * from t1 where col2 in (2,1)
Now we get one entry in the cache, and it looks similar to those auto-parameterized queries:
(@0 int,@1 int)select * from t1 where col2 in ( @0 , @1 )

Notice that even spaces are "fixed" in terms of cache hits in this mode.  The query is parsed and then the text is generated from the parsed tree, which gives one the ability to normalize the input SQL for the plan cache lookup.

Adding a 3-element IN list query generates a new plan in the cache:
select * from t1 where col2 in (1,2,3)
(@0 int,@1 int,@2 int)select * from t1 where col2 in ( @0 , @1 , @2 )
(@0 int,@1 int)select * from t1 where col2 in ( @0 , @1 )

OK, so we can get a plan per number of elements, which perhaps has its advantages in some cases.  Good to know.

I also tried our large cardinality IN list example (the one that generated the semi-join in my previous post):
(@0 int,@1 int,@2 int,@3 int,@4 int,@5 int,@6 int,@7 int,@8 int,@9 int,@10 int,@11 int,@12 int,@13 int,@14 int,@15 int,@16 int,@17 int,@18 int,@19 int,@20 int,@21 int,@22 int,@23 int,@24 int,@25 int,@26 int,@27 int,@28 int,@29 int,@30 int,@31 int,@32 int,@33 int,@34 int,@35 int,@36 int,@37 int,@38 int,@39 int,@40 int,@41 int,@42 int,@43 int,@44 int,@45 int,@46 int,@47 int,@48 int,@49 int,@50 int,@51 int,@52 int,@53 int,@54 int,@55 int,@56 int,@57 int,@58 int,@59 int,@60 int,@61 int,@62 int,@63 int,@64 int,@65 int)select * from t1 inner join t2 on t1 . col1 = t2 . col1 where t1 . col2 in ( @0 , @1 , @2 , @3 , @4 , @5 , @6 , @7 , @8 , @9 , @10 , @11 , @12 , @13 , @14 , @15 , @16 , @17 , @18 , @19 , @20 , @21 , @22 , @23 , @24 , @25 , @26 , @27 , @28 , @29 , @30 , @31 , @32 , @33 , @34 , @35 , @36 , @37 , @38 , @39 , @40 , @41 , @42 , @43 , @44 , @45 , @46 , @47 , @48 , @49 , @50 , @51 , @52 , @53 , @54 , @55 , @56 , @57 , @58 , @59 , @60 , @61 , @62 , @63 , @64 , @65 )


Yes, it seems to keep on going.  For once, I read the documentation and determine that one of the disqualifying conditions is:

Statements that contain more than 2,097 literals that are eligible for parameterization.

So, it will stop at some point based on how many literals you have in your query.

So while we're not up to LINQ yet, I think we have a pretty good basis to evaluate whether LINQ or any other query generator is going to cause plan cache issues.  I'll try to do some research on the shipping version of LINQ and see what I can determine by examining the plans that get generated.

That's enough for one evening from me.

Thanks,

Conor
Tuesday, January 22, 2008 9:02:28 PM (Central Standard Time, UTC-06:00)  #    Comments [1]  | 
In this post, I'll go over a few things about IN plans in the SQL Server Query Optimizer.  Currently I'm using the SQL Server 2008 November CTP, but this behavior is still valid on SQL 2005.

I'll be using the following script for my discussion, and you can try this on your own.

create database in1
use in1

create table t1(col1 int identity, col2 int default rand()*100000, col3 binary(2000))

declare @i int
set @i=0
while @i < 10000
begin
insert into t1 default values
set @i=@i+1
end

create table t2(col1 int identity, col2 binary(200))

-- plan 1
select * from t1 where col2 in (1)

-- plan 2
select * from t1 where col2 in (1,2,3,4,5,6,7,8,9,10)

-- plan 3
select * from t1 where col2 in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)

-- plan 4
select * from t1 inner join t2 on t1.col1=t2.col1
where t1.col2 in (
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,
51,52,53,54,55,56,57,58,59,60,61,62,63,64)

-- plan 5
select * from t1 inner join t2 on t1.col1=t2.col1
where t1.col2 in (
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,
51,52,53,54,55,56,57,58,59,60,61,62,63,64,65)


So, we have a database with 2 tables, and I created a bunch of random data in a column so that we can play with IN plans.


ok, plan 1 is "easy".  I use IN with a single operator.  This happens to turn into a simple plan that looks like this (using set showplan_text to view):

  |--Table Scan(OBJECT:([in1].[dbo].[t1]),WHERE:([in1].[dbo].[t1].[col2]=CONVERT_IMPLICIT(int,[@1],0)))

Hrm.  hey, the where clause is in the table scan?  no separate operator?  It's a physical optimization that I'll cover in another post (I may have covered this in my previous blog posts - I'll go check), and it's called "pushing non-SARG predicates" in SQL Server.

Let's compare with the plan for plan 2:
  |--Table Scan(OBJECT:([in1].[dbo].[t1]), WHERE:([in1].[dbo].[t1].[col2]=(1) OR [in1].[dbo].[t1].[col2]=(2) OR [in1].[dbo].[t1].[col2]=(3) OR [in1].[dbo].[t1].[col2]=(4) OR [in1].[dbo].[t1].[col2]=(5) OR [in1].[dbo].[t1].[col2]=(6) OR [in1].[dbo].[t1].[col2]=(7) OR [in1].[dbo].[t1].[col2]=(8) OR [in1].[dbo].[t1].[col2]=(9) OR [in1].[dbo].[t1].[col2]=(10)))

Same plan shape, just more conditions.

I wonder if this goes on forever?  Well, plan 3 looks different:

  |--Filter(WHERE:([in1].[dbo].[t1].[col2]=(1) OR [in1].[dbo].[t1].[col2]=(2) OR [in1].[dbo].[t1].[col2]=(3) OR [in1].[dbo].[t1].[col2]=(4) OR [in1].[dbo].[t1].[col2]=(5) OR [in1].[dbo].[t1].[col2]=(6) OR [in1].[dbo].[t1].[col2]=(7) OR [in1].[dbo].[t1].[col2]=(8) OR [in1].[dbo].[t1].[col2]=(9) OR [in1].[dbo].[t1].[col2]=(10) OR [in1].[dbo].[t1].[col2]=(11) OR [in1].[dbo].[t1].[col2]=(12) OR [in1].[dbo].[t1].[col2]=(13) OR [in1].[dbo].[t1].[col2]=(14) OR [in1].[dbo].[t1].[col2]=(15) OR [in1].[dbo].[t1].[col2]=(16)))
       |--Table Scan(OBJECT:([in1].[dbo].[t1]))

Now the filter is split out into another query operator.  (It starts happening with 16 predicates)

OK, these 3 plans have been simple query plans.  In fact, you can do "set showplan_xml on" and run the query and learn something about how much optimization was done on them:

<ShowPlanXML xmlns="http://schemas.microsoft.com/sqlserver/2004/07/showplan" Version="1.0" Build="10.0.1075.23">
  <BatchSequence>
    <Batch>
      <Statements>
        <StmtSimple StatementText="select * from t1 where col2 in (1)" StatementId="1" StatementCompId="1" StatementType="SELECT" StatementSubTreeCost="2.48687" StatementEstRows="1" StatementOptmLevel="TRIVIAL" ParameterizedText="(@1 tinyint)SELECT * FROM [t1] WHERE [col2]=@1">

The end-to-end design of the optimizer is beyond the scope of this post, but there are various stages of optimization and queries can "quit early" if they find a "good enough" plan.  In this case, there aren't actually any real cost-based decisions to make about which plan to run, so these plans are called "trivial".

Once you start adding joins into the mix, there are join orders to consider, and those are cost-based.  You can see that the optimization level for plan 4 differs:
<StmtSimple StatementText="select * from t1 inner join t2 on t1.col1=t2.col1&#xD;&#xA;where t1.col2 in (&#xD;&#xA;1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,&#xD;&#xA;51,52,53,54,55,56,57,58,59,60,61,62,63,64)" StatementId="1" StatementCompId="1" StatementType="SELECT" StatementSubTreeCost="2.70289" StatementEstRows="1" StatementOptmLevel="FULL">

One would infer that FULL > TRIVIAL ;).

Plan 4:
  |--Hash Match(Inner Join, HASH:([in1].[dbo].[t2].[col1])=([in1].[dbo].[t1].[col1]))
       |--Table Scan(OBJECT:([in1].[dbo].[t2]))
       |--Filter(WHERE:([in1].[dbo].[t1].[col2]=(1) OR [in1].[dbo].[t1].[col2]=(2) OR [in1].[dbo].[t1].[col2]=(3) OR [in1].[dbo].[t1].[col2]=(4) OR [in1].[dbo].[t1].[col2]=(5) OR [in1].[dbo].[t1].[col2]=(6) OR [in1].[dbo].[t1].[col2]=(7) OR [in1].[dbo].[t1].[col2]=(8) OR [in1].[dbo].[t1].[col2]=(9) OR [in1].[dbo].[t1].[col2]=(10) OR [in1].[dbo].[t1].[col2]=(11) OR [in1].[dbo].[t1].[col2]=(12) OR [in1].[dbo].[t1].[col2]=(13) OR [in1].[dbo].[t1].[col2]=(14) OR [in1].[dbo].[t1].[col2]=(15) OR [in1].[dbo].[t1].[col2]=(16) OR [in1].[dbo].[t1].[col2]=(17) OR [in1].[dbo].[t1].[col2]=(18) OR [in1].[dbo].[t1].[col2]=(19) OR [in1].[dbo].[t1].[col2]=(20) OR [in1].[dbo].[t1].[col2]=(21) OR [in1].[dbo].[t1].[col2]=(22) OR [in1].[dbo].[t1].[col2]=(23) OR [in1].[dbo].[t1].[col2]=(24) OR [in1].[dbo].[t1].[col2]=(25) OR [in1].[dbo].[t1].[col2]=(26) OR [in1].[dbo].[t1].[col2]=(27) OR [in1].[dbo].[t1].[col2]=(28) OR [in1].[dbo].[t1].[col2]=(29) OR [in1].[dbo].[t1].[col2]=(30) OR [in1].[dbo].[t1].[col2]=(31) OR [in1].[dbo].[t1].[col2]=(32) OR [in1].[dbo].[t1].[col2]=(33) OR [in1].[dbo].[t1].[col2]=(34) OR [in1].[dbo].[t1].[col2]=(35) OR [in1].[dbo].[t1].[col2]=(36) OR [in1].[dbo].[t1].[col2]=(37) OR [in1].[dbo].[t1].[col2]=(38) OR [in1].[dbo].[t1].[col2]=(39) OR [in1].[dbo].[t1].[col2]=(40) OR [in1].[dbo].[t1].[col2]=(41) OR [in1].[dbo].[t1].[col2]=(42) OR [in1].[dbo].[t1].[col2]=(43) OR [in1].[dbo].[t1].[col2]=(44) OR [in1].[dbo].[t1].[col2]=(45) OR [in1].[dbo].[t1].[col2]=(46) OR [in1].[dbo].[t1].[col2]=(47) OR [in1].[dbo].[t1].[col2]=(48) OR [in1].[dbo].[t1].[col2]=(49) OR [in1].[dbo].[t1].[col2]=(50) OR [in1].[dbo].[t1].[col2]=(51) OR [in1].[dbo].[t1].[col2]=(52) OR [in1].[dbo].[t1].[col2]=(53) OR [in1].[dbo].[t1].[col2]=(54) OR [in1].[dbo].[t1].[col2]=(55) OR [in1].[dbo].[t1].[col2]=(56) OR [in1].[dbo].[t1].[col2]=(57) OR [in1].[dbo].[t1].[col2]=(58) OR [in1].[dbo].[t1].[col2]=(59) OR [in1].[dbo].[t1].[col2]=(60) OR [in1].[dbo].[t1].[col2]=(61) OR [in1].[dbo].[t1].[col2]=(62) OR [in1].[dbo].[t1].[col2]=(63) OR [in1].[dbo].[t1].[col2]=(64)))
            |--Table Scan(OBJECT:([in1].[dbo].[t1]))

ok, this looks like an extension of what we saw in plan 3 - a filter operator above a table scan, followed by a join.

Query 5 is only trivially different than 4 - 65 conditions instead of 64:

  |--Nested Loops(Left Semi Join, OUTER REFERENCES:([in1].[dbo].[t1].[col2]))
       |--Hash Match(Inner Join, HASH:([in1].[dbo].[t2].[col1])=([in1].[dbo].[t1].[col1]))
       |    |--Table Scan(OBJECT:([in1].[dbo].[t2]))
       |    |--Table Scan(OBJECT:([in1].[dbo].[t1]))
       |--Filter(WHERE:([in1].[dbo].[t1].[col2]=[Expr1008]))
            |--Constant Scan( VALUES:(((1)),((2)),((3)),((4)),((5)),((6)),((7)),((8)),((9)),((10)),((11)),((12)),((13)),((14)),((15)),((16)),((17)),((18)),((19)),((20)),((21)),((22)),((23)),((24)),((25)),((26)),((27)),((28)),((29)),((30)),((31)),((32)),((33)),((34)),((35)),((36)),((37)),((38)),((39)),((40)),((41)),((42)),((43)),((44)),((45)),((46)),((47)),((48)),((49)),((50)),((51)),((52)),((53)),((54)),((55)),((56)),((57)),((58)),((59)),((60)),((61)),((62)),((63)),((64)),((65))))

Hey, what's this?  A Semi Join?  Well, this is a join where rows from one side are joined with another table (in this case, an in-memory table) and only qualifying rows from the original table are kept.  To make this work, you can't return any columns from the "other side", and you have to make sure you don't have more than one match on the constant table.  So, you need to remove duplicates.  Try adding another '1' into the IN list and then look at the plan - it removes them.

Why are there different plans at these points?  Well, while I can't speak for the development team, as I'm not on it anymore, but I will say that often these are added as an attempt to deal with poor performance or some very common queries that are seen all the time (perhaps from a large ISV).  There's obviously not a ton of magic to why 16 and 64 are picked for the magic numbers, and I'm sure if we hunted we could find some cases where the transition between plans was not perfect.  Often changing the numbers, even if they aren't quite perfect, will just cause instability elsewhere in terms of plan quality so they are left to try to avoid pain to customers.  I'm hopeful that you can use this post to start to do the same kind of reasoning yourself.  While I have a pretty good idea of where the plans will change, I'm not revealing anything you can't see with some trial and error on your own.  I actually had forgotten the 64 number, as I think it changed at one point - so, I had to do figure it out again ;).

OK, this covers the basic forms of plans with INs that are available in SQL Server.  I plan to work on the plan cache next, including forced parameterization.

Thanks,

Conor
Tuesday, January 22, 2008 8:16:50 PM (Central Standard Time, UTC-06:00)  #    Comments [2]  | 
Saturday, January 19, 2008
I've already received a few comments and mails from people asking questions...

One comment I've received is about LINQ's IN operator(method?) and SQL Server plan cache pollution.  I'm researching the issue a bit before I post a complete answer, but I'll start with some background information on memory management in SQL Server.  I co-authored a paper on a new memory management algorithm that I helped develop that went into SQL Server 2005.  I don't think it has changed much in SQL Server 2008 from what I've seen of the November CTP at least.

The interesting thing about memory in SQL Server is that, unlike some of the other vendors in the database space, SQL Server runs all of its users within the same server process.  (This derives from UNIX traditionally using fork() to create another server process to service each user).  Windows has better thread support, and as a result it was possible to make SQL Server run all of the user activities within a single process.  This has advantages because you can avoid inter-process communication for a lot of things, like a plan cache.  You can also "be smart" on things like which activity to schedule next in ways that an Operating System will not do.  Often this can lead to large performance gains.

One downside of this single-process approach is that, at least on 32-bit machines, virtual memory space becomes a limited resource, especially on larger deployments.  Every component has to fit into about 1.4GB of space (with the remainder taken up by stacks for threads, address space for mapping DLLs into the process, etc).  The limitation does not exist on 64-bit machines, which is why I recommend that all new serious SQL Server deployments closely consider their architecture when purchasing new machines.

This paper was accepted at the Conference on Innovative Data Systems Research (CIDR), of which Jim Gray was one of the founding organizers.  It has some details in the kinds of problems that were solved and the performance gains that were achieved by more actively managing the memory consumption and consumption rates of each internal subcomponent.  In a sense, this is how one could build a special-purpose memory management system for a machine that only ran a database.

Memory-CIDR2007Accepted2.doc (400 KB)

I will start working on some posts on some of the interesting things that happen in the query processor to handle IN clauses (in SQL, not LINQ).  Then I'll do some work with LINQ's IN equivalent and see what interesting things I can post. 

Thanks,

Conor Cunningham
Saturday, January 19, 2008 11:28:32 AM (Central Standard Time, UTC-06:00)  #    Comments [0]  | 
Wednesday, January 16, 2008
Here is the introductory post to my new blog on SQL, data-driven applications, performance, scaling, and anything else I feel like posting.  For those of you who do not know me, Paul Randal (http://www.sqlskills.com/blogs/paul/) and I worked together on Microsoft SQL Server for a number of years.  While Paul worked in the Storage Engine, I worked on the Query Processor.  Like Paul, I left Microsoft about a year ago (all good things must come to an end...).  Paul and Kimberly have been gracious enough to host me on their site so I can help answer questions about SQL Server, teach people a bit about what I know, and force me to learn a few things as well.

While at Microsoft, I was a development lead on the Query Optimizer and had architectural dutues for much of the query pipeline.

I actually blogged in my official capacity at Microsoft.  You can read about  my previous posts  here (http://blogs.msdn.com/queryoptteam/) and here (http://blogs.msdn.com/sqlqueryprocessing/default.aspx).

I'll be posting some content about the new SQL Server 2008 features I find interesting over the coming weeks.  I'm also happy to answer general questions about the Query Processor or SQL Server in general.

So, welcome!

Sincerely,
Conor Cunningham

Wednesday, January 16, 2008 8:12:42 PM (Central Standard Time, UTC-06:00)  #    Comments [2]  | 

Theme design by Jelle Druyts

Pick a theme: