http://technet.microsoft.com/en-us/magazine/cc434693.aspx

It covers some of the new features in SQL 2008 w.r.t. Data Warehousing.

Conor Cunningham

Categories:
SQL Server 2008


I spent some time playing with the SQL 2008 geometry data type.  This post will cover some of the things I have learned about how it is implemented from playing with the feature.  Some of these will impact the feature's usability for some.  Others may not care.

1. late-binding of the geometry data is an interesting choice.  So one of the nice properties of a SQL query is that, for many things, the parse and bind of a query can find all sorts of compilation errors that would show up during development.  You don't have to run the query in order to see the simple mistakes, and that helps us all.  This type doesn't throw parsing errors until you try to execute the query.  That's not perfect - you have to go execute spatial queries to see whether you have typed in lots of complex data correctly or not.  I can speak from experience that I tend to mess that stuff up.

Example:

create table g1 (col1 int identity, col2 geometry)
go
insert into g1(col2) values (Geometry::STGeomFromText('LINESTRING(0 0, 10 10, 21 2)',0));
go
insert into g1(col2) values (Geometry::STGeomFromText('LINESTRING(funky chicken)',0));
go
That third query compiles just fine - running it returns an error.

2. The geography type uses some form of the .net framework.  (I'll point out that I have not enabled the CLR in my server).  Given that the ordpath type requires that the CLR be enabled, I'll guess that this means that there is a special implementation of the geometry type.

Here's the runtime errof from the previous query:

Msg 6522, Level 16, State 1, Line 1
A .NET Framework error occurred during execution of user-defined routine or aggregate "geometry":
System.FormatException: 24141: A number is expected at position 16 of the input. The input has funky.
System.FormatException:
   at Microsoft.SqlServer.Types.OpenGisWktReader.RecognizeDouble()
   at Microsoft.SqlServer.Types.OpenGisWktReader.ParsePoint()
   at Microsoft.SqlServer.Types.OpenGisWktReader.ParseLineStringText(FigureAttributes attributes)
   at Microsoft.SqlServer.Types.OpenGisWktReader.ParseLineStringTaggedText()
   at Microsoft.SqlServer.Types.OpenGisWktReader.ParseGeometryTaggedText()
   at Microsoft.SqlServer.Types.OpenGisWktReader.ReadGeometry()
   at Microsoft.SqlServer.Types.SqlGeometry.STGeomFromText(SqlChars geometryTaggedText, Int32 srid)
.
The statement has been terminated.

(I think I will be making t-shirts that say "The input has funky" on it... let me know if you want in :).

I am actually pretty happy with the error message - the only thing missing is which column caused the error.  You'll notice that "position 16" is inside of this second string being parsed by the .NET routines.  I personally do find stack traces unacceptable for general use, but the .net and java folks seem to have doubled down on this approach, so we all get to learn about their libraries even in SQL now...

This is just a seam in how it was implemented.  Be aware of this when using the type.

3. the geometry type has been (arbitrarily) given an estimated row width of about 4000 bytes.  well, this is what shows up in showplan.  So I inserted a bunch of data into this table (1.4 million rows in about 148MB).  The cost estimates don't seem to use the 4000 byte number for the row width (as it would take a long time to read 5.6GB - more than 13 seconds or so, given that is what the costing says, so I am guessing that this is just reported in showplan the same way it is costed internally).  I'm just using a heap (no indexes) for this experiment.

4. I'm waiting for maps.live.com  to make it easy for me to geocode addresses and insert them into SQL.  It has a feature to send it to some gps device, but I haven't played with it beyond learning that it didn't detect that I had blocked cookies by default ;).

5. "select col2.ToString() from g1" is a way to see something useful from that data type.  otherwise you get back binary goo...

That's all for tonight.

Thanks,

Conor Cunningham





Categories:
SQL Server 2008

I started off tonight playing with the new page compression feature.  So far I like it.  I haven't found something yet about which I want to post (which is code for "I'm still looking for the seams ;)"), but I have some other things you can try to learn a few things about how the SQL Server QP makes assumptions about various kinds of predicates in queries during cardinality estimations.

So, you may or may not know much about how the SQL Server QP figures out what plans to run.  For the uninitiated, it almost seems like some form of magic.  In some ways, it is - it's very powerful and poorly understood by many, and it usually requires very little effort by someone skilled in the area to make something amazing happen.  SQL Server merely needs to make itself sparkle when I fix a query plan and I'm set for life :).

In reality, SQL Server uses a cost-based optimizer, which means that it keeps track of all sorts of interesting statistical information, row counts, page counts, etc.  It uses all of these in formulas to come up with numbers for each plan fragment and then it weighs the relative costs of all of these to pick a plan that has the "least cost".  That sounds nice and absolute until you get to go actually try to make that work, and then you are left with all sorts of nasty questions like:

* What should the cost formulas be?
* Do the numbers need to differ based on the customer's hardware? How do we calibrate all of this stuff?  What do we do as machines get faster?
* How do I estimate how many rows are going to come back from one predicate in my WHERE clause or join in time less than running the query to figure it out?
* Same question when I have a bunch of preciates?

Eventually, the QP has to make a set of assumptions so that they can come up with a plan in a reasonable amount of time, both because customers don't like things to ever take time and because managers don't like customers to tell them about how much time something should take..  One assumption might be that, data is uniformly distributed over a data type's possible values when you don't have any better information.  This can help make it possible to come up with solutions that work well most of the time.  The problem is that estimates can be wrong, and that can cause the QP to come up with a different plan than had it had correct information to use when selecting the plan.

So, I'll show you an example here.  To be clear, I'm not saying that this is something that is "broken".  This just exposes a place where 2 different assumptions rub up against each other in a way that will SEEM odd to the outside observer.  When you consider the average customer use cases, these assumptions are not bad and work very well the vast majority of the time...

To the example:
drop table comp1
create table comp1(col1 int identity, col3 nvarchar(3000))

declare @i int 
set @i=0
while @i < 70000
begin
insert into comp1 (col3) values (N'123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890')
set @i=@i+1
end
So I create a table with the same long string in it 70,000 times. 

Then I run a query with a where clause just to get some statistics created:
select  * from comp1 where col3 like '%4567890%'
dbcc show_statistics ('comp1', col3)

Once we have all of this stuff, we can look at the estimates for two very similar queries:

select  * from comp1 where col3 like '123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890%'

select  * from comp1 where col3 like '%123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890%'
(one is LIKE 'abc%'.  The other is LIKE '%abc% where abc is the value we have inserted 70,000 times).

So, both queries will return 70,000 rows. 

Well, the abc% pattern query estimates 70,000 rows (good!).  The second query estimates 69.6985 rows.  that's a bit under ;).

Let's talk about this a bit more so you can understand why.  In the first query, there is an exact string match against a column represented in the statistics histogram.  So, the likely outcome is to take that cardinality count to determine the number of rows that will likely be returned from the query.  In this case, we expect all rows to come back.

In the second one, there is no mechanism to estimate the cardinality of ANY string of this size (SQL Server does have a feature that does smaller strings which is exposed as the "STRING INDEX" in the histogram, but you can't see the details of this object in 2005 and I haven't seen that change in 2008 either).  So, for really large strings, it is left with... guessing.

So, that 69.6985 number is an estimate that is partially based on the length of the string.  Now, the QP could try to walk through each statitistics object and try to find substrings against any existing piece of statistical data and then try to adjust its estimate.  In practice, though, the cost of doing that is expensive.  The various statistics objects are run at different times and have different sample rates, so even then they will vary somewhat.  Finally, for most cases it may just not impact the plan choice that much.  Odds are, though, that this will bite at least one of my readers at some point.  So, this is good to know - it can help you find that spot where the assumption in the QP is causing your query plan to be wrong.  This is the sort of case where you will want to consider a query hint to help the QP out.

There are more assumptions (and seams between them) in the cardinality estimation code.  I'll let you guys go hunt for them a bit to find them.

Happy Querying!
Conor Cunningham




Categories:
SQL Server 2008

You may recall my previous posts on my trouble with SQL 2008 CTP6.  I've made some progress on fixing my machine that I thought I'd share with you.  I now get past the following error:

D:\temp\sqlctp6\servers>setup
The following error occurred:
MsiGetProductInfo failed to retrieve ProductVersion for package with Product Code = '{00E75F61-A126-4CE1-90B8-42295052F1AC}'. Error code: 1605.

I useded the SysInternals (err, Microsoft) Process Monitor Tool and watched for keys found/not found during the failed install.  This found a few keys in HKCR\Installer\UpgradeCodes that were being found early in the setup100.exe process.

(Fair notice - modifying the registry on your computer is your problem, not mine :).

So take the key:
00E75F61

Reverse it.  I think I had 615fe700, but it was late and I was tired.  It might have been 16f57e00.  Anyways, you will see some keys under HKCR\Installer\UpgradeCodes.  There are actually 2-3 places in the registry searched for each key  I've been killing all three of them for each key - there are about 7-10 keys.  The registry section looks like this:



The hex digits you have in the error will correspond to the right hand side of this picture.  The key I've been deleting is it's parent, which is the key being opened in the "key found"/"key not found" stuff in the process monitor log.

Here's what the log looked like for me:

136791  11:53:49.4922341 PM      setup100.exe    2812       CloseFile              D:\temp\sqlctp6\tools\setup\sqlrun_bids.msi                SUCCESS             

136793  11:53:49.4923712 PM      setup100.exe    2812       WriteFile              C:\Program Files (x86)\Microsoft SQL Server\100\Setup Bootstrap\Log\20080313_2353\WOPR_20080313_2353_Detail_ComponentUpdateSetup.txt SUCCESS                Offset: 7,591, Length: 62

136795  11:53:49.4956628 PM      setup100.exe    2812       RegOpenKey                HKLM\Software\Microsoft\Windows\CurrentVersion\Installer\Managed\S-1-5-21-2888934283-224128331-3030229123-1000\Installer\UpgradeCodes\87674BD65E9A5D1409951D671E37BDA4          NAME NOT FOUND         Desired Access: Read

136796  11:53:49.4957796 PM      setup100.exe    2812       RegOpenKey                HKCU\Software\Microsoft\Installer\UpgradeCodes\87674BD65E9A5D1409951D671E37BDA4      NAME NOT FOUND                Desired Access: Read

136797  11:53:49.4958199 PM      setup100.exe    2812       RegOpenKey                HKCR\Installer\UpgradeCodes\87674BD65E9A5D1409951D671E37BDA4 SUCCESS              Desired Access: Read

136798  11:53:49.4958553 PM      setup100.exe    2812       RegEnumValue                HKCR\Installer\UpgradeCodes\87674BD65E9A5D1409951D671E37BDA4 SUCCESS              Index: 0, Name: 16F57E00621A1EC4098B249205251FCA, Type: REG_SZ, Length: 2, Data:

136799  11:53:49.4958960 PM      setup100.exe    2812       RegOpenKey                HKLM\Software\Microsoft\Windows\CurrentVersion\Installer\Managed\S-1-5-21-2888934283-224128331-3030229123-1000\Installer\UpgradeCodes\87674BD65E9A5D1409951D671E37BDA4          NAME NOT FOUND         Desired Access: Read

136800  11:53:49.4959398 PM      setup100.exe    2812       RegOpenKey                HKCU\Software\Microsoft\Installer\UpgradeCodes\87674BD65E9A5D1409951D671E37BDA4      NAME NOT FOUND                Desired Access: Read

136801  11:53:49.4959604 PM      setup100.exe    2812       RegCloseKey                HKCR\Installer\UpgradeCodes\87674BD65E9A5D1409951D671E37BDA4 SUCCESS             

136802  11:53:49.5146648 PM      setup100.exe    2812       RegOpenKey                HKLM\Software\Microsoft\Windows\CurrentVersion\Installer\Managed\S-1-5-21-2888934283-224128331-3030229123-1000\Installer\Products\16F57E00621A1EC4098B249205251FCA        NAME NOT FOUND         Desired Access: Read

Now the installer gets past this and tries to install the engine and then fails, but I will call this progress ;).

Notice - I had deleted all of my physical files for SQL Server from the machine, so killing the registry keys seemed like a reasonable next step.  I can't promise you it's a good idea since I don't have things working yet, but I hope this helps the many of you who mailed me and found me via search engines.

Thanks,

Conor Cunningham

Categories:
SQL Server 2008

So as you proceed up the river into the jungle, searching for answers about how the query optimizer works, I'll ask you one question: Did you know that there's actually a lot of stuff that the Optimizer team just tells you?  It's in the product, and I'm constantly suprised by how little attention they get.  You can learn all sorts of things by looking at the data.  Now, not all of it is documented, but that's usually just because one needs leeway to change the internals rather than some deep, dark secret that needs to be kept.

Now that you're hooked ;), I'll tell you a bit about an optimizer DMV called sys.dm_exec_query_optimizer_info.  It tells you all sorts of things about how your querys are optimized.  It's a bag of counters for all sorts of things that, with a trained eye, can give you lots of insight into what is happening.

This guy is actually documented, at least partially, on MSDN. 




(That's from my SQL 2008 install, btw).

Some of these fields are "undocumented".  search 0, 1, and 2 are in that category.  I won't talk about them except to say that the names aren't really obfuscated too much. 

To learn about a particular query, you find a nice, quiet server and:

1. select * from this table, store the results somewhere
2. optimize a query of interest,
3. select from this table again, then compare the current totals to the originals.

I think that this is one use of this DMV - trying to figure out why a query takes a long time to optimize.

The other use of the DMV is to get a good statistical picture of a running system.  Say that I'm a DBA and I want to know how many queries in my application have hints or _need_ hints to work well.  Well, this will tell you.  Granted, it doesn't separate recompiles from compiles, and if you have a system where plans are getting kicked out of the cache things may be a bit skewed, but I can tell you that this is far better than simply guessing.  Often the DB application developer doesn't realize that they've built an application that requires a lot of hinting or a lot of compilations, and you can see this in more detail than you get with the performance counters.

I've already talked about "trivial plans", which are not documented in this DMV but are widely known in the other outputs of the system.  I'll let you guys guess about the search 0, 1, and 2 stuff - if you can back up your guess with a public post, book, or other form of comment I'll confirm if you get it right.

Have a great weekend, ya'll.

Conor Cunningham

Categories:
SQL Server 2008

So I'll be posting explanations for each query operator in the output query trees that you can see through "set showplan_text on" in SQL Server.  I'm hopeful that this will give you the tools to better read the query plans being generated by the QP when evaluating your system.

So I'll start with one that I added into the optimizer - the "sequence project".

create table z(col1 int identity, col2 int default rand()*10000)

declare @i int 
set @i=0
while @i < 1000
begin
insert into z default values
set @i=@i+1
end

select RANK() over(order by col2), * from z



If you look at the graphical showplan you get an output plan that looks a bit funny.  A Sequence Project is a special form of projection (the thing that evaluates the scalars in the SELECT list for each row).  By default, a regular project in SQL Server does not preserve nor guarantee any sort properties in the physical plan.  As a result, ranking functions put in regular projects would not preserve order in all cases deterministically.  People seem to complain when the query results are incorrect.

Enter the sequence project....This little beauty does do scalar computations, but they are generally limited to a very small number of operations that require order for one reason or another.  In the case of ranking functions, an order is required as part of the syntax, and you will see that this particular plan needs a sort since there are no indexes on the base table to satisfy a sort request on col2.  (I'll talk about segments in a future post, but for now ignore them and assume that they are part of the sequence project when looking at this plan).

The problem with sequence projects are that while they preserve ordering semantics, they interfere with a lot of the other optimization logic in the QP.  For example, these can interfere with index matching for predicates above the sequence project in a query tree.  Operators that block other optimizations tend to make optimizer developers very unhappy and sometimes elicit violent reactions!  So, be careful when proposing "opaque" operators to the next query optimization developer you see ;).

So, SQL Server has a small number of operations that in previous releases of the product were "incorrectly" put into regular projects that are now in sequence projects in the current versions of the product.  (By "incorrectly", I mean that there were potential optimizations that could be run that would break their semantics, not that there were actively executing plans at a customer site that were incorrect.  In optimizer terminology, they model was not correctly preserving the semantics even though there were no rules that exposed this fault).

So, the challenge for you is think of some places _beyond sequence functions_ (RANK, DENSE_RANK, NTILE, ROW_NUMBER, ...) that use the sequence projection operator.  If you mail them to me with plans, I may be able to tell you more about what is happening for each of them.

Happy hunting!

Conor Cunningham

Categories:
SQL Server 2008

I've been trying to think of a good way to introduce how SQL Server does subqueries based on a question from a reader.  One of the tricky pieces to query optimization is that the "language" used internally in the optimizer is actually somewhat more expressive than SQL.  The second part of the problem is that, for people who have used other products at one time or another, it's a challenge to get them to re-learn something to grok how SQL Server does it (and even to figure out what they need to re-learn).

So I'm going to do a series of posts on subqueries and the related pieces behind subqueries to put them in context.

Let's start at the beginning.  A subquery is a fancy filter condition where you can specify an query to run for each row of the outer query.  This doesn't mean that's how the optimizer will actually execute it, but you can start thinking about it like that:

SELECT * FROM MyTable WHERE productkey IN (SELECT * FROM Orders where MyTable.productkey=Orders.productkey)


One way to execute this query is to read each row from MyTable, get productkey, start up another query to look through orders, and determine if we want to keep the row from MyTable based on the result of that second query.  This is how some of the other QP's on which I have worked attack subqueries and is perfectly valid.

In the internal lanaguage of the optimizer (and you can read this in output plans), you will typically see that this is executed physically in SQL Server using a "semi-join", which you should imagine as a join operator that does what we've described within one single query instead of running an extra query per row.  A semi-join will return rows from only one side of the join (left or right semi join).  These semijoins are nice because you can actually re-use your physical join implementation to execute them.  They are also nice because you can also leverage your tree rewrite logic and cardinality estimation logic to help you think about these like joins as well.  So, you can reorder and rewrite the query into alternative, equivalent forms that might run more quickly.

The nice part about this approach is that you get the benefit of a lot of work and experience estimating cardinality for joins, which helps the SQL Server Optimizer make decisions about these rewrites.  The bad part about this is that in queries with many joins or queries with lots and lots of data, your odds of getting a mistake in join cardinality/cost estimation go up and up, and your opportunity to get a plan that is far from optimal also go up.

My next post will cover some of these things in more detail - off to get my kid up.

Conor

Categories:
SQL Server 2008

This post builds upon my previous post on some of the various kinds of plans for the IN operator.  I will cover some of the query plan cache behavior associated with IN plans.

For this post we'll be using the following query which will let you look at the text of plans in the cache and also see how many plans are in the cache.  (One can use "dbcc freeproccache" to clean out the cache whenever you are testing this - please don't do this on a production system unless you know what you are doing)

(Hey, I figured out how to post code properly in the blog.  and there was much rejoicing! yeah!)
SELECT text, plan_handle, * FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(sql_handle)
CROSS APPLY sys.dm_exec_text_query_plan(qs.plan_handle, qs.statement_start_offset, qs.statement_end_offset) AS qp

OK, the two things of interest are the number of rows returned and the text of each plan.  I will note that the data in these sys.dm_exec* functions is "mostly up to date", but it doesn't follow an exact transactional semantic - so, you may need to wait a second or two to get the "proper" state of the plan cache after each step.

If you run dbcc freeproccache, then this query, you'll see one entry in the result (for the query you ran to get the status).  If we run query 1 from our previous post (select * from t1 where col2 in (1)) and look at the result, we get something interesting:
(@1 tinyint)SELECT * FROM [t1] WHERE [col2]=@1

Hey, what's this?  Well, this is called "auto-parameterization" in SQL Server, and it happens on Trivial Plans (see previous post).  The idea is that SQL Server has a fancy query processor, but fancy means powerful for big queries and perhaps not the fastest on the small stuff.  In general, we're not saying that it's slow, but if you are compiling thousands of these things right in a row, you may notice a difference.  So, this autoparameterization feature is an attempt to reduce compilation time by finding plans that have no cost-based decisions and then templatize-ing them so that all other queries that look the same can re-use the plan.  This has a huge advantage for customers with query-generation components, as they often will generate many, many queries that only differ by the constants.  So, SQL Server uses this on trivial plans.

If you run the query "select * from t1 where col2 in (2)" (so, the constant is changed but nothing else), then run our big fancy cross apply query, then you see 2 query plans instead of 3.  Both of our new queries use the following plan:

(@1 tinyint)SELECT * FROM [t1] WHERE [col2]=@1

Now, running queries with more than one element in the IN list generates a plan for each of them.  (I don't recall exactly why off-hand, but I'll think about it some more)

So, you can see the plan entries for a range of variations.

select  * from t1 where col2 in (2,1)
select * from t1 where col2 in (1,2,3,4,5,6,7,8,9,10)
(@1 tinyint)SELECT * FROM [t1] WHERE [col2]=@1
select * from t1 where col2 in (2,1)
select * from t1 where col2 in (1,2)
select * from t1 where col2 =1 or col2=2

I'll point out that, except for trivial plans that are auto-parameterized, the plan text is used as the "key" into this cache.  It does not logically compare the output plans or even a normalized form of the input.  So, you'll notice that "
select  * from t1 where col2 in (2,1)" is in the cache twice because I put a space in the second time I ran that query.  So, you need to consider this when we get to the original question about plan cache pollution.  If your query generation engine (LINQ or whatever) adds comments into the query text sent to SQL Server, that could very well lead to more plans in the cache.

The more complex queries from my last post also generate unique plan entries in the same way, so I'll skip those here.

The next interesting factor to consider is forced parameterization.  This was added in SQL Server 2005, and it is both powerful and dangerous.  This feature is a database level feature, most literals are converted to paramters before query optimization in most query plans.  This is covered under the "Forced Parameterization" topic in SQL Server books online, so I'll suggest you read that if you want all the details.  I did a post or two on plan sensitivity to different parameter values, so beware that in some cases you may end up with query plans that are suboptimal and are stuck in the cache.  So, you should experiment with this setting to make sure that it really does benefit your application before you turn it on.

As I like to live dangerously, I will ignore all of these warnings and proceed with great vigor in the name of Science!

Let's try three queries that yielded separate plans without forced parameterization enabled:
select * from t1 where col2 in (2,1)
select * from t1 where col2 in (1,2)
select  * from t1 where col2 in (2,1)
Now we get one entry in the cache, and it looks similar to those auto-parameterized queries:
(@0 int,@1 int)select * from t1 where col2 in ( @0 , @1 )

Notice that even spaces are "fixed" in terms of cache hits in this mode.  The query is parsed and then the text is generated from the parsed tree, which gives one the ability to normalize the input SQL for the plan cache lookup.

Adding a 3-element IN list query generates a new plan in the cache:
select * from t1 where col2 in (1,2,3)
(@0 int,@1 int,@2 int)select * from t1 where col2 in ( @0 , @1 , @2 )
(@0 int,@1 int)select * from t1 where col2 in ( @0 , @1 )

OK, so we can get a plan per number of elements, which perhaps has its advantages in some cases.  Good to know.

I also tried our large cardinality IN list example (the one that generated the semi-join in my previous post):
(@0 int,@1 int,@2 int,@3 int,@4 int,@5 int,@6 int,@7 int,@8 int,@9 int,@10 int,@11 int,@12 int,@13 int,@14 int,@15 int,@16 int,@17 int,@18 int,@19 int,@20 int,@21 int,@22 int,@23 int,@24 int,@25 int,@26 int,@27 int,@28 int,@29 int,@30 int,@31 int,@32 int,@33 int,@34 int,@35 int,@36 int,@37 int,@38 int,@39 int,@40 int,@41 int,@42 int,@43 int,@44 int,@45 int,@46 int,@47 int,@48 int,@49 int,@50 int,@51 int,@52 int,@53 int,@54 int,@55 int,@56 int,@57 int,@58 int,@59 int,@60 int,@61 int,@62 int,@63 int,@64 int,@65 int)select * from t1 inner join t2 on t1 . col1 = t2 . col1 where t1 . col2 in ( @0 , @1 , @2 , @3 , @4 , @5 , @6 , @7 , @8 , @9 , @10 , @11 , @12 , @13 , @14 , @15 , @16 , @17 , @18 , @19 , @20 , @21 , @22 , @23 , @24 , @25 , @26 , @27 , @28 , @29 , @30 , @31 , @32 , @33 , @34 , @35 , @36 , @37 , @38 , @39 , @40 , @41 , @42 , @43 , @44 , @45 , @46 , @47 , @48 , @49 , @50 , @51 , @52 , @53 , @54 , @55 , @56 , @57 , @58 , @59 , @60 , @61 , @62 , @63 , @64 , @65 )


Yes, it seems to keep on going.  For once, I read the documentation and determine that one of the disqualifying conditions is:

Statements that contain more than 2,097 literals that are eligible for parameterization.

So, it will stop at some point based on how many literals you have in your query.

So while we're not up to LINQ yet, I think we have a pretty good basis to evaluate whether LINQ or any other query generator is going to cause plan cache issues.  I'll try to do some research on the shipping version of LINQ and see what I can determine by examining the plans that get generated.

That's enough for one evening from me.

Thanks,

Conor

Categories:
SQL Server 2008

I've already received a few comments and mails from people asking questions...

One comment I've received is about LINQ's IN operator(method?) and SQL Server plan cache pollution.  I'm researching the issue a bit before I post a complete answer, but I'll start with some background information on memory management in SQL Server.  I co-authored a paper on a new memory management algorithm that I helped develop that went into SQL Server 2005.  I don't think it has changed much in SQL Server 2008 from what I've seen of the November CTP at least.

The interesting thing about memory in SQL Server is that, unlike some of the other vendors in the database space, SQL Server runs all of its users within the same server process.  (This derives from UNIX traditionally using fork() to create another server process to service each user).  Windows has better thread support, and as a result it was possible to make SQL Server run all of the user activities within a single process.  This has advantages because you can avoid inter-process communication for a lot of things, like a plan cache.  You can also "be smart" on things like which activity to schedule next in ways that an Operating System will not do.  Often this can lead to large performance gains.

One downside of this single-process approach is that, at least on 32-bit machines, virtual memory space becomes a limited resource, especially on larger deployments.  Every component has to fit into about 1.4GB of space (with the remainder taken up by stacks for threads, address space for mapping DLLs into the process, etc).  The limitation does not exist on 64-bit machines, which is why I recommend that all new serious SQL Server deployments closely consider their architecture when purchasing new machines.

This paper was accepted at the Conference on Innovative Data Systems Research (CIDR), of which Jim Gray was one of the founding organizers.  It has some details in the kinds of problems that were solved and the performance gains that were achieved by more actively managing the memory consumption and consumption rates of each internal subcomponent.  In a sense, this is how one could build a special-purpose memory management system for a machine that only ran a database.

Memory-CIDR2007Accepted2.doc (400 KB)

I will start working on some posts on some of the interesting things that happen in the query processor to handle IN clauses (in SQL, not LINQ).  Then I'll do some work with LINQ's IN equivalent and see what interesting things I can post. 

Thanks,

Conor Cunningham

Categories:
SQL Server 2008

Here is the introductory post to my new blog on SQL, data-driven applications, performance, scaling, and anything else I feel like posting.  For those of you who do not know me, Paul Randal (http://www.sqlskills.com/blogs/paul/) and I worked together on Microsoft SQL Server for a number of years.  While Paul worked in the Storage Engine, I worked on the Query Processor.  Like Paul, I left Microsoft about a year ago (all good things must come to an end...).  Paul and Kimberly have been gracious enough to host me on their site so I can help answer questions about SQL Server, teach people a bit about what I know, and force me to learn a few things as well.

While at Microsoft, I was a development lead on the Query Optimizer and had architectural dutues for much of the query pipeline.

I actually blogged in my official capacity at Microsoft.  You can read about  my previous posts  here (http://blogs.msdn.com/queryoptteam/) and here (http://blogs.msdn.com/sqlqueryprocessing/default.aspx).

I'll be posting some content about the new SQL Server 2008 features I find interesting over the coming weeks.  I'm also happy to answer general questions about the Query Processor or SQL Server in general.

So, welcome!

Sincerely,
Conor Cunningham

Categories:
SQL Server 2008

Theme design by Nukeation based on Jelle Druyts