Wednesday, April 23, 2008

I spent some time playing with the SQL 2008 geometry data type.  This post will cover some of the things I have learned about how it is implemented from playing with the feature.  Some of these will impact the feature's usability for some.  Others may not care.

1. late-binding of the geometry data is an interesting choice.  So one of the nice properties of a SQL query is that, for many things, the parse and bind of a query can find all sorts of compilation errors that would show up during development.  You don't have to run the query in order to see the simple mistakes, and that helps us all.  This type doesn't throw parsing errors until you try to execute the query.  That's not perfect - you have to go execute spatial queries to see whether you have typed in lots of complex data correctly or not.  I can speak from experience that I tend to mess that stuff up.

Example:

create table g1 (col1 int identity, col2 geometry)
go
insert into g1(col2) values (Geometry::STGeomFromText('LINESTRING(0 0, 10 10, 21 2)',0));
go
insert into g1(col2) values (Geometry::STGeomFromText('LINESTRING(funky chicken)',0));
go
That third query compiles just fine - running it returns an error.

2. The geography type uses some form of the .net framework.  (I'll point out that I have not enabled the CLR in my server).  Given that the ordpath type requires that the CLR be enabled, I'll guess that this means that there is a special implementation of the geometry type.

Here's the runtime errof from the previous query:

Msg 6522, Level 16, State 1, Line 1
A .NET Framework error occurred during execution of user-defined routine or aggregate "geometry":
System.FormatException: 24141: A number is expected at position 16 of the input. The input has funky.
System.FormatException:
   at Microsoft.SqlServer.Types.OpenGisWktReader.RecognizeDouble()
   at Microsoft.SqlServer.Types.OpenGisWktReader.ParsePoint()
   at Microsoft.SqlServer.Types.OpenGisWktReader.ParseLineStringText(FigureAttributes attributes)
   at Microsoft.SqlServer.Types.OpenGisWktReader.ParseLineStringTaggedText()
   at Microsoft.SqlServer.Types.OpenGisWktReader.ParseGeometryTaggedText()
   at Microsoft.SqlServer.Types.OpenGisWktReader.ReadGeometry()
   at Microsoft.SqlServer.Types.SqlGeometry.STGeomFromText(SqlChars geometryTaggedText, Int32 srid)
.
The statement has been terminated.

(I think I will be making t-shirts that say "The input has funky" on it... let me know if you want in :).

I am actually pretty happy with the error message - the only thing missing is which column caused the error.  You'll notice that "position 16" is inside of this second string being parsed by the .NET routines.  I personally do find stack traces unacceptable for general use, but the .net and java folks seem to have doubled down on this approach, so we all get to learn about their libraries even in SQL now...

This is just a seam in how it was implemented.  Be aware of this when using the type.

3. the geometry type has been (arbitrarily) given an estimated row width of about 4000 bytes.  well, this is what shows up in showplan.  So I inserted a bunch of data into this table (1.4 million rows in about 148MB).  The cost estimates don't seem to use the 4000 byte number for the row width (as it would take a long time to read 5.6GB - more than 13 seconds or so, given that is what the costing says, so I am guessing that this is just reported in showplan the same way it is costed internally).  I'm just using a heap (no indexes) for this experiment.

4. I'm waiting for maps.live.com  to make it easy for me to geocode addresses and insert them into SQL.  It has a feature to send it to some gps device, but I haven't played with it beyond learning that it didn't detect that I had blocked cookies by default ;).

5. "select col2.ToString() from g1" is a way to see something useful from that data type.  otherwise you get back binary goo...

That's all for tonight.

Thanks,

Conor Cunningham





Wednesday, April 23, 2008 7:58:53 PM (Central Standard Time, UTC-06:00)  #    Comments [0]  | 
Wednesday, April 16, 2008
I started off tonight playing with the new page compression feature.  So far I like it.  I haven't found something yet about which I want to post (which is code for "I'm still looking for the seams ;)"), but I have some other things you can try to learn a few things about how the SQL Server QP makes assumptions about various kinds of predicates in queries during cardinality estimations.

So, you may or may not know much about how the SQL Server QP figures out what plans to run.  For the uninitiated, it almost seems like some form of magic.  In some ways, it is - it's very powerful and poorly understood by many, and it usually requires very little effort by someone skilled in the area to make something amazing happen.  SQL Server merely needs to make itself sparkle when I fix a query plan and I'm set for life :).

In reality, SQL Server uses a cost-based optimizer, which means that it keeps track of all sorts of interesting statistical information, row counts, page counts, etc.  It uses all of these in formulas to come up with numbers for each plan fragment and then it weighs the relative costs of all of these to pick a plan that has the "least cost".  That sounds nice and absolute until you get to go actually try to make that work, and then you are left with all sorts of nasty questions like:

* What should the cost formulas be?
* Do the numbers need to differ based on the customer's hardware? How do we calibrate all of this stuff?  What do we do as machines get faster?
* How do I estimate how many rows are going to come back from one predicate in my WHERE clause or join in time less than running the query to figure it out?
* Same question when I have a bunch of preciates?

Eventually, the QP has to make a set of assumptions so that they can come up with a plan in a reasonable amount of time, both because customers don't like things to ever take time and because managers don't like customers to tell them about how much time something should take..  One assumption might be that, data is uniformly distributed over a data type's possible values when you don't have any better information.  This can help make it possible to come up with solutions that work well most of the time.  The problem is that estimates can be wrong, and that can cause the QP to come up with a different plan than had it had correct information to use when selecting the plan.

So, I'll show you an example here.  To be clear, I'm not saying that this is something that is "broken".  This just exposes a place where 2 different assumptions rub up against each other in a way that will SEEM odd to the outside observer.  When you consider the average customer use cases, these assumptions are not bad and work very well the vast majority of the time...

To the example:
drop table comp1
create table comp1(col1 int identity, col3 nvarchar(3000))

declare @i int 
set @i=0
while @i < 70000
begin
insert into comp1 (col3) values (N'123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890')
set @i=@i+1
end
So I create a table with the same long string in it 70,000 times. 

Then I run a query with a where clause just to get some statistics created:
select  * from comp1 where col3 like '%4567890%'
dbcc show_statistics ('comp1', col3)

Once we have all of this stuff, we can look at the estimates for two very similar queries:

select  * from comp1 where col3 like '123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890%'

select  * from comp1 where col3 like '%123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890%'
(one is LIKE 'abc%'.  The other is LIKE '%abc% where abc is the value we have inserted 70,000 times).

So, both queries will return 70,000 rows. 

Well, the abc% pattern query estimates 70,000 rows (good!).  The second query estimates 69.6985 rows.  that's a bit under ;).

Let's talk about this a bit more so you can understand why.  In the first query, there is an exact string match against a column represented in the statistics histogram.  So, the likely outcome is to take that cardinality count to determine the number of rows that will likely be returned from the query.  In this case, we expect all rows to come back.

In the second one, there is no mechanism to estimate the cardinality of ANY string of this size (SQL Server does have a feature that does smaller strings which is exposed as the "STRING INDEX" in the histogram, but you can't see the details of this object in 2005 and I haven't seen that change in 2008 either).  So, for really large strings, it is left with... guessing.

So, that 69.6985 number is an estimate that is partially based on the length of the string.  Now, the QP could try to walk through each statitistics object and try to find substrings against any existing piece of statistical data and then try to adjust its estimate.  In practice, though, the cost of doing that is expensive.  The various statistics objects are run at different times and have different sample rates, so even then they will vary somewhat.  Finally, for most cases it may just not impact the plan choice that much.  Odds are, though, that this will bite at least one of my readers at some point.  So, this is good to know - it can help you find that spot where the assumption in the QP is causing your query plan to be wrong.  This is the sort of case where you will want to consider a query hint to help the QP out.

There are more assumptions (and seams between them) in the cardinality estimation code.  I'll let you guys go hunt for them a bit to find them.

Happy Querying!
Conor Cunningham




Wednesday, April 16, 2008 8:54:35 PM (Central Standard Time, UTC-06:00)  #    Comments [0]  | 
Sunday, March 23, 2008
So I was playing today with the sparse column implementation in the latest SQL 2008 CTP, and I was specifically looking to see how this thing would work with the query processor.  I wanted to know how much information is exposed for the QP to make smart decisions when this functionality is used in complex queries with cost-based plan choices.  If the QP doesn't have the information, then sometimes the query plans will be sub-optimal because, well, garbage-in garbage-out.  While the SQL Server QP does a tremendous job at making complex plan choices compared to some of the other commercial alternatives, there are still limits on what the Optimizer can model in a reasonable amount of time.  As such, there are seams where the product tends to not work as well as one would hope.  This will always be true.  While I suppose that will also keep me employable, it is useful to understand those limits because it will help you know where to look or, if it's really hard, when to ask for help.

The SQL Server QP knows a couple of things about the data stored in a table in the storage engine:
1. How many physical pages it uses
2. How many rows it has in it (approximately)
3. Single-column statistics over a sample of the data
4. A basic notion of column interdependance to help in estimating queries with multiple scalar predicates.

From 1 and 2 it can derive the average row width.  That's useful for determining things like "how big will my sort be" if the query needs to sort.  That's a good thing - it leads to reasonable estimates for many choices in the QP.

So let's add sparse columns into the mix.  Sparse columns are useful for data with lots of NULLs.  Often this is a result of a non-traditional third-normal form database problem or, perhaps someone who is not a database person not really trying to make something into a database problem early enough in its lifecycle.  The point is that commercial database systems have a sweet spot around handling data sets with known (and small) sets of columns that can be stored in tables.  There is a TON of expressiveness available in query processors that manipulate this data because this format of data is better supported than other formats.

None of this really means that your problem is going to easily fit into a nice third-normal form system.  Often there are legacy or performance concerns that push an application away from that sweet spot.  Over time, various technologies have tried to bridge that gap (property tables, XML, and object-relational mappings).  Each of them have their own reasons to be, and I don't want to get into them in depth in my post.  I'm going to talk about how the QP deals with these from a modeling perspective.

I built two examples to explore how SQL Server 2008 reasons about sparse columns.  One example creates lots of traditional, nullable float columns while the other is exactly the same except that it uses the sparse attribute.

A few things I learned immediately:
1. Sparse columns don't change the maximum number of columns you can create in a table.  On the surface, this seems unfortunate, since it will limit the kinds of applications that can use the feature. 
2. It does seem to use less space per row.  This isn't hard, as the row format for SQL Server has a null bitmap and also needs 2 bytes per column to store the variable offset pointers.

create table sp1(aaa int)
create table sp2(aaa int)

declare @i int
set @i=0
while (@i < 990)
begin
declare @sql nvarchar(400);
declare @s nvarchar(20);
set @s = @i;
set @sql = 'alter table sp1 add col' + @s + ' float sparse'
exec sp_executesql @sql
set @i=@i+1
end

declare @i int
set @i=0
while (@i < 990)
begin
declare @sql nvarchar(400);
declare @s nvarchar(20);
set @s = @i;
set @sql = 'alter table sp2 add col' + @s + ' float'
exec sp_executesql @sql
set @i=@i+1
end
declare @i int
set @i=0
while @i < 20000 
begin
insert into sp1(col2) values (123.4)
set @i=@i+1
end

declare @i int
set @i=0
while @i < 20000 
begin
insert into sp2(col2) values (123.4)
set @i=@i+1
end
If we run "set statistics io on" and then run "select * from sp1" and "select * from sp2", you'd like to see some difference in IOs:

sp1:
(20000 row(s) affected)
Table 'sp1'. Scan count 1, logical reads 86, physical reads 0, read-ahead reads 80, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

sp2:
(20000 row(s) affected)
Table 'sp2'. Scan count 1, logical reads 20000, physical reads 1, read-ahead reads 19978, lob logical reads 0, lob physical reads 0, lob read-ahead reads 0.

Well, that's good - the sparse format on largely sparse data saves space.  We can confirm that with a quick look into the system tables:
SELECT o.name AS table_name, au.type_desc, au.used_pages
FROM sys.allocation_units AS au
    JOIN sys.partitions AS p ON au.container_id = p.partition_id
    JOIN sys.objects AS o ON p.object_id = o.object_id
WHERE o.name in (N'sp1', N'sp2')
table_name                                                                                                                       type_desc                                                    used_pages
-------------------------------------------------------------------------------------------------------------------------------- ------------------------------------------------------------ --------------------
sp1                                                                                                                              IN_ROW_DATA                                                  87
sp1                                                                                                                              ROW_OVERFLOW_DATA                                            0
sp2                                                                                                                              IN_ROW_DATA                                                  20001

(3 row(s) affected)

We've now confirmed that we actually do have fewer pages.  This is also good.

Now let's see how far into the QP this extends.  Does the QP model the costs for these two queries differently?

SP1 TotalSubtreeCost: 0.08824496
SP2 TotalSubtreeCost: 14.83936

And that, my friends, is a "good thing".  This means that sparse columns are going to help your complex queries when you use a table with sparse columns in it.  The easiest way to implement this is to simply ignore the new feature in the QP, and obviously someone did a good job to make sure that it was costed properly. 

I don't believe that there are additional statistical structures to tell the QP which things are on/off row.  This will show up in a small number of scenarios (similar to how LOB data can be on/off row).  This is outside of the model for how the QP reasons about plan cost, at least from what I've seen from SQL 2008 and from what was publicly said about 2005.

Thanks all,

Conor Cunningham


Sunday, March 23, 2008 7:35:01 PM (Central Standard Time, UTC-06:00)  #    Comments [1]  | 
Friday, March 14, 2008
You may recall my previous posts on my trouble with SQL 2008 CTP6.  I've made some progress on fixing my machine that I thought I'd share with you.  I now get past the following error:

D:\temp\sqlctp6\servers>setup
The following error occurred:
MsiGetProductInfo failed to retrieve ProductVersion for package with Product Code = '{00E75F61-A126-4CE1-90B8-42295052F1AC}'. Error code: 1605.

I useded the SysInternals (err, Microsoft) Process Monitor Tool and watched for keys found/not found during the failed install.  This found a few keys in HKCR\Installer\UpgradeCodes that were being found early in the setup100.exe process.

(Fair notice - modifying the registry on your computer is your problem, not mine :).

So take the key:
00E75F61

Reverse it.  I think I had 615fe700, but it was late and I was tired.  It might have been 16f57e00.  Anyways, you will see some keys under HKCR\Installer\UpgradeCodes.  There are actually 2-3 places in the registry searched for each key  I've been killing all three of them for each key - there are about 7-10 keys.  The registry section looks like this:



The hex digits you have in the error will correspond to the right hand side of this picture.  The key I've been deleting is it's parent, which is the key being opened in the "key found"/"key not found" stuff in the process monitor log.

Here's what the log looked like for me:

136791  11:53:49.4922341 PM      setup100.exe    2812       CloseFile              D:\temp\sqlctp6\tools\setup\sqlrun_bids.msi                SUCCESS             

136793  11:53:49.4923712 PM      setup100.exe    2812       WriteFile              C:\Program Files (x86)\Microsoft SQL Server\100\Setup Bootstrap\Log\20080313_2353\WOPR_20080313_2353_Detail_ComponentUpdateSetup.txt SUCCESS                Offset: 7,591, Length: 62

136795  11:53:49.4956628 PM      setup100.exe    2812       RegOpenKey                HKLM\Software\Microsoft\Windows\CurrentVersion\Installer\Managed\S-1-5-21-2888934283-224128331-3030229123-1000\Installer\UpgradeCodes\87674BD65E9A5D1409951D671E37BDA4          NAME NOT FOUND         Desired Access: Read

136796  11:53:49.4957796 PM      setup100.exe    2812       RegOpenKey                HKCU\Software\Microsoft\Installer\UpgradeCodes\87674BD65E9A5D1409951D671E37BDA4      NAME NOT FOUND                Desired Access: Read

136797  11:53:49.4958199 PM      setup100.exe    2812       RegOpenKey                HKCR\Installer\UpgradeCodes\87674BD65E9A5D1409951D671E37BDA4 SUCCESS              Desired Access: Read

136798  11:53:49.4958553 PM      setup100.exe    2812       RegEnumValue                HKCR\Installer\UpgradeCodes\87674BD65E9A5D1409951D671E37BDA4 SUCCESS              Index: 0, Name: 16F57E00621A1EC4098B249205251FCA, Type: REG_SZ, Length: 2, Data:

136799  11:53:49.4958960 PM      setup100.exe    2812       RegOpenKey                HKLM\Software\Microsoft\Windows\CurrentVersion\Installer\Managed\S-1-5-21-2888934283-224128331-3030229123-1000\Installer\UpgradeCodes\87674BD65E9A5D1409951D671E37BDA4          NAME NOT FOUND         Desired Access: Read

136800  11:53:49.4959398 PM      setup100.exe    2812       RegOpenKey                HKCU\Software\Microsoft\Installer\UpgradeCodes\87674BD65E9A5D1409951D671E37BDA4      NAME NOT FOUND                Desired Access: Read

136801  11:53:49.4959604 PM      setup100.exe    2812       RegCloseKey                HKCR\Installer\UpgradeCodes\87674BD65E9A5D1409951D671E37BDA4 SUCCESS             

136802  11:53:49.5146648 PM      setup100.exe    2812       RegOpenKey                HKLM\Software\Microsoft\Windows\CurrentVersion\Installer\Managed\S-1-5-21-2888934283-224128331-3030229123-1000\Installer\Products\16F57E00621A1EC4098B249205251FCA        NAME NOT FOUND         Desired Access: Read

Now the installer gets past this and tries to install the engine and then fails, but I will call this progress ;).

Notice - I had deleted all of my physical files for SQL Server from the machine, so killing the registry keys seemed like a reasonable next step.  I can't promise you it's a good idea since I don't have things working yet, but I hope this helps the many of you who mailed me and found me via search engines.

Thanks,

Conor Cunningham
Friday, March 14, 2008 5:15:30 PM (Central Standard Time, UTC-06:00)  #    Comments [1]  | 
Saturday, March 08, 2008
So as you proceed up the river into the jungle, searching for answers about how the query optimizer works, I'll ask you one question: Did you know that there's actually a lot of stuff that the Optimizer team just tells you?  It's in the product, and I'm constantly suprised by how little attention they get.  You can learn all sorts of things by looking at the data.  Now, not all of it is documented, but that's usually just because one needs leeway to change the internals rather than some deep, dark secret that needs to be kept.

Now that you're hooked ;), I'll tell you a bit about an optimizer DMV called sys.dm_exec_query_optimizer_info.  It tells you all sorts of things about how your querys are optimized.  It's a bag of counters for all sorts of things that, with a trained eye, can give you lots of insight into what is happening.

This guy is actually documented, at least partially, on MSDN. 




(That's from my SQL 2008 install, btw).

Some of these fields are "undocumented".  search 0, 1, and 2 are in that category.  I won't talk about them except to say that the names aren't really obfuscated too much. 

To learn about a particular query, you find a nice, quiet server and:

1. select * from this table, store the results somewhere
2. optimize a query of interest,
3. select from this table again, then compare the current totals to the originals.

I think that this is one use of this DMV - trying to figure out why a query takes a long time to optimize.

The other use of the DMV is to get a good statistical picture of a running system.  Say that I'm a DBA and I want to know how many queries in my application have hints or _need_ hints to work well.  Well, this will tell you.  Granted, it doesn't separate recompiles from compiles, and if you have a system where plans are getting kicked out of the cache things may be a bit skewed, but I can tell you that this is far better than simply guessing.  Often the DB application developer doesn't realize that they've built an application that requires a lot of hinting or a lot of compilations, and you can see this in more detail than you get with the performance counters.

I've already talked about "trivial plans", which are not documented in this DMV but are widely known in the other outputs of the system.  I'll let you guys guess about the search 0, 1, and 2 stuff - if you can back up your guess with a public post, book, or other form of comment I'll confirm if you get it right.

Have a great weekend, ya'll.

Conor Cunningham
Saturday, March 08, 2008 3:41:56 PM (Central Standard Time, UTC-06:00)  #    Comments [2]  | 
Tuesday, January 29, 2008
So I'll be posting explanations for each query operator in the output query trees that you can see through "set showplan_text on" in SQL Server.  I'm hopeful that this will give you the tools to better read the query plans being generated by the QP when evaluating your system.

So I'll start with one that I added into the optimizer - the "sequence project".

create table z(col1 int identity, col2 int default rand()*10000)

declare @i int 
set @i=0
while @i < 1000
begin
insert into z default values
set @i=@i+1
end

select RANK() over(order by col2), * from z



If you look at the graphical showplan you get an output plan that looks a bit funny.  A Sequence Project is a special form of projection (the thing that evaluates the scalars in the SELECT list for each row).  By default, a regular project in SQL Server does not preserve nor guarantee any sort properties in the physical plan.  As a result, ranking functions put in regular projects would not preserve order in all cases deterministically.  People seem to complain when the query results are incorrect.

Enter the sequence project....This little beauty does do scalar computations, but they are generally limited to a very small number of operations that require order for one reason or another.  In the case of ranking functions, an order is required as part of the syntax, and you will see that this particular plan needs a sort since there are no indexes on the base table to satisfy a sort request on col2.  (I'll talk about segments in a future post, but for now ignore them and assume that they are part of the sequence project when looking at this plan).

The problem with sequence projects are that while they preserve ordering semantics, they interfere with a lot of the other optimization logic in the QP.  For example, these can interfere with index matching for predicates above the sequence project in a query tree.  Operators that block other optimizations tend to make optimizer developers very unhappy and sometimes elicit violent reactions!  So, be careful when proposing "opaque" operators to the next query optimization developer you see ;).

So, SQL Server has a small number of operations that in previous releases of the product were "incorrectly" put into regular projects that are now in sequence projects in the current versions of the product.  (By "incorrectly", I mean that there were potential optimizations that could be run that would break their semantics, not that there were actively executing plans at a customer site that were incorrect.  In optimizer terminology, they model was not correctly preserving the semantics even though there were no rules that exposed this fault).

So, the challenge for you is think of some places _beyond sequence functions_ (RANK, DENSE_RANK, NTILE, ROW_NUMBER, ...) that use the sequence projection operator.  If you mail them to me with plans, I may be able to tell you more about what is happening for each of them.

Happy hunting!

Conor Cunningham
Tuesday, January 29, 2008 8:06:04 PM (Central Standard Time, UTC-06:00)  #    Comments [0]  | 
Sunday, January 27, 2008
I've been trying to think of a good way to introduce how SQL Server does subqueries based on a question from a reader.  One of the tricky pieces to query optimization is that the "language" used internally in the optimizer is actually somewhat more expressive than SQL.  The second part of the problem is that, for people who have used other products at one time or another, it's a challenge to get them to re-learn something to grok how SQL Server does it (and even to figure out what they need to re-learn).

So I'm going to do a series of posts on subqueries and the related pieces behind subqueries to put them in context.

Let's start at the beginning.  A subquery is a fancy filter condition where you can specify an query to run for each row of the outer query.  This doesn't mean that's how the optimizer will actually execute it, but you can start thinking about it like that:

SELECT * FROM MyTable WHERE productkey IN (SELECT * FROM Orders where MyTable.productkey=Orders.productkey)


One way to execute this query is to read each row from MyTable, get productkey, start up another query to look through orders, and determine if we want to keep the row from MyTable based on the result of that second query.  This is how some of the other QP's on which I have worked attack subqueries and is perfectly valid.

In the internal lanaguage of the optimizer (and you can read this in output plans), you will typically see that this is executed physically in SQL Server using a "semi-join", which you should imagine as a join operator that does what we've described within one single query instead of running an extra query per row.  A semi-join will return rows from only one side of the join (left or right semi join).  These semijoins are nice because you can actually re-use your physical join implementation to execute them.  They are also nice because you can also leverage your tree rewrite logic and cardinality estimation logic to help you think about these like joins as well.  So, you can reorder and rewrite the query into alternative, equivalent forms that might run more quickly.

The nice part about this approach is that you get the benefit of a lot of work and experience estimating cardinality for joins, which helps the SQL Server Optimizer make decisions about these rewrites.  The bad part about this is that in queries with many joins or queries with lots and lots of data, your odds of getting a mistake in join cardinality/cost estimation go up and up, and your opportunity to get a plan that is far from optimal also go up.

My next post will cover some of these things in more detail - off to get my kid up.

Conor
Sunday, January 27, 2008 7:59:43 AM (Central Standard Time, UTC-06:00)  #    Comments [3]  | 
Tuesday, January 22, 2008
This post builds upon my previous post on some of the various kinds of plans for the IN operator.  I will cover some of the query plan cache behavior associated with IN plans.

For this post we'll be using the following query which will let you look at the text of plans in the cache and also see how many plans are in the cache.  (One can use "dbcc freeproccache" to clean out the cache whenever you are testing this - please don't do this on a production system unless you know what you are doing)

(Hey, I figured out how to post code properly in the blog.  and there was much rejoicing! yeah!)
SELECT text, plan_handle, * FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(sql_handle)
CROSS APPLY sys.dm_exec_text_query_plan(qs.plan_handle, qs.statement_start_offset, qs.statement_end_offset) AS qp

OK, the two things of interest are the number of rows returned and the text of each plan.  I will note that the data in these sys.dm_exec* functions is "mostly up to date", but it doesn't follow an exact transactional semantic - so, you may need to wait a second or two to get the "proper" state of the plan cache after each step.

If you run dbcc freeproccache, then this query, you'll see one entry in the result (for the query you ran to get the status).  If we run query 1 from our previous post (select * from t1 where col2 in (1)) and look at the result, we get something interesting:
(@1 tinyint)SELECT * FROM [t1] WHERE [col2]=@1

Hey, what's this?  Well, this is called "auto-parameterization" in SQL Server, and it happens on Trivial Plans (see previous post).  The idea is that SQL Server has a fancy query processor, but fancy means powerful for big queries and perhaps not the fastest on the small stuff.  In general, we're not saying that it's slow, but if you are compiling thousands of these things right in a row, you may notice a difference.  So, this autoparameterization feature is an attempt to reduce compilation time by finding plans that have no cost-based decisions and then templatize-ing them so that all other queries that look the same can re-use the plan.  This has a huge advantage for customers with query-generation components, as they often will generate many, many queries that only differ by the constants.  So, SQL Server uses this on trivial plans.

If you run the query "select * from t1 where col2 in (2)" (so, the constant is changed but nothing else), then run our big fancy cross apply query, then you see 2 query plans instead of 3.  Both of our new queries use the following plan:

(@1 tinyint)SELECT * FROM [t1] WHERE [col2]=@1

Now, running queries with more than one element in the IN list generates a plan for each of them.  (I don't recall exactly why off-hand, but I'll think about it some more)

So, you can see the plan entries for a range of variations.

select  * from t1 where col2 in (2,1)
select * from t1 where col2 in (1,2,3,4,5,6,7,8,9,10)
(@1 tinyint)SELECT * FROM [t1] WHERE [col2]=@1
select * from t1 where col2 in (2,1)
select * from t1 where col2 in (1,2)
select * from t1 where col2 =1 or col2=2

I'll point out that, except for trivial plans that are auto-parameterized, the plan text is used as the "key" into this cache.  It does not logically compare the output plans or even a normalized form of the input.  So, you'll notice that "
select  * from t1 where col2 in (2,1)" is in the cache twice because I put a space in the second time I ran that query.  So, you need to consider this when we get to the original question about plan cache pollution.  If your query generation engine (LINQ or whatever) adds comments into the query text sent to SQL Server, that could very well lead to more plans in the cache.

The more complex queries from my last post also generate unique plan entries in the same way, so I'll skip those here.

The next interesting factor to consider is forced parameterization.  This was added in SQL Server 2005, and it is both powerful and dangerous.  This feature is a database level feature, most literals are converted to paramters before query optimization in most query plans.  This is covered under the "Forced Parameterization" topic in SQL Server books online, so I'll suggest you read that if you want all the details.  I did a post or two on plan sensitivity to different parameter values, so beware that in some cases you may end up with query plans that are suboptimal and are stuck in the cache.  So, you should experiment with this setting to make sure that it really does benefit your application before you turn it on.

As I like to live dangerously, I will ignore all of these warnings and proceed with great vigor in the name of Science!

Let's try three queries that yielded separate plans without forced parameterization enabled:
select * from t1 where col2 in (2,1)
select * from t1 where col2 in (1,2)
select  * from t1 where col2 in (2,1)
Now we get one entry in the cache, and it looks similar to those auto-parameterized queries:
(@0 int,@1 int)select * from t1 where col2 in ( @0 , @1 )

Notice that even spaces are "fixed" in terms of cache hits in this mode.  The query is parsed and then the text is generated from the parsed tree, which gives one the ability to normalize the input SQL for the plan cache lookup.

Adding a 3-element IN list query generates a new plan in the cache:
select * from t1 where col2 in (1,2,3)
(@0 int,@1 int,@2 int)select * from t1 where col2 in ( @0 , @1 , @2 )
(@0 int,@1 int)select * from t1 where col2 in ( @0 , @1 )

OK, so we can get a plan per number of elements, which perhaps has its advantages in some cases.  Good to know.

I also tried our large cardinality IN list example (the one that generated the semi-join in my previous post):
(@0 int,@1 int,@2 int,@3 int,@4 int,@5 int,@6 int,@7 int,@8 int,@9 int,@10 int,@11 int,@12 int,@13 int,@14 int,@15 int,@16 int,@17 int,@18 int,@19 int,@20 int,@21 int,@22 int,@23 int,@24 int,@25 int,@26 int,@27 int,@28 int,@29 int,@30 int,@31 int,@32 int,@33 int,@34 int,@35 int,@36 int,@37 int,@38 int,@39 int,@40 int,@41 int,@42 int,@43 int,@44 int,@45 int,@46 int,@47 int,@48 int,@49 int,@50 int,@51 int,@52 int,@53 int,@54 int,@55 int,@56 int,@57 int,@58 int,@59 int,@60 int,@61 int,@62 int,@63 int,@64 int,@65 int)select * from t1 inner join t2 on t1 . col1 = t2 . col1 where t1 . col2 in ( @0 , @1 , @2 , @3 , @4 , @5 , @6 , @7 , @8 , @9 , @10 , @11 , @12 , @13 , @14 , @15 , @16 , @17 , @18 , @19 , @20 , @21 , @22 , @23 , @24 , @25 , @26 , @27 , @28 , @29 , @30 , @31 , @32 , @33 , @34 , @35 , @36 , @37 , @38 , @39 , @40 , @41 , @42 , @43 , @44 , @45 , @46 , @47 , @48 , @49 , @50 , @51 , @52 , @53 , @54 , @55 , @56 , @57 , @58 , @59 , @60 , @61 , @62 , @63 , @64 , @65 )


Yes, it seems to keep on going.  For once, I read the documentation and determine that one of the disqualifying conditions is:

Statements that contain more than 2,097 literals that are eligible for parameterization.

So, it will stop at some point based on how many literals you have in your query.

So while we're not up to LINQ yet, I think we have a pretty good basis to evaluate whether LINQ or any other query generator is going to cause plan cache issues.  I'll try to do some research on the shipping version of LINQ and see what I can determine by examining the plans that get generated.

That's enough for one evening from me.

Thanks,

Conor
Tuesday, January 22, 2008 9:02:28 PM (Central Standard Time, UTC-06:00)  #    Comments [1]  | 
Saturday, January 19, 2008
I've already received a few comments and mails from people asking questions...

One comment I've received is about LINQ's IN operator(method?) and SQL Server plan cache pollution.  I'm researching the issue a bit before I post a complete answer, but I'll start with some background information on memory management in SQL Server.  I co-authored a paper on a new memory management algorithm that I helped develop that went into SQL Server 2005.  I don't think it has changed much in SQL Server 2008 from what I've seen of the November CTP at least.

The interesting thing about memory in SQL Server is that, unlike some of the other vendors in the database space, SQL Server runs all of its users within the same server process.  (This derives from UNIX traditionally using fork() to create another server process to service each user).  Windows has better thread support, and as a result it was possible to make SQL Server run all of the user activities within a single process.  This has advantages because you can avoid inter-process communication for a lot of things, like a plan cache.  You can also "be smart" on things like which activity to schedule next in ways that an Operating System will not do.  Often this can lead to large performance gains.

One downside of this single-process approach is that, at least on 32-bit machines, virtual memory space becomes a limited resource, especially on larger deployments.  Every component has to fit into about 1.4GB of space (with the remainder taken up by stacks for threads, address space for mapping DLLs into the process, etc).  The limitation does not exist on 64-bit machines, which is why I recommend that all new serious SQL Server deployments closely consider their architecture when purchasing new machines.

This paper was accepted at the Conference on Innovative Data Systems Research (CIDR), of which Jim Gray was one of the founding organizers.  It has some details in the kinds of problems that were solved and the performance gains that were achieved by more actively managing the memory consumption and consumption rates of each internal subcomponent.  In a sense, this is how one could build a special-purpose memory management system for a machine that only ran a database.

Memory-CIDR2007Accepted2.doc (400 KB)

I will start working on some posts on some of the interesting things that happen in the query processor to handle IN clauses (in SQL, not LINQ).  Then I'll do some work with LINQ's IN equivalent and see what interesting things I can post. 

Thanks,

Conor Cunningham
Saturday, January 19, 2008 11:28:32 AM (Central Standard Time, UTC-06:00)  #    Comments [0]  | 
Wednesday, January 16, 2008
Here is the introductory post to my new blog on SQL, data-driven applications, performance, scaling, and anything else I feel like posting.  For those of you who do not know me, Paul Randal (http://www.sqlskills.com/blogs/paul/) and I worked together on Microsoft SQL Server for a number of years.  While Paul worked in the Storage Engine, I worked on the Query Processor.  Like Paul, I left Microsoft about a year ago (all good things must come to an end...).  Paul and Kimberly have been gracious enough to host me on their site so I can help answer questions about SQL Server, teach people a bit about what I know, and force me to learn a few things as well.

While at Microsoft, I was a development lead on the Query Optimizer and had architectural dutues for much of the query pipeline.

I actually blogged in my official capacity at Microsoft.  You can read about  my previous posts  here (http://blogs.msdn.com/queryoptteam/) and here (http://blogs.msdn.com/sqlqueryprocessing/default.aspx).

I'll be posting some content about the new SQL Server 2008 features I find interesting over the coming weeks.  I'm also happy to answer general questions about the Query Processor or SQL Server in general.

So, welcome!

Sincerely,
Conor Cunningham

Wednesday, January 16, 2008 8:12:42 PM (Central Standard Time, UTC-06:00)  #    Comments [2]  | 

Theme design by Jelle Druyts

Pick a theme: