I spent some time playing with the SQL 2008 geometry data type.  This post will cover some of the things I have learned about how it is implemented from playing with the feature.  Some of these will impact the feature's usability for some.  Others may not care.

1. late-binding of the geometry data is an interesting choice.  So one of the nice properties of a SQL query is that, for many things, the parse and bind of a query can find all sorts of compilation errors that would show up during development.  You don't have to run the query in order to see the simple mistakes, and that helps us all.  This type doesn't throw parsing errors until you try to execute the query.  That's not perfect - you have to go execute spatial queries to see whether you have typed in lots of complex data correctly or not.  I can speak from experience that I tend to mess that stuff up.

Example:

create table g1 (col1 int identity, col2 geometry)
go
insert into g1(col2) values (Geometry::STGeomFromText('LINESTRING(0 0, 10 10, 21 2)',0));
go
insert into g1(col2) values (Geometry::STGeomFromText('LINESTRING(funky chicken)',0));
go
That third query compiles just fine - running it returns an error.

2. The geography type uses some form of the .net framework.  (I'll point out that I have not enabled the CLR in my server).  Given that the ordpath type requires that the CLR be enabled, I'll guess that this means that there is a special implementation of the geometry type.

Here's the runtime errof from the previous query:

Msg 6522, Level 16, State 1, Line 1
A .NET Framework error occurred during execution of user-defined routine or aggregate "geometry":
System.FormatException: 24141: A number is expected at position 16 of the input. The input has funky.
System.FormatException:
   at Microsoft.SqlServer.Types.OpenGisWktReader.RecognizeDouble()
   at Microsoft.SqlServer.Types.OpenGisWktReader.ParsePoint()
   at Microsoft.SqlServer.Types.OpenGisWktReader.ParseLineStringText(FigureAttributes attributes)
   at Microsoft.SqlServer.Types.OpenGisWktReader.ParseLineStringTaggedText()
   at Microsoft.SqlServer.Types.OpenGisWktReader.ParseGeometryTaggedText()
   at Microsoft.SqlServer.Types.OpenGisWktReader.ReadGeometry()
   at Microsoft.SqlServer.Types.SqlGeometry.STGeomFromText(SqlChars geometryTaggedText, Int32 srid)
.
The statement has been terminated.

(I think I will be making t-shirts that say "The input has funky" on it... let me know if you want in :).

I am actually pretty happy with the error message - the only thing missing is which column caused the error.  You'll notice that "position 16" is inside of this second string being parsed by the .NET routines.  I personally do find stack traces unacceptable for general use, but the .net and java folks seem to have doubled down on this approach, so we all get to learn about their libraries even in SQL now...

This is just a seam in how it was implemented.  Be aware of this when using the type.

3. the geometry type has been (arbitrarily) given an estimated row width of about 4000 bytes.  well, this is what shows up in showplan.  So I inserted a bunch of data into this table (1.4 million rows in about 148MB).  The cost estimates don't seem to use the 4000 byte number for the row width (as it would take a long time to read 5.6GB - more than 13 seconds or so, given that is what the costing says, so I am guessing that this is just reported in showplan the same way it is costed internally).  I'm just using a heap (no indexes) for this experiment.

4. I'm waiting for maps.live.com  to make it easy for me to geocode addresses and insert them into SQL.  It has a feature to send it to some gps device, but I haven't played with it beyond learning that it didn't detect that I had blocked cookies by default ;).

5. "select col2.ToString() from g1" is a way to see something useful from that data type.  otherwise you get back binary goo...

That's all for tonight.

Thanks,

Conor Cunningham





Categories:
SQL Server 2008

A follow-up from my previous post.

Is COUNT(1) faster than COUNT(*)?

No, COUNT(1) is mapped into COUNT(*) early in the QP.  From the logic of my previous post, COUNT(1) is always non-null and thus is the same as just doing a COUNT(*).  Since it is semantically equivalent to COUNT(*), the QP just maps it to COUNT(*) because it makes things simpler and you can avoid special cases for things like local-global aggregation and moving group bys around joins.

You can actually look at the query plan for "SELECT COUNT(1) from Table;" and see that the output plan is COUNT(*).

You can save it a few instructions by using COUNT(*).  You will also make the ANSI SQL gods happier.

I get a lot of questions about these syntax issues - it makes sense, as this is often very interesting in traditional procedural programming languages.  In many cases, there are only very minor performance differences, if any, in the SQL language due to these.  The big differences happen because your slightly different query actually means something semantically different and thus implies a different amount of work to compute.

Please keep sending in your questions - I'm happy to answer them!

Thanks,

Conor Cunningham

Categories:

So COUNT isn't an operator, but you should view this post as a "what is this operator" kind of post since I talk about how these things work and what they mean.

One of the benefits of building database engines is that all (well, most of) the syntax rules end up stuck in your head, but I know that many of these things are completely non-obvious, so I think it's worth mentioning a few things about what each of these things mean since I see them misused in lots of queries, blogs, etc.

-- Returns the count of rows from the table at the time that the query is run in the transactional isolation mode in which the query is run (usually read committed for SQL Server)
SELECT COUNT(*) FROM Table

So that example is easy enough...

SELECT COUNT(column1) FROM Table

This is actually NOT the same query as COUNT(*), in general.  It means "count the number of non-null column1 rows". 

(Now for something cool.  If you run this on a column that is non-nullable, then SQL Server converts it into the count(*) case because it is faster to run that form since it doesn't have to examine the data in each row and can instead just count rows).

SELECT COUNT(DISTINCT column1) will count the number of UNIQUE non-null column1 values.  It does not count NULL.  I don't believe SQL Server completely removes the DISTINCT operation for non-null columns in all cases.  It can in some cases.

COUNT(col) OVER (...) is a completely different beast.  It runs a count computation using the rules you've seen above but it does not collapse the rows - it adds a new column on all the rows with the computed count. 

Remember that these are semantically different operations.

Conor




Categories:

I started off tonight playing with the new page compression feature.  So far I like it.  I haven't found something yet about which I want to post (which is code for "I'm still looking for the seams ;)"), but I have some other things you can try to learn a few things about how the SQL Server QP makes assumptions about various kinds of predicates in queries during cardinality estimations.

So, you may or may not know much about how the SQL Server QP figures out what plans to run.  For the uninitiated, it almost seems like some form of magic.  In some ways, it is - it's very powerful and poorly understood by many, and it usually requires very little effort by someone skilled in the area to make something amazing happen.  SQL Server merely needs to make itself sparkle when I fix a query plan and I'm set for life :).

In reality, SQL Server uses a cost-based optimizer, which means that it keeps track of all sorts of interesting statistical information, row counts, page counts, etc.  It uses all of these in formulas to come up with numbers for each plan fragment and then it weighs the relative costs of all of these to pick a plan that has the "least cost".  That sounds nice and absolute until you get to go actually try to make that work, and then you are left with all sorts of nasty questions like:

* What should the cost formulas be?
* Do the numbers need to differ based on the customer's hardware? How do we calibrate all of this stuff?  What do we do as machines get faster?
* How do I estimate how many rows are going to come back from one predicate in my WHERE clause or join in time less than running the query to figure it out?
* Same question when I have a bunch of preciates?

Eventually, the QP has to make a set of assumptions so that they can come up with a plan in a reasonable amount of time, both because customers don't like things to ever take time and because managers don't like customers to tell them about how much time something should take..  One assumption might be that, data is uniformly distributed over a data type's possible values when you don't have any better information.  This can help make it possible to come up with solutions that work well most of the time.  The problem is that estimates can be wrong, and that can cause the QP to come up with a different plan than had it had correct information to use when selecting the plan.

So, I'll show you an example here.  To be clear, I'm not saying that this is something that is "broken".  This just exposes a place where 2 different assumptions rub up against each other in a way that will SEEM odd to the outside observer.  When you consider the average customer use cases, these assumptions are not bad and work very well the vast majority of the time...

To the example:
drop table comp1
create table comp1(col1 int identity, col3 nvarchar(3000))

declare @i int 
set @i=0
while @i < 70000
begin
insert into comp1 (col3) values (N'123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890')
set @i=@i+1
end
So I create a table with the same long string in it 70,000 times. 

Then I run a query with a where clause just to get some statistics created:
select  * from comp1 where col3 like '%4567890%'
dbcc show_statistics ('comp1', col3)

Once we have all of this stuff, we can look at the estimates for two very similar queries:

select  * from comp1 where col3 like '123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890%'

select  * from comp1 where col3 like '%123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890%'
(one is LIKE 'abc%'.  The other is LIKE '%abc% where abc is the value we have inserted 70,000 times).

So, both queries will return 70,000 rows. 

Well, the abc% pattern query estimates 70,000 rows (good!).  The second query estimates 69.6985 rows.  that's a bit under ;).

Let's talk about this a bit more so you can understand why.  In the first query, there is an exact string match against a column represented in the statistics histogram.  So, the likely outcome is to take that cardinality count to determine the number of rows that will likely be returned from the query.  In this case, we expect all rows to come back.

In the second one, there is no mechanism to estimate the cardinality of ANY string of this size (SQL Server does have a feature that does smaller strings which is exposed as the "STRING INDEX" in the histogram, but you can't see the details of this object in 2005 and I haven't seen that change in 2008 either).  So, for really large strings, it is left with... guessing.

So, that 69.6985 number is an estimate that is partially based on the length of the string.  Now, the QP could try to walk through each statitistics object and try to find substrings against any existing piece of statistical data and then try to adjust its estimate.  In practice, though, the cost of doing that is expensive.  The various statistics objects are run at different times and have different sample rates, so even then they will vary somewhat.  Finally, for most cases it may just not impact the plan choice that much.  Odds are, though, that this will bite at least one of my readers at some point.  So, this is good to know - it can help you find that spot where the assumption in the QP is causing your query plan to be wrong.  This is the sort of case where you will want to consider a query hint to help the QP out.

There are more assumptions (and seams between them) in the cardinality estimation code.  I'll let you guys go hunt for them a bit to find them.

Happy Querying!
Conor Cunningham




Categories:
SQL Server 2008

One of the areas I managed in SQL Server had to do with the code that automatically builds statistics and uses them during query optimization to come up with a good plan.  Today I'm going to talk a bit about how statistics are built and how this works with parallelism and partitioning.

First things first.  There are two ways in which statistics are created in SQL Server:
1. You create an index.
2. You run a query that needs statistics when they do not exist and the server has "auto-create statistics" enabled.  (That's the simple definition - there are actually caveats I am skipping)

Why does creating an index also create statistics?  Well, the main cost in creating statistics is reading all the pages into memory.  The thought is that if you want to create an index, you have already paid the expensive cost and the server might as well go ahead and create the statistics object for you instead of re-reading those pages later.  So, you get these "for free" (well, almost free).

When you run a query that tries to perform an operation where cardinality estimates would be improved by having statistical information about a column, the server can try to create statistics during the compilation process.  Since compilation time needs to be kept to a minimum, these are usually done over a _sample_ of the pages to avoid making the compilation of a simple query as expensive as an index build.

So the basic plan to create an index is a singlethreaded plan that is something like this:

INSERT (New B-Tree)
  |
Sort
  |
SCAN(Heap or some index)

So during this plan's execution, there is an implicit side-effect to also create this statistics object.

For auto-stats, there is a separate query that is run.  The syntax is not public, but you can see artifacts of this if you look at the profiler and you've turned on the various auto-stats and plan outputs:

drop table stat1
create table stat1(col1 int, col2 int, col3 binary(6000))
declare @i int
set @i=0
while @i < 1000
begin
insert into stat1(col1, col2) values (rand()*1000, rand()*1000)
set @i=@i+1
end


select * from stat1 where col2 > 5


Execution Tree
--------------
Stream Aggregate(DEFINE:([Expr1004]=STATMAN([t1].[dbo].[stat1].[col2])))
  |--Sort(ORDER BY:([t1].[dbo].[stat1].[col2] ASC))
       |--Table Scan(OBJECT:([t1].[dbo].[stat1]))


What is this?  Well, this is the plan that is used to generate the statistics object.  It scans a table, sorts it (into ascending order), and then feeds it into this magical thing called statman.

Now, the details of statman are undocumented, but you can infer that it is a special internal aggregate function that is being run inside of a group by operation (stream aggregate in the plan).  This means that it is collapsing all the rows into some BLOB and then it does something with this.


Next time I hope to talk about parallel statistics build.

Happy querying!

Conor Cunningham

Categories:

I've returned from a small trip and I will be preparing my next SQL post soon.

I've been struggling with slow POP3 sync behavior on my Outlook 2007/Vista 64 box, and I finally found a hammer to beat it back into submission.

The problem - I sync manually, and when I do the application becomes basically so slow as to be non-responsive.  I deleted a bunch of mail to get my mailbox down to a reasonable size (150MB) - still there.  I deleted RSS feeds, blaming XML for my woes again :).  That didn't work either.

I eventually found this:
http://support.microsoft.com/default.aspx?scid=kb;EN-US;935400

Something to do with the new Vista TCP window size algorithm not working well with "legacy" hardware.  No only is the download slow, but using Outlook becomes unbearable as well, so I can't really do anything with the application...  I didn't get timeout errors (at least not frequently), but I certainly had bad feelings about my email experience.

When I disabled the new TCP window behavior, all went back to what was expected.... Now I have to go find all those RSS feeds again :).  So here's to "netsh interface tcp set global autotuninglevel=disabled".  It worked for me, at least so far.

I think that this is a case where Microsoft has an opportunity to compare their offering to gmail and determine "hey, I wonder why people think that a webui is better - they are _so_ much slower...." well, in some cases the thick client is actually slower, and that doesn't make MS look very good.  So I hope this workaround avoids frustration for others :).

I'll also hope that MS could add something to outlook in the next service pack when it realizes that it downloads 15KB in 2 minutes from my POP3 servers.  Perhaps they can add a popup or special error in the next service pack of outlook 2007 to point people in the right direction.


My setup:
Vista x64 SP1
netgear gigabit 8 port switch
linksys wrt54g NAT
some no-name cable modem that my cable company gave to me.


Categories:

Theme design by Nukeation based on Jelle Druyts