New blog location

Here is the new blog – please update your feed readers accordingly.


http://blogs.msdn.com/conor_cunningham_msft/default.aspx


I’m getting settled into working for Microsoft again – it’s basically like drinking from a firehose.


I am also still hard at work on the book chapter for the upcoming SQL Internals series. 


FYI, I believe Paul and Kimberly still plan to host the content I posted here, so you can reference it without worry.


Thanks,


Conor

My blog will be moving

As I mentioned in my previous post, I’m going to be taking on a new job soon.  I’m actually going back to work for Microsoft, so it’s not entirely new ;).  I’ll still be living in Austin, TX, and commuting up to visit Redmond regularly.

Unfortunately, working for MS and blogging on a consultant’s website is a bit confusing and perhaps a conflict of interest (or the appearance of one).  Yes, I expect that you guys are smart enough to figure it all out, but it’s probably not a great long-term situation.  So, I’ll have to move my ramblings back over to MSDN or somewhere like that.  I’ll post up a link in a week or so once I have it all figured out.

I’d like to thank Paul and Kimberly for allowing me to blog here – they have been extremely supportive of me doing this, and I certainly appreciate it.  I hope that my content has added to their site.

What will I be doing at Microsoft?  Well, my role will be as a software architect, which is pretty much what I did previously.  I spend time trying to figure out how to make features that will help customers, make those features work well, work fast, work reliably, etc.  I will still be working on the SQL Engine, Query Processing, and whatever else comes my way that interests me.

Why not move back to Redmond?  Well, I moved away due to the climate, and I don’t have any particular desire to move back.  It’s obviously imperfect to work remotely, but I really *do* like building database engines.  The SQL Team was great in finding a way for me to continue working for Microsoft, and I was more than a bit flattered that they were so open in making this work for me.

I plan to leverage my position out here to talk to customers more than is often possible “inside the bubble”, and I am hopeful that this will make me more effective at my job.

That’s the news for today.

Thanks,

Conor Cunningham

What’s Conor been doing?

Well, I can tell you that I’ve been doing a few things lately.


1. I’m writing a chapter for an upcoming book – lots of interesting details like what you read here on the blog about query processing and query optimization.  It will come out around the end of the year.


2. I’ve been getting a new job – I’ll post some details on this soon.  This has taken some more time :).


3. I’ve been doing some consulting work – helping people with some of the more advanced topics in SQL Server.


4. I’ve been getting ready to improve the sprinkler system at my house – going to try this out: http://www.smarthome.com/312cr.html


5. I’m going to try to fix my HTPC.  I have a Hauppauge HDTV card that was working fine with Vista Ultimate + Media Center, and then I installed http://www.mythtv.org/ - MythTV (and me trying to futz with it) won’t believe that I live in my actual timezone.  This tends to make recording things very difficult.  I believe that however it installed Java, it has stored my timezone differently than Windows does.  Now the whole thing just doesn’t work, and I’ll call that “frustrating”.


6. I have been on an automation crusade, trying to automate various tasks around the installation and setup of my computers.  For example, I am collecting the registry changes needed to fix up the Windows UI to my liking on each new machine.  I am also writing scripts to install all of my commonly used tools.  I have determined that I am too old to keep fighting through the same silly user interfaces every time I set up a VM or a new machine.


I am still interested in any questions you have, but I’ll likely continue to post at a somewhat reduced rate until I get this book chapter done :).  I’ve got deadlines!


Happy Querying,


Conor Cunningham

A nice MSDN article on SQL 2008 Data Warehousing

http://technet.microsoft.com/en-us/magazine/cc434693.aspx

It covers some of the new features in SQL 2008 w.r.t. Data Warehousing.

Conor Cunningham

The Trouble with Triggers

(Apologies to Star Trek).

I received a question about trigger performance, especially the dos and don’ts of how they are used.

Let’s first start by saying that triggers can be very useful tools to facilitate server-side programming in OLTP systems.  They are powerful ways to do things like audit changes to a table, validate business logic, or extend the features that the database server provides to you.  So, in the proper hands, these can really provide a lot of business value.

The problem with this area is that there is a great temptation to think about databases procedurally, as you would with a programming language like C++ or C#.  You can write code that looks like a procedural function call and have it get called for each insert into a table!  Before you know it, non-database programmers are checking in code to your production system.  Suddenly, your application grinds to a halt because a trigger plan has no supporting index or is poorly designed.

Databases are great tools for working on SETS of rows.  You can define queries that operate over whole sets of data efficiently, and if you do it properly, you can solve a lot of problems very easily.  For any given query, the startup cost is relatively high, but the per-row cost has been optimized by lots of smart people.  I can assure you that they count instructions, worry about CPU architectures, memory access patterns, cache hierarchies, and every other trick in the book to make the per-row cost of a query as low as possible.  So it may take a few million instructions to start a query versus a few thousand, or tens of thousands, to process each row in a query.

When you add a trigger, think of it as adding another query that runs whenever you run the first one.  Furthermore, if you have written your application to call the database more often than necessary (say, once per row instead of batching things up as sets), then you are now adding the cost of that second query to the PER-ROW cost of the first one.  That is pretty close to the definition of a “bad thing” in the dictionary…

In addition to the number of times that a trigger is called, the trigger plan itself may not be set up to run efficiently.  If the trigger enforces business logic that touches a lot of tables or rows, you need to consider what indexes to add.  I usually create an equivalent SELECT statement and run it to look at the query plan (faking up data and a table to stand in for the inserted pseudo-table, for example).  This helps me find slow queries before the trigger is deployed (you probably want to look closely at every table scan to see if it is really necessary).
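For example, here is a minimal sketch of that approach.  The table names (dbo.Orders, dbo.Customers) and the trigger logic are invented purely for illustration – the point is to build a stand-in for the inserted pseudo-table, fill it with representative data, and look at the plan for the query the trigger would run:

-- Hypothetical trigger logic: for each inserted order, make sure the customer exists.
-- Stand-in for the "inserted" pseudo-table, shaped like the (made-up) dbo.Orders table:
create table #fake_inserted (OrderID int, CustomerID int, Amount money)

-- Fill it with a representative number of rows - a trigger can see many rows per statement.
insert into #fake_inserted (OrderID, CustomerID, Amount)
select top (1000) OrderID, CustomerID, Amount from dbo.Orders

-- Now examine the plan for the query the trigger would run, with #fake_inserted in place of "inserted".
set statistics profile on
select i.OrderID
from #fake_inserted as i
left join dbo.Customers as c on c.CustomerID = i.CustomerID
where c.CustomerID is null      -- any table scan you see here will also show up inside the real trigger
set statistics profile off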

I’ve put together some examples that you can try to see how the system behaves under different scenarios with triggers.  I would publish some numbers to give you a relative idea of how fast things are, but the current CTP of SQL Server that I am running has a few memory issues on my machine that cause it to page wildly on occasion, so I will let you run these on your own.  The basic idea is that things get slower as you pile on more and more trigger “bad practices”.

create table table1 (col1 int primary key identity, col2 int, col3 binary(2000), col4 binary(4000))
declare @a int
set @a=0
while @a < 20000
begin
insert into table1 (col2) values (@a/100);
set @a+=1
end
go
-- target table on which we are creating triggers
create table table2 (col1 int primary key identity, col2 int, col3 binary(2000), col4 binary(4000))

-- audit table
create table table3(col1 int, col2 binary(100))

-- try to insert 20,000 rows with no triggers defined - should be pretty fast.
begin transaction
insert into table2  select col2, col3, col4 from table1
rollback
go

-- now create a simple trigger that reads all the inserted rows and copies them to table3 (an audit table)
create trigger trig1 ON table2 FOR INSERT AS INSERT INTO table3(col1, col2) select col1, null from inserted
begin transaction
insert into table2  select col2, col3, col4 from table1
rollback
-- this will run the trigger query once with all 20,000 rows.  
go

-- here is where triggers starts to hurt.  Let's write the same query except that 
-- we read each row separately in its own query.  (This will be even worse if you remove the begin transaction
-- since each row will need to commit the log to disk).
begin transaction
declare @i int
set @i=0
while @i < 20000
begin
insert into table2  select col2, col3, col4 from table1 where col1 = @i
set @i+=1
end
rollback
-- This is a very poor use of the database - In almost every case, using a set-based approach will 
-- yield better performance.  You avoid multiple transactions.  You avoid running through a long codepath
-- to start each query.  You give the optimizer the chance to consider more advanced plans.
go

-- let's remove trig1 for our next example
drop trigger trig1 
go

-- now let's create a trigger that copies ~100 rows every time it is called.  Furthermore, let's 
-- restrict on a non-indexed column in the source table.  This will lead to a table scan for each call.
-- This should simulate your average "bad" trigger formulation.
create trigger trig2 ON table2 FOR INSERT AS INSERT INTO table3(col1, col2) select col1, null from table1 where col2=5;
go

-- and, for fun, we'll run the one row at a time example again.
-- (go get some coffee, or whatever else you do to amuse yourself - this will take a while.)
begin transaction
declare @i int
set @i=0
while @i < 20000
begin
insert into table2  select col2, col3, col4 from table1 where col1 = @i
set @i+=1
end
rollback

So what is a responsible DBA or database developer to do??? :)

Well, there is hope.  A few things to consider:

1. Look at that trigger – do you need it?  really?  Perhaps not.  Consider each one and make sure that you really, really need it.

2. Make sure that you call any trigger the minimum number of times.  SET STATISTICS PROFILE ON can help you see how many calls are happening for any given query. 

3. You can often pull the logic for a trigger up into the stored procedure that is calling your query.  This works if you can control all of the code paths that change the system – often the case in OLTP applications.

4. SQL Server 2005 added some neat features that help with benchmarks.  However, they are also quite useful for you if you have the time to consume them.  Look up the INSERT … OUTPUT syntax – a rough sketch follows below.  This lets you insert rows into one table and then perform an operation with the rows after the insertion (return them as a select query, insert them into another table, etc.).  This is a very cool feature.  It takes quite a bit of knowledge to wrap your head around what the database system is doing to make this all work (too much blogging for this post, to be sure), but I can assure you that it is pretty fast :).
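Here is a rough sketch of the idea, reusing table1, table2, and table3 from the examples above – the insert and the audit write happen in a single statement, with no trigger involved:

-- Sketch only: audit the insert with an OUTPUT clause instead of a trigger.
insert into table2 (col2, col3, col4)
output inserted.col1 into table3 (col1)
select col2, col3, col4 from table1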

Bottom line – I recommend that you have one person at your company who is responsible for understanding the schema, including things like triggers.  That way someone can think through the tradeoffs and make reasonable decisions to avoid the most common pitfalls.

Happy Querying, folks!

Conor Cunningham

Outer joins and ON clauses vs. WHERE clauses

(I am still around – I’ve just been busy with a few personal projects of late – keep asking questions if you have them).

I received a question from a reader named Andy that was a follow-up to my previous posts on OUTER JOINs and WHERE clauses vs. ON clauses.

Upon re-reading my previous entries, I determined that I got distracted talking about the old-style join syntax instead of giving a good explanation about ON vs. WHERE _semantics_ for outer joins.


drop table cc1
create table cc1(col1 int, col2 int)
insert into cc1(col1, col2) values (1, 2)
insert into cc1(col1, col2) values (2, 3)

create table cc2(col1 int, col2 int)
insert into cc2(col1, col2) values (1, 2)

select * from cc1 left outer join cc2 on (cc1.col1=cc2.col1)

col1 col2 col1 col2
----------- ----------- ----------- -----------
1 2 1 2
2 3 NULL NULL

(2 row(s) affected)


select * from cc1 left outer join cc2 ON (1=1) WHERE (cc1.col1=cc2.col1)

col1 col2 col1 col2
----------- ----------- ----------- -----------
1 2 1 2

(1 row(s) affected)

The question boils down to “why are these two queries returning different results?” This is a very good question.

I’ll explain the what, then I’ll try to explain the why/how:

In the first query, the ON clause has the predicate. This logically ties the predicate to the join. Since OUTER JOINS can return rows that do not match, the predicate is used to determine what “doesn’t match”.

In the second case, the ON (1=1) clause effectively tells the QP to do a full cross product – every row “matches”.  The WHERE clause is not bound to the OUTER JOIN semantics at all, and that is why the two queries return different results.

Now let’s talk “why/how”.  The QP represents this second query as a tree with a filter above the join.  The QP has lots of smarts in it.  One of the things it can determine is that the filter condition (cc1.col1=cc2.col1) actually prevents NULLs from being returned, since NULL = anything is UNKNOWN in three-valued logic (and therefore not TRUE, and therefore not returned through the WHERE clause).  Since every NULL-extended row produced by the outer join is removed by that filter, the query is logically equivalent to an inner join – all of the extra rows produced for non-matching rows get thrown away anyway.

select * from cc1 INNER join cc2 ON (1=1) WHERE (cc1.col1=cc2.col1)

If you look at the query plans, you will see that the second query is actually running an inner join because of this recognition in the QP that you don’t need an outer join.  (The QP can more freely reorder inner joins, so it prefers to convert outer to inner joins where possible).
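As an aside, you can see the three-valued logic piece in isolation with a one-line sketch:

-- NULL = 1 evaluates to UNKNOWN, which is not TRUE, so a WHERE clause would filter such a row out
select case when NULL = 1 then 'kept' else 'UNKNOWN - filtered out' end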

I hope that gives a better explanation as to why SQL Server returns different results for these two queries!

Happy Querying!

Conor


 

Contradictions within Contradictions

I had a question from a reader about contradiction detection.

The basic idea is to determine that this kind of query:

SELECT * FROM TABLE WHERE col1 > 5 and col1 < 0;

…is utterly meaningless and requires no work to be done.  It will always return zero rows.  In fact, in many cases the SQL Server QP will detect cases like this and actually remove the table from the query completely.  In its place, you have this magical “constant scan” operator, which is really just an in-memory row generator.  So, you can read from it instead of from a table.

This optimization is done in many places in the SQL Server QP, and it is great because it requires no I/O, no locks, and less CPU than actually reading from a table.

The question I had from a user had to do with cases when this optimization does NOT happen.

Since the reader sent me a perfect example, I’ll just copy it up here (note – I am happy to credit readers if they would like credit, but I don’t post their name unless they want me to do so).

-- drop table #tmp
create table #tmp (id Int Identity(1,1) Primary key, name VarChar(250))
go

insert into #tmp(name) values(NEWID())
insert into #tmp(name) values(NEWID())
go

set statistics profile on
go

-- Execution plan creates a Constant Scan
select * from #tmp
where id = 1 and id = 5
go

set statistics profile off
go

select * from #tmp  where id = 1 and id = 5
  |--Constant Scan


-- drop table tmp
create table tmp (id Int Identity(1,1) Primary key, name VarChar(250))
go

insert into tmp(name) values(NEWID())
insert into tmp(name) values(NEWID())
go

set statistics profile on

-- Why does the execution plan not create a Constant Scan for this case?
select * from tmp
where id = 1 and id = 5
go

set statistics profile off

SELECT * FROM [tmp] WHERE [id]=@1 AND [id]=@2
  |--Clustered Index Seek(OBJECT:([t1].[dbo].[tmp].[PK__tmp__3213E83F15502E78]), SEEK:([t1].[dbo].[tmp].[id]=CONVERT_IMPLICIT(int,[@1],0)),  WHERE:([t1].[dbo].[tmp].[id]=CONVERT_IMPLICIT(int,[@2],0)) ORDERED FORWARD)

So this is interesting – in one case the Constant Scan is done, while in the other case it is not.  Hrm… what’s up with that?

Well, some of you are familiar with autoparameterization – a feature that can take unparameterized queries, try to make them parameterized, and then avoid compilation for multiple similar simple queries.  Autoparameterization is tied to the “trivial plan” feature, meaning that the optimizer feels that there isn’t really a plan choice and thus going through the complete cost-based optimization process is unnecessary.

The difference you see is based on whether autoparameterization happens or not.  Now, the autoparameterization rules are fairly arcane at this point – they’ve been around, in some form, across various releases dating back to SQL 7.0.  TempDB has a different set of rules about whether a query should be autoparameterized, since those tables live in the globally shared TempDB.  So, in the #tmp case the query is not autoparameterized, the rest of the logic to simplify the query runs, and that logic detects the contradiction.

In the latter case, the query IS auto-parameterized.  The optimizer doesn’t see two parameters as always being a contradiction (you could pass the same value for both parameters), and so the query is not simplified.
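You can convince yourself that the parameterized form really isn’t a contradiction by passing the same value for both parameters.  Here is a quick sketch against the tmp table above (using named parameters in place of the internal @1/@2):

exec sp_executesql
    N'select * from tmp where id = @a and id = @b',
    N'@a int, @b int',
    @a = 1, @b = 1   -- same value for both parameters: not a contradiction, so the row with id = 1 comes back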

For what it is worth, some internal knowledge of what causes a plan to be considered trivial or not can be used to influence which plan you see.  For example, query hints make a plan non-trivial (why hint if there is no plan choice?), and as such you could do something like this:

select * from tmp
where id = 1 and id = 5 option (maxdop 1)

This will get you the constant scan in all databases.  Adding enough rows to get the cost above the threshold for parallelism (in a parallel-enabled machine + SKU of SQL) would also do it. 

Remember, trivial plan was developed before contradiction detection, so this should be viewed in that context.  It’s a bit odd, but the ecosystem developed this way and you should be aware of that when developing SQL Server apps.

I enjoy the questions – keep them coming!

Thanks,
Conor

SQL 2008 Geometry type tidbits


I spent some time playing with the SQL 2008 geometry data type.  This post covers some of the things I have learned about how it is implemented from playing with the feature.  Some of these will impact the feature’s usability for some people; others may not care.

1. Late binding of the geometry data is an interesting choice.  One of the nice properties of a SQL query is that, for many things, the parse and bind of a query can find all sorts of compilation errors during development.  You don’t have to run the query in order to see the simple mistakes, and that helps us all.  This type, however, doesn’t throw parsing errors for its data until you try to execute the query.  That’s not perfect – you have to actually execute spatial queries to see whether you have typed all of that complex data in correctly.  I can speak from experience that I tend to mess that stuff up.

Example:

create table g1 (col1 int identity, col2 geometry)
go
insert into g1(col2) values (Geometry::STGeomFromText('LINESTRING(0 0, 10 10, 21 2)',0));
go
insert into g1(col2) values (Geometry::STGeomFromText('LINESTRING(funky chicken)',0));
go

That last insert compiles just fine – running it returns an error.

2. The geometry type uses some form of the .NET Framework.  (I’ll point out that I have not enabled the CLR on my server.)  Given that the ordpath type requires that the CLR be enabled, I’ll guess that this means there is a special implementation backing the geometry type.

Here’s the runtime error from the previous query:

Msg 6522, Level 16, State 1, Line 1
A .NET Framework error occurred during execution of user-defined routine or aggregate “geometry”:
System.FormatException: 24141: A number is expected at position 16 of the input. The input has funky.
System.FormatException:
   at Microsoft.SqlServer.Types.OpenGisWktReader.RecognizeDouble()
   at Microsoft.SqlServer.Types.OpenGisWktReader.ParsePoint()
   at Microsoft.SqlServer.Types.OpenGisWktReader.ParseLineStringText(FigureAttributes attributes)
   at Microsoft.SqlServer.Types.OpenGisWktReader.ParseLineStringTaggedText()
   at Microsoft.SqlServer.Types.OpenGisWktReader.ParseGeometryTaggedText()
   at Microsoft.SqlServer.Types.OpenGisWktReader.ReadGeometry()
   at Microsoft.SqlServer.Types.SqlGeometry.STGeomFromText(SqlChars geometryTaggedText, Int32 srid)
.
The statement has been terminated.

(I think I will be making t-shirts that say “The input has funky” on them… let me know if you want in :).

I am actually pretty happy with the error message – the only thing missing is which column caused the error.  You’ll notice that “position 16” is inside the second string being parsed by the .NET routines.  I personally find stack traces unacceptable for general use, but the .NET and Java folks seem to have doubled down on this approach, so we all get to learn about their libraries even in SQL now…

This is just a seam in how it was implemented.  Be aware of this when using the type.

3. The geometry type has been (arbitrarily) given an estimated row width of about 4,000 bytes – well, at least that is what shows up in showplan.  I inserted a bunch of data into this table (1.4 million rows in about 148MB), and the cost estimates don’t seem to use the 4,000-byte number for the row width: reading 5.6GB would take a lot longer than the 13 seconds or so that the costing implies, so I am guessing that the 4,000 bytes is just what gets reported in showplan rather than what is used to cost the plan internally.  I’m just using a heap (no indexes) for this experiment.

4. I’m waiting for maps.live.com to make it easy for me to geocode addresses and insert them into SQL.  It has a feature to send a location to some GPS device, but I haven’t played with it beyond learning that it didn’t detect that I had blocked cookies by default ;).

5. “select col2.ToString() from g1” is a way to see something useful from that data type.  Otherwise you get back binary goo…
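For example, a small sketch against the g1 table from earlier:

-- Render the geometry values as Well-Known Text instead of the raw binary form
select col1, col2.ToString() as wkt from g1
-- col2.STAsText() is the OGC-flavored method and returns similar output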

That’s all for tonight.

Thanks,

Conor Cunningham

COUNT(1) vs. COUNT(star)

A follow-up to my previous post.

Is COUNT(1) faster than COUNT(*)?

No, COUNT(1) is mapped into COUNT(*) early in the QP.  From the logic of my previous post, the constant 1 is always non-null, so COUNT(1) is the same as just doing a COUNT(*).  Since it is semantically equivalent to COUNT(*), the QP simply maps it to COUNT(*), because that keeps things simpler and avoids special cases in areas like local-global aggregation and moving group bys around joins.

You can actually look at the query plan for "SELECT COUNT(1) from Table;" and see that the output plan is COUNT(*).
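For instance, a quick sketch (borrowing table1 from the trigger post above – any table will do):

set showplan_text on
go
-- The plan text that comes back computes the aggregate as Count(*), not Count(1)
select count(1) from table1
go
set showplan_text off
go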

You can save the QP a few instructions by writing COUNT(*) directly.  You will also make the ANSI SQL gods happier.

I get a lot of questions about these syntax issues – it makes sense, since differences like this often matter a great deal in traditional procedural programming languages.  In SQL, however, they usually cause only very minor performance differences, if any.  The big differences happen when your slightly different query actually means something semantically different and thus implies a different amount of work to compute.

Please keep sending in your questions – I'm happy to answer them!

Thanks,

Conor Cunningham

 

COUNT(star) vs COUNT(column) vs. COUNT(DISTINCT) vs. COUNT(col) OVER(…)

So COUNT isn't an operator, but you should view this post as a "what is this operator" kind of post since I talk about how these things work and what they mean.

One of the benefits of building database engines is that all (well, most) of the syntax rules end up stuck in your head.  I know that many of these things are completely non-obvious, though, so I think it’s worth spelling out what each of these forms means, since I see them misused in lots of queries, blogs, etc.

-- Returns the count of rows in the table at the time the query is run, under the transaction isolation level in which the query is run (usually read committed for SQL Server):
SELECT COUNT(*) FROM Table

So that example is easy enough…

SELECT COUNT(column1) FROM Table

This is actually NOT the same query as COUNT(*), in general.  It means “count the number of rows with a non-null column1 value”.

(Now for something cool: if you run this on a column that is non-nullable, SQL Server converts it into the COUNT(*) case, because that form is faster to run – it doesn’t have to examine the data in each row and can instead just count rows.)

SELECT COUNT(DISTINCT column1) will count the number of UNIQUE non-null column1 values.  It does not count NULL.  I don't believe SQL Server completely removes the DISTINCT operation for non-null columns in all cases.  It can in some cases.
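To make the differences concrete, here is a tiny sketch (the table and data are invented for illustration):

create table #c (column1 int)
insert into #c (column1) values (1), (1), (2), (NULL)

select count(*)                as all_rows,           -- 4
       count(column1)          as non_null_values,    -- 3 (the NULL row is not counted)
       count(distinct column1) as distinct_non_null   -- 2
from #c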

COUNT(col) OVER (…) is a completely different beast.  It runs a count computation using the rules you’ve seen above, but it does not collapse the rows – it adds a new column to every row containing the computed count.
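A small sketch of that last one (the table and column names are invented for illustration):

-- Every row keeps its own columns and also gets the count of non-null Amount values for its CustomerID
select CustomerID,
       Amount,
       count(Amount) over (partition by CustomerID) as amount_count_for_customer
from dbo.Orders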

Remember that these are semantically different operations.

Conor