In this post, I’ll go over a few things about IN plans in the SQL Server Query Optimizer. Currently I’m using the SQL Server 2008 November CTP, but this behavior is still valid on SQL 2005.
I’ll be using the following script for my discussion, and you can try this on your own.
create database in1
use in1
create table t1(col1 int identity, col2 int default rand()*100000, col3 binary(2000))
declare @i int
set @i=0
while @i < 10000
begin
insert into t1 default values
set @i=@i+1
end
create table t2(col1 int identity, col2 binary(200))
— plan 1
select * from t1 where col2 in (1)
— plan 2
select * from t1 where col2 in (1,2,3,4,5,6,7,8,9,10)
— plan 3
select * from t1 where col2 in (1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16)
— plan 4
select * from t1 inner join t2 on t1.col1=t2.col1
where t1.col2 in (
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,
51,52,53,54,55,56,57,58,59,60,61,62,63,64)
— plan 5
select * from t1 inner join t2 on t1.col1=t2.col1
where t1.col2 in (
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,
51,52,53,54,55,56,57,58,59,60,61,62,63,64,65)
So, we have a database with 2 tables, and I created a bunch of random data in a column so that we can play with IN plans.
ok, plan 1 is “easy”. I use IN with a single operator. This happens to turn into a simple plan that looks like this (using set showplan_text to view):
|–Table Scan(OBJECT:([in1].[dbo].[t1]),WHERE:([in1].[dbo].[t1].[col2]=CONVERT_IMPLICIT(int,[@1],0)))
Hrm. hey, the where clause is in the table scan? no separate operator? It’s a physical optimization that I’ll cover in another post (I may have covered this in my previous blog posts – I’ll go check), and it’s called “pushing non-SARG predicates” in SQL Server.
Let’s compare with the plan for plan 2:
|–Table Scan(OBJECT:([in1].[dbo].[t1]), WHERE:([in1].[dbo].[t1].[col2]=(1) OR [in1].[dbo].[t1].[col2]=(2) OR [in1].[dbo].[t1].[col2]=(3) OR [in1].[dbo].[t1].[col2]=(4) OR [in1].[dbo].[t1].[col2]=(5) OR [in1].[dbo].[t1].[col2]=(6) OR [in1].[dbo].[t1].[col2]=(7) OR [in1].[dbo].[t1].[col2]=(8) OR [in1].[dbo].[t1].[col2]=(9) OR [in1].[dbo].[t1].[col2]=(10)))
Same plan shape, just more conditions.
I wonder if this goes on forever? Well, plan 3 looks different:
|–Filter(WHERE:([in1].[dbo].[t1].[col2]=(1) OR [in1].[dbo].[t1].[col2]=(2) OR [in1].[dbo].[t1].[col2]=(3) OR [in1].[dbo].[t1].[col2]=(4) OR [in1].[dbo].[t1].[col2]=(5) OR [in1].[dbo].[t1].[col2]=(6) OR [in1].[dbo].[t1].[col2]=(7) OR [in1].[dbo].[t1].[col2]=(8) OR [in1].[dbo].[t1].[col2]=(9) OR [in1].[dbo].[t1].[col2]=(10) OR [in1].[dbo].[t1].[col2]=(11) OR [in1].[dbo].[t1].[col2]=(12) OR [in1].[dbo].[t1].[col2]=(13) OR [in1].[dbo].[t1].[col2]=(14) OR [in1].[dbo].[t1].[col2]=(15) OR [in1].[dbo].[t1].[col2]=(16)))
|–Table Scan(OBJECT:([in1].[dbo].[t1]))
Now the filter is split out into another query operator. (It starts happening with 16 predicates)
OK, these 3 plans have been simple query plans. In fact, you can do “set showplan_xml on” and run the query and learn something about how much optimization was done on them:
<ShowPlanXML xmlns=”http://schemas.microsoft.com/sqlserver/2004/07/showplan” Version=”1.0″ Build=”10.0.1075.23″>
<BatchSequence>
<Batch>
<Statements>
<StmtSimple StatementText=”select * from t1 where col2 in (1)” StatementId=”1″ StatementCompId=”1″ StatementType=”SELECT” StatementSubTreeCost=”2.48687″ StatementEstRows=”1″ StatementOptmLevel=”TRIVIAL” ParameterizedText=”(@1 tinyint)SELECT * FROM [t1] WHERE [col2]=@1″>
The end-to-end design of the optimizer is beyond the scope of this post, but there are various stages of optimization and queries can “quit early” if they find a “good enough” plan. In this case, there aren’t actually any real cost-based decisions to make about which plan to run, so these plans are called “trivial”.
Once you start adding joins into the mix, there are join orders to consider, and those are cost-based. You can see that the optimization level for plan 4 differs:
<StmtSimple StatementText=”select * from t1 inner join t2 on t1.col1=t2.col1
where t1.col2 in (
1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,
51,52,53,54,55,56,57,58,59,60,61,62,63,64)” StatementId=”1″ StatementCompId=”1″ StatementType=”SELECT” StatementSubTreeCost=”2.70289″ StatementEstRows=”1″ StatementOptmLevel=”FULL“>
One would infer that FULL > TRIVIAL ;).
Plan 4:
|–Hash Match(Inner Join, HASH:([in1].[dbo].[t2].[col1])=([in1].[dbo].[t1].[col1]))
|–Table Scan(OBJECT:([in1].[dbo].[t2]))
|–Filter(WHERE:([in1].[dbo].[t1].[col2]=(1) OR [in1].[dbo].[t1].[col2]=(2) OR [in1].[dbo].[t1].[col2]=(3) OR [in1].[dbo].[t1].[col2]=(4) OR [in1].[dbo].[t1].[col2]=(5) OR [in1].[dbo].[t1].[col2]=(6) OR [in1].[dbo].[t1].[col2]=(7) OR [in1].[dbo].[t1].[col2]=(8) OR [in1].[dbo].[t1].[col2]=(9) OR [in1].[dbo].[t1].[col2]=(10) OR [in1].[dbo].[t1].[col2]=(11) OR [in1].[dbo].[t1].[col2]=(12) OR [in1].[dbo].[t1].[col2]=(13) OR [in1].[dbo].[t1].[col2]=(14) OR [in1].[dbo].[t1].[col2]=(15) OR [in1].[dbo].[t1].[col2]=(16) OR [in1].[dbo].[t1].[col2]=(17) OR [in1].[dbo].[t1].[col2]=(18) OR [in1].[dbo].[t1].[col2]=(19) OR [in1].[dbo].[t1].[col2]=(20) OR [in1].[dbo].[t1].[col2]=(21) OR [in1].[dbo].[t1].[col2]=(22) OR [in1].[dbo].[t1].[col2]=(23) OR [in1].[dbo].[t1].[col2]=(24) OR [in1].[dbo].[t1].[col2]=(25) OR [in1].[dbo].[t1].[col2]=(26) OR [in1].[dbo].[t1].[col2]=(27) OR [in1].[dbo].[t1].[col2]=(28) OR [in1].[dbo].[t1].[col2]=(29) OR [in1].[dbo].[t1].[col2]=(30) OR [in1].[dbo].[t1].[col2]=(31) OR [in1].[dbo].[t1].[col2]=(32) OR [in1].[dbo].[t1].[col2]=(33) OR [in1].[dbo].[t1].[col2]=(34) OR [in1].[dbo].[t1].[col2]=(35) OR [in1].[dbo].[t1].[col2]=(36) OR [in1].[dbo].[t1].[col2]=(37) OR [in1].[dbo].[t1].[col2]=(38) OR [in1].[dbo].[t1].[col2]=(39) OR [in1].[dbo].[t1].[col2]=(40) OR [in1].[dbo].[t1].[col2]=(41) OR [in1].[dbo].[t1].[col2]=(42) OR [in1].[dbo].[t1].[col2]=(43) OR [in1].[dbo].[t1].[col2]=(44) OR [in1].[dbo].[t1].[col2]=(45) OR [in1].[dbo].[t1].[col2]=(46) OR [in1].[dbo].[t1].[col2]=(47) OR [in1].[dbo].[t1].[col2]=(48) OR [in1].[dbo].[t1].[col2]=(49) OR [in1].[dbo].[t1].[col2]=(50) OR [in1].[dbo].[t1].[col2]=(51) OR [in1].[dbo].[t1].[col2]=(52) OR [in1].[dbo].[t1].[col2]=(53) OR [in1].[dbo].[t1].[col2]=(54) OR [in1].[dbo].[t1].[col2]=(55) OR [in1].[dbo].[t1].[col2]=(56) OR [in1].[dbo].[t1].[col2]=(57) OR [in1].[dbo].[t1].[col2]=(58) OR [in1].[dbo].[t1].[col2]=(59) OR [in1].[dbo].[t1].[col2]=(60) OR [in1].[dbo].[t1].[col2]=(61) OR [in1].[dbo].[t1].[col2]=(62) OR [in1].[dbo].[t1].[col2]=(63) OR [in1].[dbo].[t1].[col2]=(64)))
|–Table Scan(OBJECT:([in1].[dbo].[t1]))
ok, this looks like an extension of what we saw in plan 3 – a filter operator above a table scan, followed by a join.
Query 5 is only trivially different than 4 – 65 conditions instead of 64:
|–Nested Loops(Left Semi Join, OUTER REFERENCES:([in1].[dbo].[t1].[col2]))
|–Hash Match(Inner Join, HASH:([in1].[dbo].[t2].[col1])=([in1].[dbo].[t1].[col1]))
| |–Table Scan(OBJECT:([in1].[dbo].[t2]))
| |–Table Scan(OBJECT:([in1].[dbo].[t1]))
|–Filter(WHERE:([in1].[dbo].[t1].[col2]=[Expr1008]))
|–Constant Scan( VALUES:(((1)),((2)),((3)),((4)),((5)),((6)),((7)),((8)),((9)),((10)),((11)),((12)),((13)),((14)),((15)),((16)),((17)),((18)),((19)),((20)),((21)),((22)),((23)),((24)),((25)),((26)),((27)),((28)),((29)),((30)),((31)),((32)),((33)),((34)),((35)),((36)),((37)),((38)),((39)),((40)),((41)),((42)),((43)),((44)),((45)),((46)),((47)),((48)),((49)),((50)),((51)),((52)),((53)),((54)),((55)),((56)),((57)),((58)),((59)),((60)),((61)),((62)),((63)),((64)),((65))))
Hey, what’s this? A Semi Join? Well, this is a join where rows from one side are joined with another table (in this case, an in-memory table) and only qualifying rows from the original table are kept. To make this work, you can’t return any columns from the “other side”, and you have to make sure you don’t have more than one match on the constant table. So, you need to remove duplicates. Try adding another ‘1’ into the IN list and then look at the plan – it removes them.
Why are there different plans at these points? Well, while I can’t speak for the development team, as I’m not on it anymore, but I will say that often these are added as an attempt to deal with poor performance or some very common queries that are seen all the time (perhaps from a large ISV). There’s obviously not a ton of magic to why 16 and 64 are picked for the magic numbers, and I’m sure if we hunted we could find some cases where the transition between plans was not perfect. Often changing the numbers, even if they aren’t quite perfect, will just cause instability elsewhere in terms of plan quality so they are left to try to avoid pain to customers. I’m hopeful that you can use this post to start to do the same kind of reasoning yourself. While I have a pretty good idea of where the plans will change, I’m not revealing anything you can’t see with some trial and error on your own. I actually had forgotten the 64 number, as I think it changed at one point – so, I had to do figure it out again ;).
OK, this covers the basic forms of plans with INs that are available in SQL Server. I plan to work on the plan cache next, including forced parameterization.
Thanks,
Conor