This post builds upon my previous post on some of the various kinds of plans for the IN operator.  I will cover some of the query plan cache behavior associated with IN plans.

For this post we’ll be using the following query which will let you look at the text of plans in the cache and also see how many plans are in the cache.  (One can use “dbcc freeproccache” to clean out the cache whenever you are testing this – please don’t do this on a production system unless you know what you are doing)

(Hey, I figured out how to post code properly in the blog.  and there was much rejoicing! yeah!)

SELECT text, plan_handle, * FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(sql_handle)
CROSS APPLY sys.dm_exec_text_query_plan(qs.plan_handle, qs.statement_start_offset, qs.statement_end_offset) AS qp

OK, the two things of interest are the number of rows returned and the text of each plan.  I will note that the data in these sys.dm_exec* functions is “mostly up to date”, but it doesn’t follow an exact transactional semantic – so, you may need to wait a second or two to get the “proper” state of the plan cache after each step.

If you run dbcc freeproccache, then this query, you’ll see one entry in the result (for the query you ran to get the status).  If we run query 1 from our previous post (select * from t1 where col2 in (1)) and look at the result, we get something interesting:
(@1 tinyint)SELECT * FROM [t1] WHERE [col2]=@1

Hey, what’s this?  Well, this is called “auto-parameterization” in SQL Server, and it happens on Trivial Plans (see previous post).  The idea is that SQL Server has a fancy query processor, but fancy means powerful for big queries and perhaps not the fastest on the small stuff.  In general, we’re not saying that it’s slow, but if you are compiling thousands of these things right in a row, you may notice a difference.  So, this autoparameterization feature is an attempt to reduce compilation time by finding plans that have no cost-based decisions and then templatize-ing them so that all other queries that look the same can re-use the plan.  This has a huge advantage for customers with query-generation components, as they often will generate many, many queries that only differ by the constants.  So, SQL Server uses this on trivial plans.

If you run the query “select * from t1 where col2 in (2)” (so, the constant is changed but nothing else), then run our big fancy cross apply query, then you see 2 query plans instead of 3.  Both of our new queries use the following plan:

(@1 tinyint)SELECT * FROM [t1] WHERE [col2]=@1

Now, running queries with more than one element in the IN list generates a plan for each of them.  (I don’t recall exactly why off-hand, but I’ll think about it some more)

So, you can see the plan entries for a range of variations.

select  * from t1 where col2 in (2,1)
select * from t1 where col2 in (1,2,3,4,5,6,7,8,9,10)
(@1 tinyint)SELECT * FROM [t1] WHERE [col2]=@1
select * from t1 where col2 in (2,1)
select * from t1 where col2 in (1,2)
select * from t1 where col2 =1 or col2=2

I’ll point out that, except for trivial plans that are auto-parameterized, the plan text is used as the “key” into this cache.  It does not logically compare the output plans or even a normalized form of the input.  So, you’ll notice that “
select  * from t1 where col2 in (2,1)” is in the cache twice because I put a space in the second time I ran that query.  So, you need to consider this when we get to the original question about plan cache pollution.  If your query generation engine (LINQ or whatever) adds comments into the query text sent to SQL Server, that could very well lead to more plans in the cache.

The more complex queries from my last post also generate unique plan entries in the same way, so I’ll skip those here.

The next interesting factor to consider is forced parameterization.  This was added in SQL Server 2005, and it is both powerful and dangerous.  This feature is a database level feature, most literals are converted to paramters before query optimization in most query plans.  This is covered under the “Forced Parameterization” topic in SQL Server books online, so I’ll suggest you read that if you want all the details.  I did a post or two on plan sensitivity to different parameter values, so beware that in some cases you may end up with query plans that are suboptimal and are stuck in the cache.  So, you should experiment with this setting to make sure that it really does benefit your application before you turn it on.

As I like to live dangerously, I will ignore all of these warnings and proceed with great vigor in the name of Science!

Let’s try three queries that yielded separate plans without forced parameterization enabled:

select * from t1 where col2 in (2,1)
select * from t1 where col2 in (1,2)
select  * from t1 where col2 in (2,1)

Now we get one entry in the cache, and it looks similar to those auto-parameterized queries:
(@0 int,@1 int)select * from t1 where col2 in ( @0 , @1 )

Notice that even spaces are “fixed” in terms of cache hits in this mode.  The query is parsed and then the text is generated from the parsed tree, which gives one the ability to normalize the input SQL for the plan cache lookup.

Adding a 3-element IN list query generates a new plan in the cache:
select * from t1 where col2 in (1,2,3)
(@0 int,@1 int,@2 int)select * from t1 where col2 in ( @0 , @1 , @2 )
(@0 int,@1 int)select * from t1 where col2 in ( @0 , @1 )

OK, so we can get a plan per number of elements, which perhaps has its advantages in some cases.  Good to know.

I also tried our large cardinality IN list example (the one that generated the semi-join in my previous post):
(@0 int,@1 int,@2 int,@3 int,@4 int,@5 int,@6 int,@7 int,@8 int,@9 int,@10 int,@11 int,@12 int,@13 int,@14 int,@15 int,@16 int,@17 int,@18 int,@19 int,@20 int,@21 int,@22 int,@23 int,@24 int,@25 int,@26 int,@27 int,@28 int,@29 int,@30 int,@31 int,@32 int,@33 int,@34 int,@35 int,@36 int,@37 int,@38 int,@39 int,@40 int,@41 int,@42 int,@43 int,@44 int,@45 int,@46 int,@47 int,@48 int,@49 int,@50 int,@51 int,@52 int,@53 int,@54 int,@55 int,@56 int,@57 int,@58 int,@59 int,@60 int,@61 int,@62 int,@63 int,@64 int,@65 int)select * from t1 inner join t2 on t1 . col1 = t2 . col1 where t1 . col2 in ( @0 , @1 , @2 , @3 , @4 , @5 , @6 , @7 , @8 , @9 , @10 , @11 , @12 , @13 , @14 , @15 , @16 , @17 , @18 , @19 , @20 , @21 , @22 , @23 , @24 , @25 , @26 , @27 , @28 , @29 , @30 , @31 , @32 , @33 , @34 , @35 , @36 , @37 , @38 , @39 , @40 , @41 , @42 , @43 , @44 , @45 , @46 , @47 , @48 , @49 , @50 , @51 , @52 , @53 , @54 , @55 , @56 , @57 , @58 , @59 , @60 , @61 , @62 , @63 , @64 , @65 )

Yes, it seems to keep on going.  For once, I read the documentation and determine that one of the disqualifying conditions is:

Statements that contain more than 2,097 literals that are eligible for

So, it will stop at some point based on how many literals you have in your query.

So while we’re not up to LINQ yet, I think we have a pretty good basis to evaluate whether LINQ or any other query generator is going to cause plan cache issues.  I’ll try to do some research on the shipping version of LINQ and see what I can determine by examining the plans that get generated.

That’s enough for one evening from me.