Finding what queries in the plan cache use a specific index

In the last 48 hours I have seen two different people having the exact same problem so I thought that I would go about blogging some code I’ve had lying around for a while and been meaning to blog that would help them with finding the root cause of their problems.  In both cases, the question started because the person asking it noticed that Page Life Expectancy was below 300, which is based on dated information published by Microsoft before I began working with SQL Server, and consequently is something I’ll discuss a little at the end of this blog post.  In the first case I started asking questions about the amount of memory installed, and what was using the data cache in the server, (see Paul’s blog post Performance issues from wasted buffer pool memory).  In the second case this information had already been presented and as a matter of the troubleshooting of both of the problems it was noted that a single index was using a majority of the data pages in the buffer cache on the servers and the question became, “How do I find out what is using this index?”  and a solution that would help them with figuring out what queries are using a specific index in a database. 

Another Case for Querying the Plan Cache?

In the past I have written a number of scripts that can be used to find interesting information in the plan cache of SQL Server and this is just another one of those.  I originally wrote this back at the end of March when my good friend Thomas LaRock (Blog|Twitter) sent me an email asking if I would mind him providing my contact information to someone that had a question about Extended Events. What the person wanted to do was identify all the queries that were using a specific index using Extended Events, but it turned out this isn’t possible in Extended Events.  Bummer…  However I could offer an alternate solution that involved querying the plan cache using XQuery and then monitoring what was found over time to determine a majority of the queries that use a specific index. 

SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
DECLARE @IndexName AS NVARCHAR(128) = 'PK__TestTabl__FFEE74517ABC33CD';

— Make sure the name passed is appropriately quoted
IF (LEFT(@IndexName, 1) <> '[' AND RIGHT(@IndexName, 1) <> ']') SET @IndexName = QUOTENAME(@IndexName);
–Handle the case where the left or right was quoted manually but not the opposite side
IF LEFT(@IndexName, 1) <> '[' SET @IndexName = '['+@IndexName;
IF RIGHT(@IndexName, 1) <> ']' SET @IndexName = @IndexName + ']';

— Dig into the plan cache and find all plans using this index
;WITH XMLNAMESPACES
   (DEFAULT '
http://schemas.microsoft.com/sqlserver/2004/07/showplan')   
SELECT
stmt.value('(@StatementText)[1]', 'varchar(max)') AS SQL_Text,
obj.value('(@Database)[1]', 'varchar(128)') AS DatabaseName,
obj.value('(@Schema)[1]', 'varchar(128)') AS SchemaName,
obj.value('(@Table)[1]', 'varchar(128)') AS TableName,
obj.value('(@Index)[1]', 'varchar(128)') AS IndexName,
obj.value('(@IndexKind)[1]', 'varchar(128)') AS IndexKind,
cp.plan_handle,
query_plan
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS qp
CROSS APPLY query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple') AS batch(stmt)
CROSS APPLY stmt.nodes('.//IndexScan/Object[@Index=sql:variable("@IndexName")]') AS idx(obj)
OPTION(MAXDOP 1, RECOMPILE);


Now I have been asked this before so I will go ahead and defend it ahead of time.  I use transaction isolation level read uncommitted when querying the plan cache just to make sure I don’t cause any problems because this can be a long running query depending on the size of the plan cache and I don’t care if I get some phantom plan as a result of being read uncommitted.  I also use MAXDOP 1 for the query because it tends to be a CPU resource hog, and a lot of times I am running this on a production system or telling someone to run this on a production system and it has the potential for impacting performance.  If your server is considerably beefy hardware wise, it probably won’t matter, and for most people it doesn’t negatively impact their environment, but like anything else, I apply a rule of “First do no harm” to something that might be run against a production system.

Where have I used this code since originally writing it back in March?  All over the place, but generally when working with index tuning so I can find what is using a specific index in the system to evaluate what impact changing that index or removing it might have to the environment.  It is also useful for troubleshooting the problem that a specific index is using all of the buffer pool and causing buffer pool flushing to occur on a constant basis resulting in a low Page Life Expectancy for the server.  Using this query you can easily find the plans and statements that have been using this index and begin working to review the query stats information to find out where the problem is in the code, design, or indexing to try and improve the situation.

What’s Wrong about Page Life Expectancy >= 300?

Aside from being a performance metric that was recommended by Microsoft that has never been updated to reflect the changes that have occurred in hardware over the last 12 years, there is nothing wrong with this recommendation (i.e. sarcasm… :-)).  The problem with this fixed value is that it was determined when servers generally had 4GB of RAM installed in them, and servers with 16GB+ of RAM installed in them were extremely expensive and fairly rare. 

This performance counter tells you the number of seconds, at the current point in time, a page will remain in memory without being referenced (so a value of 300 means your buffer pool is flushing every 5 minutes).  So 10 years ago when you were reading anywhere from 1.7GB up to 12GB of data (depending on your server’s memory) from disk into the buffer cache every 5 minutes it was a sign of memory pressure on the server and something you needed to investigate. 

Fast forward to today, where it is not uncommon for a SQL Servers to have anywhere from 48-144GB+ RAM installed in them.  These RAM values equate to 32-132GB of buffer cache size depending on the ‘max server memory’ sp_configure option setting and the amount of memory being used for the plan cache, but you probably get the point by now.  If reading 1.7GB-12GB of data every 5 minutes was bad, how bad would it have to be to read 32GB-132GB of data from disk every 5 minutes consistently? 

Today the value 300 is ridiculously small as a threshold for when to start worrying about buffer pool pressure.

Now from my own experiences as a DBA and over the past 6 months as a consultant, the I/O subsystem is one of the most undersized components in a majority of SQL Server implementations, so the last thing I want to be doing is hammering the disks in my server because I relied on a performance counter that was out of date years ago but never updated.  For the last 3-4 years I have relied on the amount of memory being used by the data cache in SQL Server to determine when Page Life Expectancy was a sign of impending problems.  This means I replaced the old 300 threshold with a more reasonable value of (DataCacheSizeInGB/4GB *300) as a basis for when to begin investigating things. 

Keep in mind that short drops in Page Life Expectancy can be the result of a large query executing and are not a sign of problems in the system (though it may be a sign that you have a tuning opportunity).

Summary: don’t use the value 300 as any kind of Page Life Expectancy threshold. Anyone that continues to recommend doing so is doing you a disservice. Use an adaptive formula like (DataCacheSizeInGB/4GB*300). Even better – monitor the steady-state value of PLE and react when it dips *and stays* below your steady-state value.

33 thoughts on “Finding what queries in the plan cache use a specific index

  1. Very nice post Jonathan. I completely agree about watching the trend, compared to your baselines, for PLE, rather than focusing on the current value.

    I really like to look at Memory Grants Pending and Memory Grants Outstanding as a good way to confirm severe memory pressure. If they are above zero for any length of time, that is a very strong indicator of memory pressure.

  2. This displays duplicate rows. I can’t just add a DISTINCT to the SELECT because of the XML. Is there a good way to eliminate the display of duplicates?

  3. Hey Mark,

    If it is showing duplicate, then you have multiple plans with the same queries in them which would occur for an adhoc workload, or the index is referenced multiple times in the same plan, which also happens if you have multiple references to the same table. You can drive down on this a little by doing a CAST of the query_plan from XML to NVARCHAR(MAX) and then running the DISTINCT as a derived table and then going back to XML outside.

    SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
    DECLARE @IndexName AS NVARCHAR(128) = ‘PK__TestTabl__FFEE74517ABC33CD’;

    — Make sure the name passed is appropriately quoted
    IF (LEFT(@IndexName, 1) <> ‘[‘ AND RIGHT(@IndexName, 1) <> ‘]’) SET @IndexName = QUOTENAME(@IndexName);
    –Handle the case where the left or right was quoted manually but not the opposite side
    IF LEFT(@IndexName, 1) <> ‘[‘ SET @IndexName = ‘[‘+@IndexName;
    IF RIGHT(@IndexName, 1) <> ‘]’ SET @IndexName = @IndexName + ‘]’;

    — Dig into the plan cache and find all plans using this index
    ;WITH XMLNAMESPACES
    (DEFAULT ‘http://schemas.microsoft.com/sqlserver/2004/07/showplan’)
    SELECT
    SQL_Text,
    DatabaseName,
    SchemaName,
    TableName,
    IndexName,
    IndexKind,
    plan_handle,
    CAST(query_plan AS XML) AS query_plan
    FROM
    (
    SELECT DISTINCT
    stmt.value(‘(@StatementText)[1]’, ‘varchar(max)’) AS SQL_Text,
    obj.value(‘(@Database)[1]’, ‘varchar(128)’) AS DatabaseName,
    obj.value(‘(@Schema)[1]’, ‘varchar(128)’) AS SchemaName,
    obj.value(‘(@Table)[1]’, ‘varchar(128)’) AS TableName,
    obj.value(‘(@Index)[1]’, ‘varchar(128)’) AS IndexName,
    obj.value(‘(@IndexKind)[1]’, ‘varchar(128)’) AS IndexKind,
    cp.plan_handle,
    CAST(query_plan AS NVARCHAR(MAX)) AS query_plan
    FROM sys.dm_exec_cached_plans AS cp
    CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS qp
    CROSS APPLY query_plan.nodes(‘/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple’) AS batch(stmt)
    CROSS APPLY stmt.nodes(‘.//IndexScan/Object[@Index=sql:variable("@IndexName")]’) AS idx(obj)
    ) AS tab
    OPTION(MAXDOP 1, RECOMPILE);

  4. Regarding the 300 threshold; Microsoft is still referring to it in its "SQL Server Risk Assessment Program". Can somebody tell them to stop that?
    [quote]Page life expectancy tells you, on average, how long a data page will remain in the buffer pool.
    If this value is too low (for example, less than 300), it indicates that there is pressure on the
    buffer pool because of incoming page reads. There is the risk that additional physical I/O is
    required to read pages from disk instead of reading them in the buffer pool.
    [/quote]

    But I rater use your formula which makes more sense, thanks. I’m keeping statics on the PLE value myself.

  5. Hmmm – I guess I’m confused then. Above in the post, you said the formula was written because it dealt with servers with "anywhere from 48-144GB+ RAM installed." Is that what you’re seeing as the "normal SQL Server environment for the readers"? It sounds like you’re seeing most readers have over 48GB of RAM coupled with poorly performing IO subsystems, and yeah, in that case, the formula makes perfect sense. That just hasn’t been the typical setup I’ve seen in the wild though. Seems like in the wild (not for readers of more sophisticated SQL blogs) that memory numbers are still much lower on average.

  6. Hmm – I’m going to disagree here about 300 being worthless. You note that:

    "Fast forward to today, where it is not uncommon for a SQL Servers to have anywhere from 48-144GB+ RAM installed in them."

    You could say the exact same thing about IO subsystems – it’s not uncommon to see SQL Servers with solid state drives acting as a second-tier buffer pool, tiered storage with hot data staying on SSD, or 10Gb iSCSI with hundreds of drives – especially in servers with 144GB+ of memory on them. Memory capacity/speed didn’t grow in a vacuum.

    Don’t get me wrong, adaptive formulas are great, but when someone’s scanning lots of metrics, *any* of these types of metrics are less relevant anyway. You’re better off checking wait stats first.

  7. Brent,

    Even if you have a SSD to act as a "second-tier buffer pool", which the majority of SQL Servers out there don’t, relying on 300 seconds by the math just doesn’t work out. Lets say you have a database server with 144GB RAM and the data cache is using 120GB of that memory to account for the plan cache and other memory allocations needed by the instance. That would equate to 122880MB/300=409.6MB/sec sustained transfer for Random I/O that the disk I/O subsystem has to maintain constantly. Depending on the SSD you went with that could already be a problem for you. It certainly isn’t a problem for the enterprise class SSDs like the FusionIO PCI-X drives, but if you are going with a cheaper SSD, which is what I see occurring more and more on forums posts and in questions I receive, you’ve already passed the throughput characteristics for the SSD. Are there environments out there that have I/O subsystems that can sustain 409MB/sec of random I/O activity constantly? Absolutely! Is this the normal SQL Server environment for the readers of most blog posts, articles, and forum responses that are trying to determine how to troubleshoot performance problems on their SQL Servers? Absolutely not!

    Continuing to promote a non-adaptive metric for troubleshooting is a disservice to the community because it doesn’t fit the average environment out there.

  8. Hey Brent,

    You must not have clicked the link to the forums thread where the server has 64GB RAM in it. Yes I see it more and more on the forums where servers have 48GB+ RAM and they continue to be told, or in the case of the reference post, use a value of 300 for Page Life Expectancy. I noticed this trend as early as 2 years ago in my forums activities and it has only increased steadily in the last two years. This is one of the things I have liked most about answering questions on the forums over the last four years is that it helps to keep me grounded in what the actual implementations of SQL Server are in general, rather than only thinking in terms of what I work with day to day.

    It is really easy, and by comparison to just a few years ago cheap, to get a 12-24 core server with 48GB+ RAM installed in it. However, those servers only come with 8 internal disks in most cases. At best you can get 24 internal disks if you choose one that has all 2.5" drives across the front. Even environments with a DAS that holds 14-15 disks externally, is not going to be able to hit the I/O numbers previously mentioned by calculating this out.

    Once you start getting above 16-32GB RAM the 300 value easily begins to exceed the MB/sec throughput of most DAS setups I see repeatedly on the forums and in consulting work. Take 32GB RAM for example, of you say 28GB of that is data cache that equates to 95.5MB/sec constant random access, which is not impossible for a DAS, a 5-6 disk 15K RPM RAID 5 array will peak out somewhere close to here for 64KB I/O block sizes, but then you’ve bottlenecked your disk subsystem. Now you can certainly scale the disks out to a 14 disk RAID 10 in a DAS like a MD3000 and you get somewhere in the range of 380MB/sec depending on how you configure the array but that is going to cost a lot more to upgrade to and if you go the iSCSI route you still have path limitations to deal with.

  9. How is that surprising given their cost, and that they have only become the hot topic in the last year and a half? I see more and more questions about the on the forums but cost of enterprise SLC NAND based SSDs is still a factor in the small to medium business environments. I would not recommend that someone use a desktop grade SSD for their SQL Server any more than I would let my kids play with knives.

  10. We have some customers using SSDs, especially where they’re the only option (e.g. casino software that takes years literally to get approval from states to make changes), plus a few banks, and some smaller customers. Usually we can tune out the performance issues people are seeing way more cheaply than just blindly recommending SSDs to solve perf issues – teach a man to fish rather than giving them a fish and all that.

    1. Hi Paul, I know this is a older article, but seeing that it is 2016 now, I was wondering if you would agree, disagree, or maybe qualify this article as useful depending on the situation. In my case most the of the SQL servers I have to support are virtual servers. Your thoughts?

  11. Hi Jonathon,
    Great article. It really helped me consolidation of indexes.
    I am having one little problem. Some of my queries are truncated.
    Is it the limitation of DMV?
    I am on SQL Server 2008 R2.

    Thanks again for this great piece of work.

    Regards,
    Najm

  12. Hi, Jonathan, i am not quite understand,reason to divide DataCachesize by 4 and multiply with 300 .
    so what you get output in seconds, am right.
    Can you explain little more to digest this formual?

  13. Hi Jonathan, I’ve got the same question as Sreeni. If i understand correctly: I have 34663 MB of Cached size(according to your script), So 34663/4 * 300 = 2599725 sec? would be the number PLE should be. That in hours is 722 or ruffly 30 day’s. Isn’t that a little over the top to shoot for?

  14. We have 256GB memory on some of our servers and that means the PLE threshold value > 5 hrs, which seems too much to define as a threshold below which we start worrying about memory. Is it time to revisit the formula given in this article and probably set an upper limit to the threshold?

    1. Hey Rajeev,

      As I stated in the last line, “monitor the steady-state value of PLE and react when it dips *and stays* below your steady-state value.” Aside from that, benchmark your I/O subsystems peak throughput capabilities at your average worklaod block sizes and then calculate out what PLE would be to cause that much I/O and then adjust up from there for something reasonable for you specific environment. The point is that fixed numbers are horrible for something like PLE because they don’t scale to different hardware sizes or workload requirements.

  15. Great Article! Thanks for sharing this. By the way, I have used the following query to calculate the Cached Data Plan on a specific SQL instance; the query breaks it down by database. If you want the total just add up the numbers.


    Execute sp_msforeachdb @command1 = '
    Select
    ''?'' As DatabaseName,
    DataCacheSizeInMegs = (Sum(cached_pages_count) * 8.00)/1024.00
    From (
    SELECT
    COUNT(*) AS cached_pages_count,
    name AS BaseTableName,
    IndexName,
    IndexTypeDesc
    FROM ?.sys.dm_os_buffer_descriptors AS bd
    INNER JOIN(
    SELECT
    s_obj.name,
    s_obj.index_id,
    s_obj.allocation_unit_id,
    s_obj.OBJECT_ID,
    i.name IndexName,
    i.type_desc IndexTypeDesc
    FROM(
    SELECT
    OBJECT_NAME(OBJECT_ID) AS name,
    index_id,
    allocation_unit_id,
    OBJECT_ID
    FROM ?.sys.allocation_units AS au
    INNER JOIN ?.sys.partitions AS p ON au.container_id = p.hobt_id AND (au.TYPE = 1 OR au.TYPE = 3)
    UNION ALL
    SELECT
    OBJECT_NAME(OBJECT_ID) AS name,
    index_id,
    allocation_unit_id,
    OBJECT_ID
    FROM ?.sys.allocation_units AS au
    INNER JOIN ?.sys.partitions AS p ON au.container_id = p.partition_id AND au.TYPE = 2
    ) AS s_obj
    LEFT JOIN ?.sys.indexes i ON i.index_id = s_obj.index_id AND i.OBJECT_ID = s_obj.OBJECT_ID
    ) AS obj ON bd.allocation_unit_id = obj.allocation_unit_id
    WHERE database_id = DB_ID(''?'')
    GROUP BY name, index_id, IndexName, IndexTypeDesc
    ) Res
    '

  16. I knew I wasn’t crazy when I thought that PLE of 300 was really low for 1TB of RAM. Do you have any recommendations for systems with more than 500GB of RAM? When I use your formula (1000GB/4GBx300) I get 75,000 seconds (a little less than 21 hours), which seems like a lot to expect from a system these days even with 1TB of RAM. *I’m rounding the 1TB, SQL gets about ~950GB just easier on the math*

    Thanks, this is great info.

  17. Hi Jonathan,

    Great Article about PLE threshold. We are using Windows Server 2003 R2 and SQL Server 2005 with 32 GB Ram and 20600MB has been given to SQL Server MAX memory. Data Cache is taking around 15GB. How much would be the PLE threshold in my case. Thanks in advance.

    Thanks,
    Pavan

  18. Dear Jonathan sir
    I have SQL Server 2005(I am beginner level) and I am getting errors when i run the script to determine queries that use a specific index. Could you please share the script for SQL 2005.

    Thanks and Regards,
    Chaitanya

  19. Hi, I know this is an old post, but I am a small question. Documentation for Page life expectancy says “remain in memory without being referenced”.

    What does the words being referenced means? Does it mean that if I have 10.000 pages in memory and a PLE of 300 (the NOT word), then all 10.000 pages are flushed within 300 seconds, or does it mean that if we have 1.000 pages that are not used, will be flushed within 300 seconds?

    Thanks in advance.

    1. The buffer pool uses the LRU-K algorithm to track the last K times a page was referenced. In SQL Server K = 2, so it knows the last two times a page was used in the buffer pool and it compares the TLA (time of last access) against the system wide threshold that is based on the current workload and demand for memory and that determines if a page is going to be evicted from the buffer pool or not. If the workload is stable and no data is being read from disk, there is no adjustment of the system wide threshold for TLA and pages will remain until the buffer pool needs BUF structures on the free list and begins removing stale (unused) pages based on the TLA and current threshold.

  20. In the formula (DataCacheSizeInGB/4GB*300) does the ‘4GB’ represent a static value or is it dependent on some factor?

    1. It’s just a reference point to show that the 300 value should shift. What is normal and when do you see deviations from normal? That’s the best way to monitor a system rather than applying some arbitrary number to it that doesn’t have a solid basis. How fast is the I/O subsystem? How low does PLE have to drop to before the I/O subsystem is under pressure and taxed for the environment?

  21. Jonathan, in your article (https://thomaslarock.com/2012/05/are-you-using-the-right-sql-server-performance-metrics/), you reference a better metric “BUFFER POOL I/O RATE”.

    SELECT (1.0*cntr_value/128) /
    (SELECT 1.0*cntr_value
    FROM sys.dm_os_performance_counters
    WHERE object_name like ‘%Buffer Manager%’
    AND lower(counter_name) = ‘Page life expectancy’)
    AS [BufferPoolRate]
    FROM sys.dm_os_performance_counters
    WHERE object_name like ‘%Buffer Manager%’
    AND counter_name = ‘database pages’

    –BuffBuffer pool I/O rate

    “Because if I have a “typical” server with 56GB of RAM available for the buffer pool (thank you, locked pages in memory) and I want to keep my pages around for an hour or so (3600 seconds), then I come up with 56000MB/3600sec, or about 15.5 MB/sec. That’s why I look for a sustained rate of about 20 on average, and if I have a spike upward from there then I know I am having memory pressure (pressure that might otherwise fail to be seen if I only examine the PLE counter).”

    I like that formula… Buffer Pool Size MB/3600 seconds gives rate MB/sec.

    Does that still rate as a good formula today?

  22. can the query be run in SQLCMD?

    SELECT stmt.value(‘(@StatementText)[1]’, ‘varchar(max)’) AS SQL_Text, obj.value(‘(@Database)[1]’, ‘varchar(128)’) AS DatabaseName FROM sys.dm_exec_cached_plans AS cp CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS qp CROSS APPLY query_plan.nodes(‘/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple’) AS batch(stmt) CROSS APPLY stmt.nodes(‘.//IndexScan/Object[@Index=”toto_id”]’) AS idx(obj)
    go
    Msg 1934, Level 16, State 1, Server EC2AMAZ-DQIQH2H, Line 1
    SELECT failed because the following SET options have incorrect settings: ‘QUOTED_IDENTIFIER’. Verify that SET

    Thanks
    Kyle

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.