In the last 48 hours I have seen two different people having the exact same problem so I thought that I would go about blogging some code I’ve had lying around for a while and been meaning to blog that would help them with finding the root cause of their problems. In both cases, the question started because the person asking it noticed that Page Life Expectancy was below 300, which is based on dated information published by Microsoft before I began working with SQL Server, and consequently is something I’ll discuss a little at the end of this blog post. In the first case I started asking questions about the amount of memory installed, and what was using the data cache in the server, (see Paul’s blog post Performance issues from wasted buffer pool memory). In the second case this information had already been presented and as a matter of the troubleshooting of both of the problems it was noted that a single index was using a majority of the data pages in the buffer cache on the servers and the question became, “How do I find out what is using this index?” and a solution that would help them with figuring out what queries are using a specific index in a database.
Another Case for Querying the Plan Cache?
In the past I have written a number of scripts that can be used to find interesting information in the plan cache of SQL Server and this is just another one of those. I originally wrote this back at the end of March when my good friend Thomas LaRock (Blog|Twitter) sent me an email asking if I would mind him providing my contact information to someone that had a question about Extended Events. What the person wanted to do was identify all the queries that were using a specific index using Extended Events, but it turned out this isn’t possible in Extended Events. Bummer… However I could offer an alternate solution that involved querying the plan cache using XQuery and then monitoring what was found over time to determine a majority of the queries that use a specific index.
SET TRANSACTION ISOLATION LEVEL READ UNCOMMITTED
DECLARE @IndexName AS NVARCHAR(128) = 'PK__TestTabl__FFEE74517ABC33CD';
— Make sure the name passed is appropriately quoted
IF (LEFT(@IndexName, 1) <> '[' AND RIGHT(@IndexName, 1) <> ']') SET @IndexName = QUOTENAME(@IndexName);
–Handle the case where the left or right was quoted manually but not the opposite side
IF LEFT(@IndexName, 1) <> '[' SET @IndexName = '['+@IndexName;
IF RIGHT(@IndexName, 1) <> ']' SET @IndexName = @IndexName + ']';
— Dig into the plan cache and find all plans using this index
stmt.value('(@StatementText)', 'varchar(max)') AS SQL_Text,
obj.value('(@Database)', 'varchar(128)') AS DatabaseName,
obj.value('(@Schema)', 'varchar(128)') AS SchemaName,
obj.value('(@Table)', 'varchar(128)') AS TableName,
obj.value('(@Index)', 'varchar(128)') AS IndexName,
obj.value('(@IndexKind)', 'varchar(128)') AS IndexKind,
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_query_plan(plan_handle) AS qp
CROSS APPLY query_plan.nodes('/ShowPlanXML/BatchSequence/Batch/Statements/StmtSimple') AS batch(stmt)
CROSS APPLY stmt.nodes('.//IndexScan/Object[@Index=sql:variable("@IndexName")]') AS idx(obj)
OPTION(MAXDOP 1, RECOMPILE);
Now I have been asked this before so I will go ahead and defend it ahead of time. I use transaction isolation level read uncommitted when querying the plan cache just to make sure I don’t cause any problems because this can be a long running query depending on the size of the plan cache and I don’t care if I get some phantom plan as a result of being read uncommitted. I also use MAXDOP 1 for the query because it tends to be a CPU resource hog, and a lot of times I am running this on a production system or telling someone to run this on a production system and it has the potential for impacting performance. If your server is considerably beefy hardware wise, it probably won’t matter, and for most people it doesn’t negatively impact their environment, but like anything else, I apply a rule of “First do no harm” to something that might be run against a production system.
Where have I used this code since originally writing it back in March? All over the place, but generally when working with index tuning so I can find what is using a specific index in the system to evaluate what impact changing that index or removing it might have to the environment. It is also useful for troubleshooting the problem that a specific index is using all of the buffer pool and causing buffer pool flushing to occur on a constant basis resulting in a low Page Life Expectancy for the server. Using this query you can easily find the plans and statements that have been using this index and begin working to review the query stats information to find out where the problem is in the code, design, or indexing to try and improve the situation.
What’s Wrong about Page Life Expectancy >= 300?
Aside from being a performance metric that was recommended by Microsoft that has never been updated to reflect the changes that have occurred in hardware over the last 12 years, there is nothing wrong with this recommendation (i.e. sarcasm… :-)). The problem with this fixed value is that it was determined when servers generally had 4GB of RAM installed in them, and servers with 16GB+ of RAM installed in them were extremely expensive and fairly rare.
This performance counter tells you the number of seconds, at the current point in time, a page will remain in memory without being referenced (so a value of 300 means your buffer pool is flushing every 5 minutes). So 10 years ago when you were reading anywhere from 1.7GB up to 12GB of data (depending on your server’s memory) from disk into the buffer cache every 5 minutes it was a sign of memory pressure on the server and something you needed to investigate.
Fast forward to today, where it is not uncommon for a SQL Servers to have anywhere from 48-144GB+ RAM installed in them. These RAM values equate to 32-132GB of buffer cache size depending on the ‘max server memory’ sp_configure option setting and the amount of memory being used for the plan cache, but you probably get the point by now. If reading 1.7GB-12GB of data every 5 minutes was bad, how bad would it have to be to read 32GB-132GB of data from disk every 5 minutes consistently?
Today the value 300 is ridiculously small as a threshold for when to start worrying about buffer pool pressure.
Now from my own experiences as a DBA and over the past 6 months as a consultant, the I/O subsystem is one of the most undersized components in a majority of SQL Server implementations, so the last thing I want to be doing is hammering the disks in my server because I relied on a performance counter that was out of date years ago but never updated. For the last 3-4 years I have relied on the amount of memory being used by the data cache in SQL Server to determine when Page Life Expectancy was a sign of impending problems. This means I replaced the old 300 threshold with a more reasonable value of (DataCacheSizeInGB/4GB *300) as a basis for when to begin investigating things.
Keep in mind that short drops in Page Life Expectancy can be the result of a large query executing and are not a sign of problems in the system (though it may be a sign that you have a tuning opportunity).
Summary: don’t use the value 300 as any kind of Page Life Expectancy threshold. Anyone that continues to recommend doing so is doing you a disservice. Use an adaptive formula like (DataCacheSizeInGB/4GB*300). Even better – monitor the steady-state value of PLE and react when it dips *and stays* below your steady-state value.