There are a huge number of best practices around SQL Server performance tuning – I could easily write a whole book on the topic, especially when you consider the number of different database settings, server settings, coding practices, wait types, and so on that can affect performance. For this post I decided to step back a bit from a list of specifics and list some general recommendations for how to approach performance tuning so that you maximize effort and minimize distractions.
1) Don’t Assume the Symptom Is the Root Cause
Many DBAs and developers tend towards what I call ‘knee-jerk performance troubleshooting’, where a minimal amount of analysis and investigation is performed and the assumption is made that the most prevalent symptom of poor performance must be the root cause. When effort is then spent trying to address the supposed root cause, the result is often a lot of wasted time, and frustration when the mitigation efforts don’t help the situation.
My favorite example of this, and a problem I’m sure you’ve all had, is when average disk latency is high. The classic knee-jerk reaction is that it must be the I/O subsystem that has a problem, so the company spends money on a better I/O subsystem. The problem goes away for a little while and then comes back again, because the problem is not the hardware itself, but something happening within SQL Server.
For a case like this, it’s generally better to take a mental step back and ask why SQL Server is overloading the I/O subsystem, or more precisely, why SQL Server is doing so many physical reads. There are many reasons this could be happening, such as (but not limited to):
- An inefficient query plan doing a large, parallel table scan instead of using a nonclustered index because of something like a missing index, or implicit conversion, or out-of-date statistics
- Memory pressure from the OS on the buffer pool (meaning there isn’t enough space to hold the usual ‘working set’ of database pages)
It always pays to do some investigation instead of jumping to a quick conclusion on the root cause.
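As a sketch of what that investigation might look like, the following query uses sys.dm_exec_query_stats to find the cached queries doing the most physical reads, which often points at the real culprit. This is just one illustrative starting point, and it only reflects plans currently in the plan cache:

SELECT TOP (10)
    [qs].[total_physical_reads],
    [qs].[execution_count],
    [qs].[total_physical_reads] / [qs].[execution_count] AS [avg_physical_reads],
    SUBSTRING ([est].[text], ([qs].[statement_start_offset] / 2) + 1,
        ((CASE [qs].[statement_end_offset]
            WHEN -1 THEN DATALENGTH ([est].[text])
            ELSE [qs].[statement_end_offset]
          END - [qs].[statement_start_offset]) / 2) + 1) AS [Statement]
FROM sys.dm_exec_query_stats AS [qs]
CROSS APPLY sys.dm_exec_sql_text ([qs].[sql_handle]) AS [est]
ORDER BY [qs].[total_physical_reads] DESC;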
2) Determine the Scope of the Problem
It’s extremely important to figure out what the scope of the problem is, as that determines how you’ll go about investigating the problem, what metrics to gather, and what scripts and tools to use. For instance, being asked to investigate stored procedure XYZ which takes twice as long to run as it usually does is very different from being asked to tune all long-running stored procedures.
Stored procedure metrics can be obtained by running the query in Management Studio, and noting duration, CPU, and I/O statistics. That information can also be obtained from the plan cache, and you can also leverage the plan cache when you need to find the longest-running stored procedures. The following query, adapted from the popular set of DMV scripts here, lists the slowest 25 procedures, based on average duration:
SELECT
TOP (25) [p].[name] AS [SP Name],
[eps].[min_elapsed_time],
[eps].[total_elapsed_time] / [eps].[execution_count] AS [avg_elapsed_time],
[eps].[max_elapsed_time],
[eps].[last_elapsed_time],
[eps].[total_elapsed_time],
[eps].[execution_count],
ISNULL ([eps].[execution_count] /
DATEDIFF (MINUTE, [eps].[cached_time], GETDATE ()), 0) AS [Executions/Minute],
FORMAT ([eps].[last_execution_time],
'yyyy-MM-dd HH:mm:ss', 'en-US') AS [Last Execution Time],
FORMAT ([eps].[cached_time],
'yyyy-MM-dd HH:mm:ss', 'en-US') AS [Plan Cached Time]
-- ,[qp].[query_plan] AS [Query Plan] -- Uncomment if you want the query plan
FROM sys.procedures AS [p] WITH (NOLOCK)
INNER JOIN sys.dm_exec_procedure_stats AS [eps] WITH (NOLOCK)
ON [p].[object_id] = [eps].[object_id]
CROSS APPLY sys.dm_exec_query_plan ([eps].[plan_handle]) AS [qp]
WHERE
[eps].[database_id] = DB_ID ()
AND DATEDIFF (MINUTE, [eps].[cached_time], GETDATE()) > 0
ORDER BY [avg_elapsed_time] DESC
OPTION (RECOMPILE);
There are also tools like the Top SQL functionality in SolarWinds SQL Sentry that can help identify highest impact and highest resource using queries.
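For the narrow case of a single stored procedure, the duration, CPU, and I/O statistics mentioned above can be captured directly in Management Studio with the SET STATISTICS options (the procedure name here is just the hypothetical XYZ from the scenario above):

SET STATISTICS TIME ON;  -- reports CPU time and elapsed time
SET STATISTICS IO ON;    -- reports logical/physical reads per table
EXEC [dbo].[XYZ];        -- the stored procedure being investigated
SET STATISTICS TIME OFF;
SET STATISTICS IO OFF;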
3) Define the Goal for Success
Once you have the scope of the problem, the next step is to determine the goal of the performance tuning effort, so you know when you’ve achieved success and can move on to another task. Don’t allow the goal to be something undefined and open-ended like ‘stored procedure XYZ needs to be faster’; it needs to be well-defined, such as ‘stored procedure XYZ needs to run at the speed it did before, i.e. at 50% of the current elapsed time’.
Sometimes the investigation will be a bit more involved if the scope is wider, requiring capturing metrics and information over time before any analysis and mitigation can start. For instance, one of the first consulting clients I worked with had a somewhat open-ended goal for me which was, paraphrasing, ‘tempdb runs out of space once a week, and we need it not to do that’, without any idea why. The investigation involved me setting up two SQL Agent jobs: one running every 10 seconds to look for large uses of tempdb and log information to a table, and another running once an hour to email me any results from the previous hour. The general code I wrote to find space-hogs in tempdb is below:
-- InternalMB/Pages: worktables (cursors, spools), workfiles (hash joins), sorts
-- UserMB/Pages: everything else
--
SELECT
GETDATE () AS [Date],
[tsu].[session_id] AS [SessionID],
[tsu].[exec_context_id] AS [ExecContextID], -- anything over 0 means parallelism
([tsu].[user_objects_alloc_page_count] -
[tsu].[user_objects_dealloc_page_count]) AS [UserPages],
ROUND (CONVERT (FLOAT, ([tsu].[user_objects_alloc_page_count] -
[tsu].[user_objects_dealloc_page_count]) * 8) / 1024.0, 2) AS [UserMB],
([tsu].[internal_objects_alloc_page_count] -
[tsu].[internal_objects_dealloc_page_count]) AS [InternalPages],
ROUND (CONVERT (FLOAT, ([tsu].[internal_objects_alloc_page_count] -
[tsu].[internal_objects_dealloc_page_count]) * 8) / 1024.0, 2) AS [InternalMB],
[er].[plan_handle] AS [Plan],
[est].[text] AS [Text]
FROM sys.dm_db_task_space_usage [tsu]
JOIN sys.dm_exec_requests [er]
ON [er].[session_id] = [tsu].[session_id]
CROSS APPLY sys.dm_exec_sql_text ([er].[sql_handle]) [est]
/*
WHERE
-- Optionally, filter by a size limit
-- E.g., the 16384 is 128MB in 8KB pages
(([user_objects_alloc_page_count] - [user_objects_dealloc_page_count]) +
([internal_objects_alloc_page_count] - [internal_objects_dealloc_page_count])) >= 16384
*/
ORDER BY
(([user_objects_alloc_page_count] - [user_objects_dealloc_page_count]) +
([internal_objects_alloc_page_count] - [internal_objects_dealloc_page_count])) DESC;
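To turn that query into the periodic collection described above, the first Agent job can simply insert the results into a logging table. A minimal sketch is below; the table name and chosen columns are hypothetical, not the exact setup I used:

-- Hypothetical logging table for the tempdb space investigation
CREATE TABLE [dbo].[TempdbSpaceLog] (
    [Date]          DATETIME       NOT NULL,
    [SessionID]     SMALLINT       NOT NULL,
    [ExecContextID] INT            NULL,
    [UserMB]        FLOAT          NOT NULL,
    [InternalMB]    FLOAT          NOT NULL,
    [Text]          NVARCHAR (MAX) NULL
);
GO
-- Agent job step, scheduled every 10 seconds: prefix the diagnostic
-- query with an INSERT so results accumulate over time
INSERT INTO [dbo].[TempdbSpaceLog]
    ([Date], [SessionID], [ExecContextID], [UserMB], [InternalMB], [Text])
SELECT
    GETDATE (),
    [tsu].[session_id],
    [tsu].[exec_context_id],
    ROUND (CONVERT (FLOAT, ([tsu].[user_objects_alloc_page_count] -
        [tsu].[user_objects_dealloc_page_count]) * 8) / 1024.0, 2),
    ROUND (CONVERT (FLOAT, ([tsu].[internal_objects_alloc_page_count] -
        [tsu].[internal_objects_dealloc_page_count]) * 8) / 1024.0, 2),
    [est].[text]
FROM sys.dm_db_task_space_usage [tsu]
JOIN sys.dm_exec_requests [er]
    ON [er].[session_id] = [tsu].[session_id]
CROSS APPLY sys.dm_exec_sql_text ([er].[sql_handle]) [est];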
4) Understand the Limitations
Before you start proposing or making changes, it’s important to know if there are any things you simply cannot do. Here are some examples:
- If the application is written by a vendor, you’re not going to be able to make code changes to improve performance
- If the application is written by a vendor, you might not even be able to add or change indexes without voiding the vendor’s support agreement
- You might not be able to change a setting like MAXDOP or parameter sniffing for the whole server, which may mean using an ALTER DATABASE SCOPED CONFIGURATION option for just one database
Even if you can change code, there may be a lengthy testing process which prevents a change from being immediately implemented, so you may need to pursue alternative solutions (potentially short-term) to quickly fix the problem.
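For instance, if the server-wide MAXDOP setting is off-limits, the database-scoped equivalents mentioned above look like the following (run in the context of whichever database has the problem workload; the value 4 is just an example):

-- Limit parallelism for just the current database, leaving the
-- server-wide MAXDOP setting untouched
ALTER DATABASE SCOPED CONFIGURATION SET MAXDOP = 4;

-- Similarly, parameter sniffing can be disabled for just this database
ALTER DATABASE SCOPED CONFIGURATION SET PARAMETER_SNIFFING = OFF;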
5) Change One Thing at a Time
One of the most confusing things to do when performance tuning is to make multiple changes at the same time, as then you won’t know which change had an effect, or whether multiple changes cancelled each other out. Always change one thing at a time and keep a note of what you changed and what effect it had, if any. Also, if a change doesn’t have any effect then revert the change so that it doesn’t become a complication if the workload evolves at a later date.
6) Do Not Test in Production
One of the worst things to do when performance tuning is to make changes directly in production, as that can lead to dire consequences for the workload and business if a change creates a huge negative effect. This means you need a separate test/QA environment that can be used to evaluate changes under production workload conditions, or as close to it as possible. And that leads nicely into the next point…
7) Understand How Test Compares to Production
If your test system doesn’t compare to production then you may not see the same change in performance in production as you do in test. Classic examples of this include:
- A production system with a certain number of CPUs (e.g. four 8-core processors) and a lower-powered test system to save money (e.g. four quad-core processors)
- Along the same lines, test system having a lot less memory than production, or a different NUMA configuration, or a lower-rated storage subsystem
- Test system only having a subset of the production data to test with
- Test system not being able to simulate the production workload
All of these things can result in the test system producing different query plans, or the workload in test having very different characteristics than in production. This means you’ll be performance tuning for a different workload and environment and the efficacy of the changes may not translate to the production environment.
8) Understand the Implications of the Change
After you’ve determined what the necessary change is, you need to consider what wider effect, if any, making that change will have. For example, if you need to change MAXDOP or the cost threshold for parallelism, that will flush the plan cache, and you might run the risk of parameter-sensitive queries recompiling with sub-optimal plans.
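As an example of such a change, setting the cost threshold for parallelism server-wide is done through sp_configure, and the plan cache flush happens when the change takes effect (the value 50 here is purely illustrative):

-- Change the cost threshold for parallelism.
-- Note: this flushes the plan cache, so expect a burst of
-- recompilations (and possibly changed plans) afterwards.
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure 'cost threshold for parallelism', 50;
RECONFIGURE;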
Other changes might be more environmental, like offloading parts of a query workload to a readable secondary in an availability group. That can lead to index fragmentation issues on the primary database, which can be a performance problem of its own (as I described in this SQLPerformance.com post).
You don’t want to solve one performance problem and end up with an unexpected different problem to then have to solve.
9) Create a Rollback Plan
It’s very important that you have a complete log of what’s been changed and have the ability to revert the changes if something goes wrong. This means preserving original copies of all code and schema and ideally having a script you can run to quickly roll back the changes.
If this would be hard to do, and would really entail restoring the database from backups, one thing to consider is creating a database snapshot of the database and keeping it around for a few days. A database snapshot automatically keeps a pre-change copy of all data file pages changed since the time the database snapshot was created, and allows you to effectively put the database back to that point with a one-line T-SQL command. Internally, SQL Server does this by pushing the pre-change pages back into the real database – called ‘reverting the database to the database snapshot’.
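A sketch of creating and reverting to such a snapshot is below; the database name and file path are hypothetical:

-- Create a database snapshot before making any changes
CREATE DATABASE [SalesDB_PreChange] ON
    (NAME = N'SalesDB_Data', FILENAME = N'C:\Data\SalesDB_PreChange.ss')
AS SNAPSHOT OF [SalesDB];

-- If something goes wrong, revert the database to the snapshot state
RESTORE DATABASE [SalesDB] FROM DATABASE_SNAPSHOT = N'SalesDB_PreChange';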
10) Remove Diagnostic Elements from Production
Once you’ve finished the investigation and reached the performance tuning goal, make sure you remove all of the diagnostics that you implemented to help with the investigation, as they could cause performance problems themselves if left in place. This is especially true of Extended Event sessions, which can become ‘silent killers’ that use up a lot of CPU resources with no other clue that they are the problem.
You can see which Extended Event sessions are running using the following code:
SELECT
[ses].[name] AS [Session Name],
CASE
WHEN [xs].[address] IS NOT NULL THEN 'Running'
ELSE 'Stopped'
END AS [State],
[xs].[create_time] AS [Start Time]
FROM sys.server_event_sessions AS [ses]
LEFT OUTER JOIN sys.dm_xe_sessions AS [xs]
ON [ses].[name] = [xs].[name]
ORDER BY [State], [Start Time];
And if you’re on SQL Server 2025 and using Extended Events, there’s a new MAX_DURATION option you can use to ensure a diagnostic session stops running after a certain amount of time.
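Stopping and then removing a leftover session is straightforward (the session name here is hypothetical):

-- Stop the session, then remove its definition from the server
ALTER EVENT SESSION [MyDiagnostics] ON SERVER STATE = STOP;
DROP EVENT SESSION [MyDiagnostics] ON SERVER;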
Summary
You should always take a step-by-step approach to performance tuning rather than jumping right in and changing things haphazardly in production, and I hope this post has provided you with a simple framework you can put into practice. There’s a lot of code out there to help you with various investigations, plus free tools like Plan Explorer, which I can’t recommend enough. Happy tuning!