Troubleshooting Performance in Azure SQL

At some point, whether you’re a DBA, developer, or application administrator, you’re going to find yourself troubleshooting performance in Azure.  Within the Azure Portal you have the capability to look at performance for an Azure SQL via the Performance overview and Query Performance insight pages.

Checking Performance in the Azure Portal

Checking Performance in the Azure Portal

If you haven’t worked with SQL Server previously, this is a great place to start as it provides a high level view of system performance, broken out by DTUs, CPU, and data and log I/O.  You can dig into specific queries that might be causing issues through Query Performance Insight, including query text and resource use over time.

But if you’re familiar with troubleshooting in SQL Server, this method feels different, and slower.  Not to worry, many of the tools you’re familiar with still work in Azure SQL, you just need a few tweaks.

Wait Statistics

Whenever I’m dropping into a server – whether it’s one I’m familiar with or not – to look at a performance issue, I want to understand wait statistics.  Specifically, I want to know what wait statistics while the problem is occurring.  For this reason, I don’t just query sys.dm_os_wait_stats, I snapshot it, using Paul’s script from his post Capturing wait statistics for a period of time.

You can use this as-is in Azure SQL, but, since it queries sys.dm_os_wait_ stats, you’re not looking at wait statistics specific to your database, which is really what you care about.  So you have to change sys.dm_os_wait_ stats to sys.dm_db_wait_stats in the script.  Once you make that switch, you’re good to go – although depending on the issue you’re dealing with you may change the delay between snapshots to something lower than 30 minutes, which is what Paul has in the script.  I sample for either 5 and 10 minutes, depending on what else I want to check.  This also assumes the problem is occurring WHILE I am capturing wait statistics.

WhoIsActive

I think I’ve bought Adam Machanic drinks a few times, but at this point I feel like I owe him several dinners at high end restaurants.  While I’m waiting for the wait statistics snapshot to complete, I like to run WhoIsActive.  You want to grab that from his blog, rather than GitHub.  If you read the post, there was an issue with the original version of the script, fixed here in v2.  Note that when you create the stored procedure you want to do that in your user database.  It will fail if you try to create it in master.  Once it’s created, you can use all your favorite parameters for execution:

EXEC sp_WhoIsActive;   /* default, what's running right now */
GO

EXEC sp_WhoIsActive @find_block_leaders = 1;   /* list output with lead blocking sessions at the top */
GO

EXEC sp_WhoIsActive @get_plans = 1;   /* pull execution plans */

If you haven’t worked with WhoIsActive much, I’ll refer to you the documentation.  Between wait statistics and WhoIsActive, I usually get an idea of where to look next.  And this is where the next script comes in handy…

SQL Server Diagnostic Queries

You can get the current version of Glenn’s Diagnostic Queries for each release on his Resources page.  Note that Azure SQL and Managed Instances have separate scripts.  With the Diagnostics Queries in hand, I can dig deeper into what I think might be the issue, or I can just start rolling through the queries in order to check configuration and performance data.  I may end up on some tangents, depending on what I find, but these three scripts together provide a great starting point when troubleshooting performance in Azure, whether I have a complaint that the system is slow overall, or I’m told that a particular set of queries is slow. Happy troubleshooting!

Figuring Out When Wait Statistics Were Last Cleared

One of the topics covered during IE2 is Waits and Queues, and this week one of the attendees asked if there was a way to know when wait statistics were last cleared. It turns out there is a post about it on the SQLOS & Cloud Infrastructure Team Blog, and what you do is look for the SQLTRACE_INCREMENTAL_FLUSH_SLEEP wait_type in sys.dm_os_wait_stats. The value for wait_tims_ms is the time since wait stats were last cleared.

Since I talk about clearing wait stats in my DBCC session I decided this information would be useful to include, and then decided to just write a query to figure it out.  The query is below, as is a query to show the last time the instance was restarted. Edit: Note that this query is only for SQL Server 2008R2 and higher.  The SQLTRACE_INCREMENTAL_FLUSH_SLEEP wait_type did not exist in previous versions of SQL Server.  If I can find an alternate wait_type for use, I will post back.

Just to be clear, I am not advocating that you immediately clear wait stats on your server.  In fact, I’d be more interested to know whether wait stats have been cleared since the instance was last restarted.  Let me back up a bit…  If you’re not familiar with wait statistics, I suggest starting with Paul’s post, Wait statistics, or please tell me where it hurts.  Whether you have performance issues on your server or not, I highly recommend looking at wait stats regularly to understand what your “normal” waits are.  If you are having issues, waits can be a starting point to sorting out what’s causing the problem.  But wait stats should not be viewed in a vacuum – always use other counters or information to correlate findings.  Once you have a basic understanding of your wait stats, you may decide to capture them on a regular basis.  But wait stats are cumulative, and will only reset when you restart the instance, or if you clear them manually.  They can be cleared using:

DBCC SQLPERF ('sys.dm_os_wait_stats', CLEAR);

There are a variety of reasons to clear wait stats, but it shouldn’t be done arbitrarily.  You may clear them because you’ve made a change to the system and want to understand if and how that change affects wait statistics.  You may clear wait stats every morning at 5 AM because backups and other maintenance tasks run at night, but you want to understand performance during business hours when users are the system and therefore decide to filter out (or investigate separately) waits from those tasks.  In the latter case, you probably have a scheduled job to clear the wait stats, and other DBAs may or may not know about this job.  If another DBA decides to clear wait stats for some reason, without telling you, they may affect the wait information you’ve been capturing.  As such, I would always verify with colleagues whether or not wait stats are cleared regularly and this script will give you an idea of whether are being cleared or not.  However, even if they haven’t been cleared since the last restart, I would still perform some due diligence and verify it won’t alter anyone else’s data.  I am interested to hear different ideas for how this information could be used, so feel free to post back or email me.  And if you’re not currently looking at your wait stats regularly, please start!

/* when were wait stats last cleared? */
SELECT
[wait_type],
[wait_time_ms],
DATEADD(ms,-[wait_time_ms],getdate()) AS [Date/TimeCleared],
CASE
WHEN [wait_time_ms] < 1000 THEN CAST([wait_time_ms] AS VARCHAR(15)) + ' ms'
WHEN [wait_time_ms] between 1000 and 60000 THEN CAST(([wait_time_ms]/1000) AS VARCHAR(15)) + ' seconds'
WHEN [wait_time_ms] between 60001 and 3600000 THEN CAST(([wait_time_ms]/60000) AS VARCHAR(15)) + ' minutes'
WHEN [wait_time_ms] between 3600001 and 86400000 THEN CAST(([wait_time_ms]/3600000) AS VARCHAR(15)) + ' hours'
WHEN [wait_time_ms] > 86400000 THEN CAST(([wait_time_ms]/86400000) AS VARCHAR(15)) + ' days'
END [TimeSinceCleared]
FROM [sys].[dm_os_wait_stats]
WHERE [wait_type] = 'SQLTRACE_INCREMENTAL_FLUSH_SLEEP';

/* check SQL Server start time - 2008 and higher */
SELECT
[sqlserver_start_time]
FROM [sys].[dm_os_sys_info];