Over the last few months I’ve been lecturing at classes and conferences about getting SQL Server’s view of the I/O subsystem and what latencies it is experiencing, so time for a blog post to help everyone else.
Most SQL Server’s today are I/O bound – that’s a generally agreed-on statement by DBAs and consultants in the field. This means that the major factor in server performance is its ability to perform I/O operations quickly. If the I/O subsystem cannot keep up with the demand being placed on it, then the SQL Server workload will suffer performance problems.
Now, saying that, one trap that many people fall into is equating increased I/O subsystem latencies with poor I/O subsystem performance. This is often not the case at all. It’s usually the case that the I/O subsystem performs fine when the designed-for I/O workload is happening, but becomes the performance bottleneck when the I/O workload increases past the I/O subsystem design point. The I/O workload increase is what’s causing the problem, not the I/O subsystem – if you design an I/O subsystem to support 1000 IOPS (I/O operations per second) and SQL Server is trying to push 2000 IOPS, performance is going to suffer.
If you find that I/O latency has increased, look to a change in SQL Server behavior before blaming the I/O subsystem. For instance:
- Query plan changes from out-of-date statistics, code changes, implicit conversions, poor indexing that cause table scans rather than index seeks
- Additional indexes being created that cause increased index maintenance workload and logging
- Access pattern/key changes that cause page splits and hence extra page reads/writes and logging
- Adding change data capture or change tracking that causes extra writes and logging
- Enabling snapshot isolation that causes tempdb I/Os, plus potentially page splits
- Decreased server memory leading to a smaller buffer pool and increased lazy writer and read activity
And a whole host of other reasons can lead to an increased I/O workload where it’s not the I/O subsystem’s fault.
On the other hand, however, it may very well be the I/O subsystem that has an issue if the SQL Server workload is the same. A SAN administrator may have decided to give some of the space on one of the SQL Server LUNs to another server, which can lead to an overload.
Using performance counters, you can see at the physical disk level what the latencies are (looking at the Avg. Disk sec/Read and Avg. Disk sec/Write counters) but if you have your databases spread across a few LUNs, that doesn’t help you pinpointing which database files are experiencing latency issues and driving the most I/Os to the I/O subsystem.
This is where sys.dm_io_virtual_file_stats comes in. It was introduced in SQL Server 2005 as a beefed-up replacement for fn_virtualfilestats and shows you how many I/Os have occurred, with latencies for all files. You can give it a database ID and a file ID, but I found it most useful to look at all the files on the server and order by one of the statistics.
Here’s the default output (with some column names changed slightly to make things fit nicely):
SELECT * FROM sys.dm_io_virtual_file_stats (NULL, NULL); GO
db_id file_id sample_ms reads bytes_read io_stall_read_ms writes bytes_written io_stall_write_ms io_stall size_on_disk_bytes file_handle ----- ------- --------- ----- ---------- ---------------- ------ ------------- ----------------- -------- ------------------ ------------------ 1 1 14433009 52 3227648 58 10 90112 12 70 515964928 0x000000000000013C 1 2 14433009 55 1241088 36 28 122880 115 151 12648448 0x0000000000000474 2 1 14433009 22 1261568 22 4 147456 1 23 8388608 0x000000000000065C 2 2 14433009 6 385024 2 7 154112 3 5 524288 0x0000000000000660 3 1 14433009 31 1859584 18 1 8192 0 18 1310720 0x0000000000000524 3 2 14433009 6 385024 5 3 16384 1 6 524288 0x000000000000056C 4 1 14433009 139 8175616 369 13 106496 45 414 260767744 0x0000000000000958 4 2 14433009 136 712704 485 7 53248 25 510 104595456 0x00000000000009F0 5 1 14433009 20 1138688 204 0 0 0 204 419430400 0x0000000000000980 5 2 14433009 13 411136 466 1 2048 13 479 104857600 0x0000000000000A44 6 1 14433009 15 811008 81 1 8192 8 89 1310720 0x0000000000000954 6 2 14433009 7 386560 89 4 11264 80 169 516096 0x00000000000009EC 7 1 14433009 15 811008 80 1 8192 9 89 1310720 0x0000000000000970 7 2 14433009 7 386560 130 3 9216 17 147 516096 0x00000000000009B4 8 1 14433009 16 876544 37 1 8192 103 140 136577024 0x0000000000000988 8 2 14433009 9 229888 11 4 10752 59 70 5242880 0x00000000000009E4 . .
This isn’t that useful because a) I don’t have database IDs and file paths memorized, and b) it gives aggregate latencies (io_stall_read_ms and io_stall_write_ms).
What I usually do is use the script below – part of my standard set of scripts I use when doing a server health check for a client. It’s based in part on code from by good friend Jimmy May (blog|twitter), with a bunch of tweaks. It allows me to filter on read or write latencies and it joins with sys.master_files to get database names and file paths.
SELECT
--virtual file latency
[ReadLatency] =
CASE WHEN [num_of_reads] = 0
THEN 0 ELSE ([io_stall_read_ms] / [num_of_reads]) END,
[WriteLatency] =
CASE WHEN [num_of_writes] = 0
THEN 0 ELSE ([io_stall_write_ms] / [num_of_writes]) END,
[Latency] =
CASE WHEN ([num_of_reads] = 0 AND [num_of_writes] = 0)
THEN 0 ELSE ([io_stall] / ([num_of_reads] + [num_of_writes])) END,
--avg bytes per IOP
[AvgBPerRead] =
CASE WHEN [num_of_reads] = 0
THEN 0 ELSE ([num_of_bytes_read] / [num_of_reads]) END,
[AvgBPerWrite] =
CASE WHEN [io_stall_write_ms] = 0
THEN 0 ELSE ([num_of_bytes_written] / [num_of_writes]) END,
[AvgBPerTransfer] =
CASE WHEN ([num_of_reads] = 0 AND [num_of_writes] = 0)
THEN 0 ELSE
(([num_of_bytes_read] + [num_of_bytes_written]) /
([num_of_reads] + [num_of_writes])) END,
LEFT ([mf].[physical_name], 2) AS [Drive],
DB_NAME ([vfs].[database_id]) AS [DB],
--[vfs].*,
[mf].[physical_name]
FROM
sys.dm_io_virtual_file_stats (NULL,NULL) AS [vfs]
JOIN sys.master_files AS [mf]
ON [vfs].[database_id] = [mf].[database_id]
AND [vfs].[file_id] = [mf].[file_id]
-- WHERE [vfs].[file_id] = 2 -- log files
-- ORDER BY [Latency] DESC
-- ORDER BY [ReadLatency] DESC
ORDER BY [WriteLatency] DESC;
GO
ReadLatency WriteLatency Latency AvgBPerRead AvgBPerWrite AvgBPerTransfer Drive DB physical_name ----------- ------------ ------- ----------- ------------ --------------- ----- ----------------------- ------------------------------------------ 2 103 8 54784 8192 52043 D: HotSpot D:\SQLskills\HotSpot_data.mdf 7 48 13 63061 5632 54857 C: DemoCorruptSystemTables C:\SQLskills\corrupt_log.LDF 32 41 33 43861 2048 39680 D: DemoSystemIndex3 D:\SQLskills\DemoSystemIndex3_LOG.ldf 27 40 28 43861 1024 39577 D: DemoSystemIndex5 D:\SQLskills\DemoSystemIndex5_LOG.ldf 23 39 24 43861 7168 40192 D: DemoSystemIndex4 D:\SQLskills\DemoSystemIndex4_LOG.ldf 7 38 10 43861 4608 39936 D: DemoSystemIndex1 D:\SQLskills\DemoSystemIndex1_LOG.ldf 11 25 12 43861 7168 40192 C: FSWaits C:\SQLskills\FSWaits_log.ldf 12 20 13 55222 6656 49152 C: newdatabase C:\SQLskills\newdatabase_log.LDF 12 20 15 55222 2816 36165 C: LockResDemo C:\SQLskills\LockResDemo_log.LDF 14 17 15 45933 2816 30254 C: TempdbTest C:\SQLskills\TempdbTest_log.LDF 1 14 5 25543 2688 18510 C: HotSpot C:\SQLskills\HotSpot_log.ldf 40 14 39 31625 7680 29915 D: RecompileTest D:\SQLskills\RecompileTest_log.ldf 9 13 10 51858 7680 46336 D: DemoFatalCorruption1 D:\SQLskills\DemoFatalCorruption1_log.ldf 35 13 34 31625 2048 29513 D: GUIDTest D:\SQLskills\GUIDTest_log.ldf 12 13 12 16861 1024 15872 C: FNDBLogTest C:\SQLskills\FNDBLogTest_log.LDF . .
This is much more useful as it allows me to quickly see where the read and write hot spots are and then drill into a database to see what’s going on, and if nothing out of the ordinary, ask the SAN admin to move those hot spot files to dedicated and/or faster storage.
One question I’m sure you’re going to ask is “what is an acceptable read or write latency?” and the answer is a big “it depends!”. It depends on the I/O subsystem and how it’s been configured. The key is producing a performance baseline for when things are running acceptably well and then seeing where the results from the DMV deviate from your baseline. On well-configured storage which isn’t being overloaded I’d expect to see single-digit ms for either read or write latency, but this will vary based on the rotation speed and technology of the drives (SCSI vs SATA vs SSD etc).
Enjoy!
19 Responses to How to examine IO subsystem latencies from within SQL Server
WriteLatency = CASE WHEN io_stall_write_ms = 0
THEN 0 ELSE (io_stall_write_ms / num_of_writes) END
I think you want to test num_of_writes = 0 in this case.
Indeed – blame messing around with formatting. Thanks!
Hi,
Thanks for the script.
How can I find out "I/O subsystem configuration"? Do I need to ask the windows engineers who look after the server?
@Manjot Yes – you’ll need to ask whoever manages the I/O subsystem.
This will display the file name instead of outputting the full path:
SUBSTRING(mf.physical_name, len(mf.physical_name)-CHARINDEX(‘\’,REVERSE(mf.physical_name))+2, 100)
I find this script to be quite useful when capturing data in segments especially during a job execution and peak usage.
Hey Paul!
I blogged about this earlier in the year too. I’ve got a method to collect the data over time.
http://sqlserverio.com/2011/02/08/gather-virtual-file-statistics-using-t-sql-tsql2sday-15/
Hi,
Could be operations like DBCC CEHCKDB, REBUILD, REORGANIZE REBUILD INDEX, UPDATE STATISTICS, … spoil this measures ?.
Well they won’t spoil the measurement, but they will add a lot of extra I/O workload and that needs to be taken into account when designing your I/O subsystem. An alternative is to offload the CHECKDB workload to another system after restoring a full backup there.
I am quite interesting in this topic hope you will write more on it in future posts
[...] Wait statistics are the bread-and-butter of performance tuning. SQL Server is keeping track of what resources threads need to wait for, and how long they need to wait. By analyzing which resources (and combinations of resource) are being waited for the most, you can get an idea of where to start digging in further. An example might be that if most of the waits are PAGEIOLATCH_SH waits, and this wasn't the case in your wait stats baseline, you might look at the I/O subsystem performance using the sys.dm_io_virtual_file_stats DMV (which I blogged about here). [...]
hi Paul,
i am trying to estimate the total bytes/sec of throughput that i am getting from my sql server and am wondering if it is accurate to say that my total bytes/sec of throughput would be = num_of_bytes_written/io_stall_write_ms?
thank you
scorellis
No – dividing by sample_ms is the correct way, but that only gives you an average, not your peak throughput. You need to use a benchmarking tool to get that.
Hi
I often use the sys.dm_io_virtual_file_stats DMV to calculate Avg Service Time for a file and also for a drive (by aggregation).
Recently, i have compared DMV latency results (ReadLatency and WriteLatency for each drive) and perfmon counters (Avg. Disk sec/Read and Avg. Disk sec/Write)
It showed [Avg Time ms/Read] was more important in DMV results than perfmon counters.
what is the reason for that ?
Thanks
Not sure what you mean by ‘more important’? If you mean that the avg time was higher in the DMV than in perfmon, could be that there are other uses of that drive doing I/Os that will be taken into account by perfmon, but not by the DMV, which is only measuring I/O latency for those files managed by SQL on that drive. Remember also that the DMV is aggregated data since the database came online whereas perfmon is a rolling average – that could also make perfmon lower if there have been some spikes of high I/O latency in the past that the DMV is still taking into account.
Hello and thank you for your reply
In fact, I have a process that collects data from the dmv dm_io_virtual_file_stats every minute and that makes the difference between each collection.
I therefore compare the results of the dmv (select Drive, sum (io_stall_read_ms) / sum (num_of_reads) from group by dm_io_virtual_file_stats_diff Drive) with perfmon counters (Avg. Disk sec / Read)
From a general way, we have similar time but when high latency disk, the time seen through my collection “DMV” are more important than the perfmon data (20-40%).
I am sure there is no other activity on the monitored drive (just SQLServer activity)
Is it possible to explain this phenomenon?
Is that these 2 two types of counters (perfmon and DMV) is based on the same metric?
I hope they are not stupid questions
Thank you for your help
Jean-Philippe
NB : sorry for my english
They’re not actually the same metric. The DMV is measuring everything from SQL Server to the disk and back. The PerfMon counters are just measuring the time Windows sees for the I/O. There could things like filter drivers etc in the way. Any significant difference between them can be explained by something taking time between SQL Server and Windows issuing the I/O.
Dear Paul,
I am seeing lot of negative values in Latency value of the given query. Is that the indication of an IO issue? Also I am seeing lot of events like this in eventlog and sql error log. SQL Server has encountered 1 occurrence(s) of I/O requests taking longer than 15 seconds to complete on file. Please advice.
I don’t see how you can get negative values for latencies, unless the DMV is returning negative values for total latencies, which would be a bug. If you’re seeing I/O stall warnings in the error log, that’s usually a sign of problems with your I/O subsystem.
Hello,
Since the dmv gives cumulative numbers, I persist the results each hour and compare sequential results. This way we found out our latencies are “acceptable” during production hours, while they are terrible during night due to index rebuilds, integrity checks, database backups, system backups, etl processes etc.
While creating the view (eg ([io_stall_read_ms] – previous.[io_stall_read_ms]) / ([num_of_reads] – previous.[number_of_reads]) I also saw negative values, caused by the fact the services had been restarted (counters are reset) between 2 measurements. After testing on service restart using (SELECT crdate FROM sys.sysdatabases WHERE lower(NAME) = ‘tempdb’) I still saw some negative numbers. It seems to me this is caused by database restore, so now I just test whether bytesread and byteswritten are higher in the previous measurement. In that case I assume the values represent the first measurement, I can compare with.
Kind Regards
Peter.