There are two pretty well-known I/O errors – 823, and 824 – but there's also one called 825 which most DBAs do*not* know about, and definitely should.
From SQL Server 2005 onwards, if you ever see an 823 or 824, SQL Server has actually tried that I/O a total of 4 times before it finally declares a lost cause and surfaces the high-severity I/O error to the connection's console, killing the connection into the bargain. The idea behind this read-retry logic came from Exchange, where adding the logic reduced the amount of immediate downtime that customers experienced. While in concept this was something I agreed with at the time, I didn't agree with the way it was implemented.
If the I/O continues to fail, then the 823/824 is surfaced – that's fine. But what if the I/O succeeds on one of the retries? No high-severity error is raised, and the query completes, blissfully unaware that anything untoward happened. However, something *did* go badly wrong – the I/O subsystem failed to read 8KB of data correctly until the read was attempted again. Basically, the I/O subsystem had a problem, which luckily wasn't fatal *this time*. And that's what I don't like – the I/O subsystem went wrong but there are no flashing lights and alarm bells that fire for the DBA, as with an 823 or 824. If read-retry is required to get a read to complete, the only notification of this is a severity-10 informational message in the error log – error 825. It looks like this:
Msg 825, Level 10, State 2, Line 1.
A read of the file ‘D:\SQLskills\TestReadRetry.mdf’ at offset 0×0000017653C000 succeeded after failing 2 time(s) with error: incorrect checksum (expected: 0×4a224f20; actual: 0×2216ee12). Additional messages in the SQL Server error log and system event log may provide more detail. This error condition threatens database integrity and must be corrected. Complete a full database consistency check (DBCC CHECKDB). This error can be caused by many factors; for more information, see SQL Server Books Online.
Now, what this message is really saying is that your I/O subsystem is going wrong and you must do something about it. And unless you're regularly scanning the error log looking for these, you'll be none-the-wiser.
So – my recommendation is that you add a specific Agent alert for error 825, along with your other alerts (see this blog post).