Ok – I’m on a roll today so to finish off I’d like to repost some info about database mirroring failover types (including how you may not actually get a failover when you expect it) and a tip for how to avoid unwanted failovers when combining clustering and mirroring. Mirroring failovers have cropped up several times in my blog’s search engine logs so this is an easy one to answer. This is based on a TechEd post from June on the Storage Engine blog.
Clustering, mirroring and altering the partner timeout
The first question is about combining failover clustering and database mirroring. A customer wanted to put the instance hosting the principal database on a failover cluster and have the mirror database on an unclustered server. The issue he saw with the setup was when something happened on the active cluster server and it failed over. The failover would take 60-90 seconds, which means the principal database would not be available during that time.
The problem stems from the way mirroring figures out whether to initiate a failover. The mechanism it uses is to have all SQL Server instances within a mirroring session send out ‘pings’ on each of its open mirroring connections every second. If an instance does not receive a ping from one of the other instances in the session within a defined time-out period (called the partner time-out), it considers the other instance as unavailable and takes the appropriate action. The default time-out period is ten seconds.
In the case above, when the cluster is failing over and the principal instance is unavailable for 60-90 seconds, the other instances in the mirroring session will time-out after ten seconds and initiate a mirroring failover. Although this is the correct behavior of mirroring, it is undesirable behavior for this customer. The solution to the problem is to change the partner time-out value – a facility that isn’t well known.
To change the partner time-out value for a mirroring session, use the following code:
ALTER DATABASE mydb SET PARTNER TIMEOUT 90; GO
The value at the end of the statement is the new partner timeout value in seconds. There are a couple of things to be aware of if you’re going to do this:
- You can only issue this statement on the principal server.
- Be very careful not to set the time-out value too low otherwise you run the risk of triggering failovers becuase of false-failures – especially on heavily-loaded systems that may not be able to respond within the time-out period. In fact, if you specify a time-out value of 4 seconds or lower, SQL Server will automatically set the time-out to 5 seconds to help avoid such problems.
- This only works if mirroring thinks it’s a soft failure – like a natural cluster failover. If you initiate a cluster failover, mirroring will failover. The only way to absolutely guarantee that mirroring will not failover is to remove the witness from the topology.
Database mirroring failover types
What are the different kinds of failures that can trigger mirroring failovers, and how quickly does the failover happen after the problem occurs? As with most questions I get, I can use my favorite answer of “It depends!” :-) Let’s look at some examples of failures and see how quickly the failover occurs, in decreasing order of speed.
- Fastest: The fastest possible failover occurs when the SQL Server instance crashes (so the mirroring connection endpoint no longer exists) but the operating system is still running. When the next ping comes from a partner instance, the OS knows that the network port (that was being used by the mirroring connection endpoint) is no longer being listened to and returns a failure. This immediately triggers a failover.
- Fast: The next fastest failover occurs when the machine hosting the SQL Server instance crashes or shuts down (e.g. power supply cord pulled out of the back). In this case, nothing happens until the partner time-out period has expired (as there’s no OS running to return the immediate failure) and then a failover will occur. By default this will be in ten seconds, but you can change this as I explained above.
- Slow: A slow failover is when something happens to the server but it takes a while for the database to go offline. An example of this would be someone pulling out the transaction log drive on the principal server. Writes to the transaction log will start to queue up. After 20 seconds SQL Server will issue an IO warning but it isn’t until 40 seconds has passed that SQL Server issues an IO failure and the database goes suspect. It’s not until this point that the failover occurs – even though the partner timeout value is 10 seconds!
- Maybe fast or not at all: This discussion all started by someone asking about how quickly a failover happens if a page checksum failure is detected. My answer was – it depends! If the page checksum failure happens during a query, then all that happens is that the checksum failure is reported, the query rolls back, and the connection is broken. It’s only if the page checksum failure happens during a transaction rollback that a failover will be triggered. This is because a transaction rollback failure means the database is in a transactionally inconsistent state and has to be set to suspect mode – which triggers a mirroring failover.
So, don’t assume that just because mirroring is setup that every failure will trigger a fast failover.