(Been a few days since I posted – had some real work to do :-) Today I'll post a few things from the queue that's been building up)
This is part Q&A and part follow-on from my last post about running index maintenance when a database is mirrored.
A customer has a maintenance plan that involves running regular ALTER INDEX … REORGANIZE on a 100GB clustered index to remove fragmentation. Three weeks ago they added database mirroring, with the database setup for synchronous mirroring. Every so often, they see the state of the mirror change from SYNCHRONIZED to SYNCHRONIZING and then a bit later back to SYNCHRONIZED. What's going on? Once a synchronously-mirrored database is synchronized, it should ever get out of sync, right?
Well not quite – if the communication link between the principal and the mirror is broken, then the mirror becomes unsynchronized. The exact behavior in this situation depends on how mirroring is setup and what's failed:
- If there's no witness instance, then transactions will continue on the principal database but the transaction log starts to grow, because the transactions can't be cleared from the principal's log (even after a log backup) until they've been sent to the mirror. The database is running 'exposed'. Once the link is reestablished, the mirror while synchronize again.
- If there's a witness, and the witness can still talk to the principal, then everything continues as in #1
- If there's a witness, and the communication link between it and principal is also broken, the the principal will stop serving the database – transactions will stop. In this case, if the mirror and the witness can still see each other, then a failover will occur.
There are some great Books Online entries that describe all of this – see http://msdn2.microsoft.com/en-us/library/ms179344.aspx to start with.
The customer had situation #1. Every so often the mirror would change state and it seemed to coincide with the defrag job. Looking in the error log shows messages like:
2007-10-24 11:43:36.21 spid23s Error: 1474, Severity: 16, State: 1.
2007-10-24 11:43:36.21 spid23s Database mirroring connection error 2 'Connection attempt failed with error: '10060(A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)'.' for 'TCP://roadrunnerpr.sqlskills.com:5022'.
So the network link was dying sometimes when the defrag was running – that explains the switch between SYNCHRONIZED and SYNCHRONIZING. Why the network link was dying is still under investigation but it seems like the additional transaction log generated by the defrag job was causing the network to become overloaded and some component of it wasn't behaving correctly under load.
There are a few things to learn from this:
- Not only do you need to make sure that your IO subsystem can handle the load on it correctly, you also need to make sure your network can handle the load on it. There are a bunch of tools available to stress-test network paths – one simple one is TrafficEmulator.
- When you're running on your test system before going into production, make sure you test *everything* as if you were running in production – including maintenance jobs because they can add significant load to a production system.
- When you implement an HA solution such as mirroring, consider all the ways that transaction log will be generated when figuring out the required network bandwidth to support your HA configuration – something like a defrag or rebuild can cause an enormous spike in log generation