(The Curious Case of… used to be part of our bi-weekly newsletter but we decided to make it a regular blog post instead so it can sometimes be more frequent. It covers something interesting one of us encountered when working with a client, doing some testing, or answering a random question from the community.)
This one’s from Jonathan, in his own words – enjoy!
Recently I received an email with a very vague description of a problem that went something along the lines of:
“After some issues with Windows patching and rolling back from snapshots, our cluster resources won’t come online. In the process of troubleshooting the issue and validating the fix, we rolled back a few times. We can’t find any details about the issue in the cluster logs, Windows event logs, or SQL Server error log to investigate further.”
To be honest, this is one of those emails that makes you scratch your head and pause for a moment. If you already picked up on the issue, I’m sure it makes complete sense.
First, let me say that reverting a SQL Server cluster node to a VM snapshot is not only unsupported, it's a notoriously bad idea. Trust me, I won't even do this for lab virtual machines that participate in a cluster and have all been snapshotted at the exact same point in time, along with an AD domain controller, in a completely quiescent environment. Instead, I have scripts that tear down and rebuild an entire configuration on the fly to reset my demo environment to a known-good state, including restoring database backups, resetting configuration changes, and rebuilding and synchronizing an Availability Group.
There are so many things that can go wrong when reverting a snapshot of a cluster node, and the entire point of having an HA configuration like a Failover Clustered Instance or an Availability Group is so that you don't have downtime associated with something like patching. It's a huge mistake to roll out patches to all of the servers in a cluster at the same time, or even in the same evening. We have a couple of clients that don't even patch their Availability Group replicas; they build new replicas that are already fully patched, join them to the WSFC, join them to the Availability Group, and then fail over. Once they know that all is well, they evict one of the original replicas, and rinse and repeat. The point is that they always have a failback plan, which ensures this sort of problem doesn't happen.
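To make that replace-and-fail-over approach concrete, here is a rough sketch of the T-SQL involved. The availability group, replica, endpoint, and seeding settings are illustrative placeholders, and joining the new server to the WSFC happens outside SQL Server (Failover Cluster Manager or the Add-ClusterNode PowerShell cmdlet), so treat this as an outline of the flow rather than a runbook:

```sql
-- On the current primary: add the new, fully patched replica to the AG
-- (all names here are made-up placeholders)
ALTER AVAILABILITY GROUP [AG1]
ADD REPLICA ON N'NEWNODE1'
WITH (ENDPOINT_URL      = N'TCP://NEWNODE1.corp.local:5022',
      AVAILABILITY_MODE = SYNCHRONOUS_COMMIT,
      FAILOVER_MODE     = MANUAL,
      SEEDING_MODE      = AUTOMATIC);

-- On the new replica: join the AG and, with automatic seeding, allow it
-- to create the databases locally
ALTER AVAILABILITY GROUP [AG1] JOIN;
ALTER AVAILABILITY GROUP [AG1] GRANT CREATE ANY DATABASE;

-- Once the new replica reports SYNCHRONIZED, fail over to it
-- (run this on the new replica)
ALTER AVAILABILITY GROUP [AG1] FAILOVER;

-- After confirming everything is healthy, evict one of the original replicas
ALTER AVAILABILITY GROUP [AG1] REMOVE REPLICA ON N'OLDNODE1';
```

With manual seeding instead of automatic, you would restore the databases WITH NORECOVERY on the new replica and run ALTER DATABASE … SET HADR AVAILABILITY GROUP before the failover; either way, the original replica stays untouched as the failback target until the new one has proven itself.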
Now… back to those missing logs. If you haven't already figured it out, when a VM is reverted to a snapshot, it loses all of the data and logs that were created after the snapshot was taken. Unfortunately, in this case the rollback by reverting to a snapshot also killed all of the diagnostic information that could have been used to determine the root cause of their patching issues. I honestly believe this is a situation where panic started to set in and everything for the admin involved became quicksand: the more they did, the faster they felt like they were sinking.
Sometimes the best thing you can do when something is going catastrophically wrong is NOTHING. Walk away, take a break, collect your thoughts, and regroup before working the problem. I somewhat jokingly say that as a consultant it's easy for me to remain calm and level-headed in a 'production down' situation because it's not my system or job on the line. However, I've spent more than my fair share of time in the hot seat before joining SQLskills, including the time I dropped the largest table in a SOX-audited production financial database because a developer distracted me with a question while I was comparing two different SSMS windows against each other to produce a change script to send through change control.
While rolling back by reverting to a snapshot seemed like a fast and easy fix when things started going sour in this environment, it unfortunately created an entirely new set of problems to be dealt with. It also meant they were going to have to repeat the same patching event, and potentially face the same problems again, without being able to roll back to a snapshot to troubleshoot their way forward. Luckily, the problem was a script upgrade failure of msdb: the DatabaseMailUserRole schema and TargetServerUserRole objects existed in an incorrect state, the script upgrade process failed, and the engine shut down and would not start. This is a known issue in certain situations that is easily remedied by starting SQL Server with a trace flag to bypass script upgrade mode, manually removing the DatabaseMailUserRole and TargetServerUserRole security objects from msdb, and then restarting without the trace flag so the script upgrade process can run to completion and recreate those objects.
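For anyone who hits the same wall, the documented remediation looks roughly like the sketch below. Trace flag 902 bypasses the script upgrade step at startup; the instance name and the exact schema/role names should be confirmed against the Books Online topic for your build before dropping anything, so treat this as an outline rather than a copy-and-paste fix.

```sql
-- Step 1: start the instance with trace flag 902 to bypass script upgrade mode,
--         e.g. from an elevated command prompt (default instance shown):
--             net start MSSQLSERVER /T902

-- Step 2: connect and remove the objects that are blocking the msdb script upgrade.
--         The names below follow this post; verify them against the Books Online
--         topic for the error you actually received before dropping anything.
USE msdb;
GO
DROP SCHEMA [DatabaseMailUserRole];
DROP ROLE   [DatabaseMailUserRole];
GO
DROP SCHEMA [TargetServerUserRole];
DROP ROLE   [TargetServerUserRole];
GO

-- Step 3: stop the instance, remove the trace flag, and restart normally so the
--         script upgrade process runs to completion and recreates these objects:
--             net stop MSSQLSERVER
--             net start MSSQLSERVER
```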
Summary: had the error message in the startup error log after the patching reboot been reviewed and searched online, they'd have found the Books Online topic that describes the exact steps to fix the issue. Had they not applied patches to all the servers at the same time, without testing that failover was going to be successful, they could also have failed the instance back to one of the other cluster nodes and then taken the time to work through what was in the logs. Lessons learned!