Keyset Does Not Exist – Add-ClusterNode

While working on a video recording for Paul this week, I ran into an interesting problem with one of my Windows Server 2016 clusters. Attempting to add a new node to the cluster raised an exception from Add-ClusterNode:

The server ‘SQL2K16-AG03.SQLskillsDemos.com’ could not be added to the cluster.
An error occurred while adding node ‘SQL2K16-AG03.SQLskillsDemos.com’ to cluster ‘SQL2K16-WSFC’.

Keyset does not exist
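
For reference, the call itself was nothing unusual; something along these lines, run from an existing node, with the node and cluster names taken from the error above:

# Add the new node to the existing cluster (FailoverClusters module, run as a domain admin).
Add-ClusterNode -Cluster 'SQL2K16-WSFC' -Name 'SQL2K16-AG03.SQLskillsDemos.com'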

The Windows account I was using was the domain administrator account, and I had recently made changes involving the certificate store on this specific VM, so I took a backup of the VMDK, reverted to a snapshot, and tried again; this time it worked. Needless to say, I was intrigued as to what I could have done to cause this error. It turns out that while installing an SSL certificate for SQL Server to use, I had negatively impacted the permissions on the C:\ProgramData\Microsoft\Crypto\RSA folder and the keys protected inside it, which happen to be the private keys for the certificates on the server. The normal permissions for this folder can be seen in the screenshot from the working node in the cluster.

UPDATE: 1/24/2020 – A reader provided a link from Microsoft Support for the default permissions as well (https://support.microsoft.com/en-us/help/278381/).
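
If you want to compare a suspect node against a known-good one without clicking through the Security dialogs, the ACLs can be dumped from an elevated prompt. A minimal check, assuming the default machine key store location:

# Show the ACL on the machine key store folder (run elevated).
icacls "C:\ProgramData\Microsoft\Crypto\RSA\MachineKeys"

# Or the PowerShell equivalent, which also shows the owner.
Get-Acl "C:\ProgramData\Microsoft\Crypto\RSA\MachineKeys" | Format-List Owner, AccessToString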

 

On the broken copy of the VM, the Owner was not set, none of the other permissions were even close to matching, and inheritance had been propagated down onto the keys (oops!!!). The really crazy thing is that I still don't know exactly how I caused this to happen, but none of the keys were accessible to anything on that node. While I was able to go into the folder and manually set the Owner and all of the permissions on each of the keys in the MachineKeys folder so the node could be joined to the cluster, I ultimately evicted the node and rebuilt it instead of relying on manually applied permissions. Here is why:

The owner of the folders is SYSTEM, which is easy to go back and set, but to change the permissions on the keys inside the MachineKeys folder, you first have to take ownership of each key:

Even as an Administrator, you don't have Read access, so when you click the Advanced button it can't show any information and offers to retry using administrative permissions. That fails again when you click Continue, and the only way to view the object's permissions is to take ownership of it first, which means you never get to see who the correct owner actually should have been.

However, in a real production-down scenario, where a permissions mistake means the cluster nodes can't join the cluster, start the cluster service correctly, or take ownership of resources, manually taking ownership and setting permissions on the keys can get you out of a pinch, as shown in this screenshot from another environment where I intentionally destroyed the permissions in order to finish this blog post.
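
If you do go down that path, the clicking can at least be scripted. The sketch below is a recovery step only, not a restoration of the documented defaults from the KB article linked above: it takes ownership of the machine key store recursively and grants SYSTEM and Administrators full control so the private keys become readable again, after which you should plan to re-apply the correct defaults or rebuild the node.

# Run from an elevated prompt on the broken node; path is the default machine key store.
takeown /F "C:\ProgramData\Microsoft\Crypto\RSA\MachineKeys" /A /R /D Y

# Put SYSTEM back as the owner of the folder and every key file under it.
icacls "C:\ProgramData\Microsoft\Crypto\RSA\MachineKeys" /setowner "NT AUTHORITY\SYSTEM" /T

# Grant full control to SYSTEM and the local Administrators group.
icacls "C:\ProgramData\Microsoft\Crypto\RSA\MachineKeys" /grant "NT AUTHORITY\SYSTEM:(F)" /T
icacls "C:\ProgramData\Microsoft\Crypto\RSA\MachineKeys" /grant "BUILTIN\Administrators:(F)" /T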

Fixing “A previous restore operation was interrupted and did not complete processing”

I was recently working on a customer system where a myriad of bad things had occurred simultaneously: the primary database server had corruption in an important database, and the DR log shipping secondary couldn't be rolled forward because the restore job was failing with the following type of error:

RESTORE LOG AdventureWorks2014
FROM DISK = 'D:\SQLBackups\Logs\AdventureWorks2014_LOG_202001021250.trn'
WITH FILE = 1, NORECOVERY, STATS = 10;

14 percent processed.
31 percent processed.
49 percent processed.
66 percent processed.

Msg 3203, Level 16, State 1, Procedure s_res_translog_for_database, Line 70
Read on “D:\SQLBackups\Logs\AdventureWorks2014_LOG_202001021250.trn” failed: 38(Reached the end of the file.)
Msg 3013, Level 16, State 1, Procedure s_res_translog_for_database, Line 70
RESTORE LOG is terminating abnormally.

Further log restores on the log shipping secondary, even when run manually, would also hit this error:

Msg 4319, Level 16, State 3, Line 1
A previous restore operation was interrupted and did not complete processing on file ‘AdventureWorks2014log’. Either restore the backup set that was interrupted or restart the restore sequence.
Msg 3013, Level 16, State 1, Line 1
RESTORE LOG is terminating abnormally.

Since the log shipped database was multiple terabytes in size, reinitializing it from a full backup wasn't an ideal option, but what other options were there?  The key lies in the original error message from log shipping: the restore hit the end of the log backup file where it wasn't expecting the file to end.  This means one of two things happened:

  1. The log backup was still in flight when the primary server crashed and restarted, so only a partial backup file was written, copied over to the log shipping secondary, and then picked up by the restore job (NOTE: I have since proven this theory isn't possible, because the header would have prevented the restore)
  2. The backup had already completed, but the copy job was still running when the server crashed, so the file was only partially copied to the log shipping secondary, and the restore job then tried to restore that partial file, generating the first error above

Comparing the file on the source server to the file on the log shipping secondary revealed that the sizes were in fact different, so scenario two above is most likely what occurred.  The partial backup file on the secondary was deleted and the full file was manually copied over again. However, when the restore job ran, it would still fail with error 4319. Manually applying the file with RESTORE LOG would also result in:

Msg 4319, Level 16, State 3, Line 1
A previous restore operation was interrupted and did not complete processing on file ‘AdventureWorks2014log’. Either restore the backup set that was interrupted or restart the restore sequence.
Msg 3013, Level 16, State 1, Line 1
RESTORE LOG is terminating abnormally.
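
Before doing anything more drastic, it is worth confirming that the re-copied file itself is complete and readable. A quick check, using the same path as the examples above, is to read the header and then verify the backup set:

-- Reads only the backup header; this fails quickly if the file is truncated or unreadable.
RESTORE HEADERONLY
FROM DISK = 'D:\SQLBackups\Logs\AdventureWorks2014_LOG_202001021250.trn';

-- Verifies that the backup set is complete and readable without actually restoring it.
RESTORE VERIFYONLY
FROM DISK = 'D:\SQLBackups\Logs\AdventureWorks2014_LOG_202001021250.trn'
WITH FILE = 1;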

It seemed totally broken, so I tried the same file again, this time specifying CONTINUE_AFTER_ERROR, since at this point there wasn't really much risk of doing more harm.  I gave it maybe 50/50 odds of working, but if it didn't we were going to have to reinitialize from a full backup either way, so it was worth trying.

RESTORE LOG AdventureWorks2014
FROM DISK = 'D:\SQLBackups\Logs\AdventureWorks2014_LOG_202001021250.trn'
WITH FILE = 1, NORECOVERY, STATS = 10, CONTINUE_AFTER_ERROR;

The file restored, and the log shipping restore job immediately resumed normal restore operations on the subsequent files.  Because the re-copied file was not damaged, this worked: the restore completed and cleared the flag left by the earlier end-of-file failure, so further log restores could be applied without errors.
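
If you want to confirm that the restore chain really has resumed, the restore history on the secondary is an easy place to check; something along these lines, using the database name from the examples above:

-- Most recent restores applied on the secondary, newest first (restore_type L = log).
SELECT TOP (20)
    rh.restore_date,
    rh.restore_type,
    bmf.physical_device_name
FROM msdb.dbo.restorehistory AS rh
JOIN msdb.dbo.backupset AS bs
    ON rh.backup_set_id = bs.backup_set_id
JOIN msdb.dbo.backupmediafamily AS bmf
    ON bs.media_set_id = bmf.media_set_id
WHERE rh.destination_database_name = N'AdventureWorks2014'
ORDER BY rh.restore_date DESC;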

SQL Server PDB Symbols and Callstacks The Easy Way

Lately, I have been doing a lot of work troubleshooting certain behaviors in SQL Server for workloads that are, to put it simply, designed horribly. As part of this, I have found it necessary to collect callstacks with Extended Events and to materialize them using the debug symbols for SQL Server, to see exactly where certain types of issues are being encountered and to better understand some of the internals of newer features in SQL Server 2017 and 2019. Years ago I blogged about how to use the package0.callstack action in Extended Events for this type of thing, and Paul also has a blog post that covers how to download the PDB symbols for SQL Server, as well as a post that demonstrates using the package0.callstack action to determine what causes a particular wait type. Using the debugging tools to get the symbols is somewhat clunky and tedious, so when I happened on this amazingly simple method of getting symbol files I had to share it.
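
For context, collecting the raw callstacks only requires adding the package0.callstack action to an event session. The sketch below is illustrative rather than the exact session I used: it captures the callstack each time a wait completes and buckets the results by unique callstack in a histogram target, and you would swap in whatever event and predicate match the behavior you are chasing.

-- Illustrative callstack collection session; adjust the event and predicate to suit.
CREATE EVENT SESSION [CollectCallstacks] ON SERVER
ADD EVENT sqlos.wait_info
(
    ACTION (package0.callstack)
    WHERE duration > 0
)
ADD TARGET package0.histogram
(
    SET filtering_event_name = N'sqlos.wait_info',
        source = N'package0.callstack',
        source_type = 1    -- 1 = group on an action rather than an event column
)
WITH (MAX_DISPATCH_LATENCY = 5 SECONDS);
GO

ALTER EVENT SESSION [CollectCallstacks] ON SERVER STATE = START;

The stacks captured this way are just raw addresses, which is where the symbol files come in.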

The SQLCallstackResolver on Github has to be one of the greatest things since sliced bread if you want to materialize callstacks from SQL Server. All you need is the current release of the application and the appropriate SQL Server binaries, and it will create a PowerShell script to download the matching symbol files. There is even a Wiki page with scripts for downloading most of the builds of all of the versions of SQL Server, but if one happens to not be there, you can run the application, point it at the Binn path where that build's executable and DLLs are located, and it will generate the PowerShell script to download the symbols. What's even better is that the application lets you paste a callstack from any version of SQL Server into the window, point the Symbols path at the folder with the downloaded symbols, and it uses the Microsoft.Diagnostics.Runtime.dll project to materialize the callstack into the function names and calls. So if you have the output of the callstack and you know which build of SQL Server generated it, you can now materialize the function names with just this application. No need to take a memory dump of SQL Server, no need to use Symchk, no need for trace flag 3656 live on your server; you can copy/paste/resolve in one place.

This is very, very cool!