While working on a video recording for Paul this week, I ran into an interesting problem with one of my Windows Server 2016 clusters. When I tried to add a new node to the cluster, Add-ClusterNode threw an exception:
The server ‘SQL2K16-AG03.SQLskillsDemos.com’ could not be added to the cluster.
An error occurred while adding node ‘SQL2K16-AG03.SQLskillsDemos.com’ to cluster ‘SQL2K16-WSFC’.
Keyset does not exist
The Windows account I was using was the domain administrator account, and I had recently made changes involving the certificate store on this specific VM. So I took a backup of the VMDK, reverted to a snapshot, and tried again, and this time it worked. Needless to say, I was intrigued about what I could have done to cause this error. It turns out that while installing an SSL certificate for SQL Server to use, I had damaged the permissions on the C:\ProgramData\Microsoft\Crypto\RSA folder and on the keys protected inside it, which are the private keys for the certificates on the server. The normal permissions for this folder can be seen in the screenshot from the working node in the cluster.
UPDATE: 1/24/2020 – A reader also provided a link to the Microsoft Support article listing the default permissions (https://support.microsoft.com/en-us/help/278381/)
On the broken copy of the VM, the Owner was not set, none of the other permissions came close to matching, and inheritance had been propagated (oops!!!). The really crazy thing is that I still don't know exactly how I caused this, but none of the keys were accessible to anything on that node. I was able to go into the folder and manually set the Owner and all the permissions on each of the keys in the MachineKeys folder so the node could join the cluster. However, I ultimately evicted the node and rebuilt it instead of relying on manually applied permissions. Here is why:
The owner of the folders is SYSTEM, which is easy to set again. To set the permissions on the keys inside the MachineKeys folder, however, you first have to take ownership of each key.
Even as an Administrator, you don't have Read access to the key. When you click the Advanced button, the dialog can't show any information and offers to retry with administrative permissions. That also fails when you click Continue, and the only way to view the object's properties is to take ownership. Because you could never see the original owner, you no longer know who the correct owner should have been.
However, you might be in a real production-down scenario, where a permissions mistake leaves cluster nodes unable to join the cluster, start the cluster service correctly, or take ownership of the resources. In that case, manually taking ownership of the keys and setting their permissions can get you out of a pinch. The screenshot shows this in another environment, where I intentionally destroyed the permissions while trying to complete this blog post.
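Before taking ownership of anything, it can help to know which key files are actually affected. The following is a minimal Python sketch, assuming Python is installed on the node and the script runs from an elevated prompt. find_unreadable_keys is a hypothetical helper, not part of any Windows tooling. It only reports key files that the current account cannot open for reading, and it changes nothing. Some keys may legitimately be restricted to a specific service account, so treat its output as a list to compare against a working node rather than as proof of damage. If the MachineKeys folder itself cannot be listed, os.scandir raises PermissionError, which is a signal on its own.

```python
import os
from typing import List

# Default location of machine-level RSA private keys on Windows.
MACHINE_KEYS = r"C:\ProgramData\Microsoft\Crypto\RSA\MachineKeys"


def find_unreadable_keys(key_dir: str = MACHINE_KEYS) -> List[str]:
    """Return sorted names of files in key_dir the current account cannot read.

    Hypothetical diagnostic helper: it only reports problems and never
    changes ownership or permissions on the key files.
    """
    unreadable = []
    # os.scandir raises PermissionError if the folder itself is not listable.
    with os.scandir(key_dir) as entries:
        for entry in entries:
            if not entry.is_file():
                continue  # only individual key files are of interest
            try:
                with open(entry.path, "rb") as handle:
                    handle.read(1)  # reading one byte is enough to prove read access
            except PermissionError:
                unreadable.append(entry.name)
    return sorted(unreadable)
```

On the node, calling find_unreadable_keys() with no arguments checks the default MachineKeys path. Running the same call on a healthy node gives you a baseline to compare against.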