(The Curious Case of… used to be part of our bi-weekly newsletter, but we decided to make it a regular blog post instead so it can sometimes be more frequent. It covers something interesting one of us encountered when working with a client, doing some testing, or was asked in a random question from the community.)
From Jonathan this time…
Recently I worked with a client that has an Availability Group configured across two replicas, where the secondary replica became suspended because a transaction log auto-growth ran out of space on the local disk hosting the log file. Unfortunately the client didn't realize that the secondary replica had suspended data movement until the primary replica also ran out of space and began throwing "The transaction log for database '%ls' is full due to '%ls'." errors.
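For reference, a suspended secondary is easy to spot if you look for it: the DMV check below flags it directly, and on the primary the pinned log shows up as an AVAILABILITY_REPLICA log reuse wait. A minimal sketch, assuming you just want anything currently suspended:

    -- Flag any AG database whose data movement is currently suspended.
    SELECT ag.name AS availability_group,
           ar.replica_server_name,
           DB_NAME(drs.database_id) AS database_name,
           drs.synchronization_state_desc,
           drs.suspend_reason_desc,
           drs.log_send_queue_size  -- KB of log waiting to ship to this replica
    FROM sys.dm_hadr_database_replica_states AS drs
    JOIN sys.availability_replicas AS ar ON drs.replica_id = ar.replica_id
    JOIN sys.availability_groups AS ag ON drs.group_id = ag.group_id
    WHERE drs.is_suspended = 1;

    -- On the primary, a log pinned by a suspended secondary shows up here:
    SELECT name, log_reuse_wait_desc
    FROM sys.databases
    WHERE log_reuse_wait_desc = N'AVAILABILITY_REPLICA';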
By the time we got involved, the transaction log on the primary replica was over 2TB in size, the result of the client executing a DBCC SHRINKDATABASE command to try to shrink the log file and free up space so the secondary replica could resume. Unfortunately this was the wrong way to try to resolve the problem, and it made things much worse. When the transaction log on a secondary replica fails to grow automatically during redo of the logged operation that grew the primary's log file, the secondary becomes suspended until it can perform that operation. Shrinking anything on the primary replica has no effect on the secondary until the secondary is resumed and can perform the operation that caused it to suspend; in the meantime, all of that log has to be retained for the suspended secondary.
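You can watch this happening: while a secondary is suspended, used log space on the primary stays pegged no matter how many shrink commands are issued, because none of the outstanding log can be truncated. A quick sanity check, run in the context of the affected database on the primary:

    -- Used log space won't drop while a suspended secondary pins the log,
    -- because log truncation is blocked until the secondary hardens it.
    SELECT total_log_size_in_bytes / 1073741824.0 AS total_log_gb,
           used_log_space_in_bytes / 1073741824.0 AS used_log_gb,
           used_log_space_in_percent
    FROM sys.dm_db_log_space_usage;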
Looking at the secondary, we discovered a full-text crawl log file over 500GB in size consuming space on the same disk as the database data and log files. Two years ago I blogged about the need to recycle the full-text engine's crawl log files manually to prevent these massive files from being created, but this was the largest one I have seen to date.
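If you haven't read that post: these are the SQLFT*.LOG files in the instance's Log folder, and they can be recycled with the undocumented procedure below so a fresh file is started and the old one can be archived or deleted. A sketch; the database and catalog names here are placeholders:

    -- Recycle the full-text crawl log for a catalog; run this in the
    -- database that owns the catalog. 'YourFTCatalog' is a placeholder.
    USE [YourDatabase];
    GO
    EXEC sp_fulltext_recycle_crawl_log @ftcat = N'YourFTCatalog';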
The first thing we did was kill the DBCC SHRINKDATABASE command that was causing excessive logging on the primary replica, and then clean up some files in the log folder on that server to free enough space for the transaction log to grow and put the database back into a functional state. Deleting those files wouldn't normally have been necessary, but because the log growth from the shrink operations had chewed up all of the available space, it was the safest way to get a little space back so the rest of the problem could be worked on.
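Finding the shrink to kill is straightforward; a shrink shows up in sys.dm_exec_requests with a Dbcc command value, and KILL takes the literal session_id. A sketch:

    -- Locate the session running the shrink (DBCC SHRINKDATABASE/SHRINKFILE
    -- typically reports a command value like 'DbccFilesCompact').
    SELECT session_id, command, status, percent_complete, start_time
    FROM sys.dm_exec_requests
    WHERE command LIKE N'Dbcc%';

    -- KILL requires a literal session_id, e.g. if the shrink is session 67:
    -- KILL 67;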
Then the SQL Server instance on the secondary replica was stopped, the log files (fulltext and errorlog, not database log files) were deleted to free up space, and the instance was restarted. This didn't solve the issue entirely, however, as the database then had to be manually resumed before data synchronization would start again. We also discovered that a non-AG database on the secondary replica was sharing the same disk as the main database; it had only been created for a temporary problem investigation and could be removed to free up another 400GB of space on the disk.
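Resuming data movement is a one-liner run on the secondary once space is available (the database name is a placeholder):

    -- Run on the secondary replica to restart data synchronization.
    ALTER DATABASE [YourDatabase] SET HADR RESUME;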
After 8 hours the secondary replica had fully resynchronized back into the AG, and the log file on the primary could then be shrunk back down to a normal size, freeing space on both servers.
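For the shrink itself, DBCC SHRINKFILE against just the log file is the targeted approach, unlike the SHRINKDATABASE attempt earlier. A sketch, assuming the common file id of 2 for the log; confirm it in sys.database_files first, and note it can take log backups to free enough VLFs before the shrink can actually release space:

    USE [YourDatabase];
    GO
    -- Confirm which file is the log before shrinking.
    SELECT file_id, name, type_desc, size * 8 / 1024 AS size_mb
    FROM sys.database_files;

    -- Shrink the log file (file id 2 here) back to ~10GB; the target is in MB.
    DBCC SHRINKFILE (2, 10240);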
Bottom line: make sure you understand the cause and potential effect of any change before you make it on a production system, and monitor your Availability Group database states so that you don't end up in a bad situation to begin with. Luckily, with a little out-of-the-box thinking and an understanding of the root causes of the problem, we were able to resolve this without having to remove an 8TB transactional database from the AG and fully reinitialize it just to unpin the transaction log.
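As a minimal safety net on the monitoring side, a SQL Agent alert on error 35264 (the "data movement has been suspended" message) will fire as soon as a database suspends. This is a sketch; you would pair it with sp_add_notification and an operator you already have defined:

    USE msdb;
    GO
    -- Alert whenever AG data movement is suspended (error 35264).
    EXEC dbo.sp_add_alert
        @name = N'AG Data Movement Suspended',
        @message_id = 35264,
        @severity = 0,
        @enabled = 1,
        @include_event_description_in = 1;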