Availability Group Readable Secondaries – Just Say No

Since SQL Server 2012, Microsoft has put a lot of efforts into marketing AlwaysOn Availability Groups as a feature and solution for both high availability/disaster recovery for SQL Server, and scalability for read-only workloads.  Over the last six years, I have talked more clients out of using Availability Groups than I have actually implemented for SQL Server for a lot of different reasons.  As easy as Microsoft has made it to build an Availability Group, restore and synchronize the databases, join secondary servers, etc. the simple fact is that once they are in production usage, you had better know what not to do that could break things, the impact having a readable secondary, and how to troubleshoot problems when they occur, because they will.

A lot of the implementation questions I get for Availability Groups are generally looking at leveraging a secondary replica for High Availability or Disaster Recovery and often times also for offloading reporting or readable workloads.  Readable Secondary copies of the database are really simple at the surface conceptually, but there are a lot of additional considerations I make that could lead you to another technology entirely like Transactional Replication or Log Shipping with Standby. Let’s look at how these three stack up against each other for different data access requirements.

Real-time Data

First things first, for real-time data access, you only have one option, the primary writeable copy of the database. In replication this is the publisher database, log shipping calls it the log shipped primary, and Availability Groups refers to it as the primary replica.  None of the secondary copies of the data, even in a synchronous Availability Group replica, is ever guaranteed to actually be real-time data.  A synchronous Availability Group only guarantees that the log records for an operation are hardened in the synchronous secondary replicas transaction log allowing them to later be redone on the secondary database when a transaction commits.  Depending on the volume of activity that is occurring, and whether the secondary replica is being used for read-only workloads or has sufficient hardware, the redo thread can typically be anywhere from a few hundred milliseconds to a few seconds behind the actual primary replica.  This means that if you insert data into a table, scope the identity value and return it back to the application, then attempt to read that record immediately from the read-only secondary, it could actually not be there (Yes, this actually does happen).

Near Real-time Data

For near real time data access, both Replication and the Availability Group can meet the requirement for near real-time data. Replication typically will have slightly more latency than an Availability Group readable secondary in either commit mode by a few seconds due to the way replication functions.  The log reader agent has to read the transaction log, write replication changes to the distribution database, and then the distribution agent has to pickup those changes up and apply them to the subscriber.  This typically has a latency of 3-7 seconds under normal conditions, where an Availability Group readable secondary could have less latency depending on the volume of activity. The big difference between the two is entirely in the implementation and features supported.

If you need a non-clustered index to support the read-only workload, that index must be created on the primary database of an Availability Group for it to be available on the readable secondary.  This means you have to backup and maintain that index and the maintenance will affect all servers in the Availability Group as a logged operation requiring redo on secondary replicas.  This is not the case with transactional replication, where the subscriber database is a writeable copy of the database and can have a completely different indexing design from the publisher database that is based on just the read-only workload.  Since the subscriber database is writeable, that also means that you can use features like Query Store, plan forcing, and automatic plan correction for the read-only workload; all of these are unavailable for a readable secondary replica in an Availability Group because the database is an exact duplicate of the primary (which can be enabled for Query Store and use these features).

Another consideration is the impact of the 14-byte row-version tag generation to support Read Committed Snapshot Isolation (RCSI) for the read-only replica in an Availability Group.  This can only be generated on the primary replica database, but is not required on the publisher database for transactional replication. Why does this matter? It can lead to page splits that cause logical fragmentation of the indexes. Moving to a readable secondary may require monitoring index fragmentation and introducing fill factor changes where you wouldn’t have needed them previously. If you have read the incredibly bad information online telling you that you can ignore index fragmentation entirely with SSDs, or that you don’t need to rebuild indexes with an Availability Group, consider this: logical scan fragmentation may not matter much from a performance standpoint on fast storage, but a low page density caused by page splits and fragmentation affects every single replica copy and wastes space on disk and in memory in the buffer pool, and affects plan select due to query costing by the optimizer. With replication, all of this is avoided entirely unless you use RCSI for your publisher workload anyway, but that does not impact the subscribers in anyway since they are only replicated copies of the data modification operations, not the pages themselves.

Point-in-time Data

For point in time data access, I often go low tech and suggest using Log Shipping with Standby and delayed restore to create a warm copy of the database that is restored to a point in time.  This can be useful for accounting reporting on the previous days sales transactions, performing ETL into a data mart or data warehouse, and it also provides the benefit of having a second copy of the database at a known point in time for DR purposes.  You can also do this type of reporting off of the near real-time data provided by Replication and Availability Groups typically as well. The table below breaks down the features and requirements of each of these three technologies for providing a readable copy of the database.

Real Time Data Secondary Delay Readable Workload Indexes Row-version Tag Overhead Query Store Support Automatic Plan Correction Partial Data Only
Availability Groups (Sync) Primary Only Depends on Redo typically within a few seconds Must be created on Primary Primary database with Readable Secondary Primary Only Primary Only

No

Availability Groups (Async) Primary Only Depends on Log Send Queue and Redo typically within a few seconds Must be created on Primary Primary database with Readable Secondary Primary Only Primary Only

No

Replication Publisher Only Depends on log reader and distribution agent typically within a few seconds May be created on subscriber copy only Only if RCSI enabled Publisher and Subscribers Publisher and Subscribers

Yes

Log Shipping Primary Only Depends on restore job configuration and standby – Point in time only, updates require disconnecting users Must be created on Primary Only if RCSI enabled Primary Only Primary Only

No

Complexity and Manageability

So what about complexity and manageability of these different options?

In the opening paragraph of this post I pointed out how easy Microsoft has made it to build an Availability Group to handle read-only copies of a database or group of databases. Setting up an Availability Group is really quite simple but that doesn’t mean it’s not lacking in complexity.  First you need to know what kind of Availability Group are you going to implement.  What started as a single feature has morphed over each release into:

  • Basic Availability Groups – Standard Edition (single database)
  • Enterprise Availability Groups – Enterprise Edition (one or many databases)
  • Distributed Availability Groups – Support for up to 17 readable secondaries without daisy-chaining
  • Domain Independent Availability Groups – Environments without Active Directory
  • Read-scale Availability Groups – No clustering requirement or HA

Add Linux into the mix of options as well as whether or not you need failover clustering as a part of that configuration, and this isn’t as simple as it seems.  One of the common areas of failure with Availability Groups that I’ve seen is with choosing the correct quorum configuration for the Windows Failover Cluster configuration, though Microsoft has made this easier in Windows Server 2012 R2 and higher. An error in quorum configuration can result in your High Availability solution failing and taking the databases offline as a result.

Replication tends to get a bad rap when it comes to complexity and manageability. The reality is that the technology just isn’t well understood by most SQL Server professionals, and it is really easy to configure (there’s a great UI in SSMS for it also), and as long as you don’t play with it or modify data on the subscriber databases, it usually just works without a hitch. Change management is something that has to be planned out with Replication, where it wouldn’t have to be with Availability Groups.  New tables have to be added to replication manually as an article, and certain operations like dropping and recreating a table or truncating a table are not allowed when the table is being replicated (sorry, no more using SSMS designer to insert a column in the middle of a table). Compared to Availability Groups, just for change management, Replication is much more complex and requires more work to manage over time, but the types of changes that require manual intervention in Replication typically are not occurring at a frequent pace for established applications and databases.

Log Shipping is just stupid simple low-tech and easy to manage. I mean for real how difficult is taking a backup and restoring it on another server? You might be surprised, but the configuration wizard and Agent Jobs automate this so that you typically don’t have to do anything but let it do its thing. Even third party backup tools offer their own flavor of implementation of log shipping as a part of their backup products for SQL Server typically. If something ever gets out of sync for log shipping, it usually means that someone took a manual log backup of the database, or switched recovery models and broke the log chain, and the standby database will have to be reinitialized from a new FULL backup of the primary to resume the restores (or you can manually apply the missing backup if a manual log backup was performed to close the gap).

Summary

While Availability Groups have become the Go To technology for providing a readable copy of data for SQL Server, they are not the only feature available to meet this requirement.  It is important to understand the requirements and the limitations of any feature that you are considering using, and to also evaluate the other available features that might be a better fit for the needs of the environment.  I have used transactional replication for nearly 10 years to solve read-only workload and reporting requirements for SQL Server and continue to use it over Availability Groups where as a technology, it is just a better fit for the requirements.  Being able to reduce indexing on the primary writeable copy of the database and only having readable workload non-clustered indexes on the subscribers can be a huge benefit for some workloads. For one client it was the difference between a 800GB publisher database and a 1.7TB primary AG replica which affected their backup/restore times and RTO SLAs.  Collect the requirements, evaluate the technologies and limitations, then choose the appropriate way to solve a problem.

23 thoughts on “Availability Group Readable Secondaries – Just Say No

  1. Very interesting post, thanks! In our case the rate of replication-breaking schema changes is so high (as the business keeps growing rapidly) that the simplicity of the change management is a huge benefit of AGs. It comes with a price of course — our DBA team has been tested under fire more than once on various AG complexities. Still, it works very well for us.

  2. Well done as usual Jonathan. I have been preaching to my clients about the issues with readable secondaries almost since they were introduced. I have also come across several that were really getting hosed by the 14-byte pointer gotcha.

  3. You guys with your doom and gloom on AGs need to stop. You’re scaring me.
    I have had nothing but success implementing AG’s since SQL 2012. Readable secondary, synchronous and all. I have implemented AG’s with gMSA’s and Kerberos authentication in Windows/SQL 2016 and I have none of these issues being talked about with AG’s.
    I can only imagine that many of the issues people have with AG’s are because their environments lack the resources to fully support them or maybe I am just lucky and the other shoe is about to drop ?

    AG’s and clustering require an understanding of networking, DNS and other technologies that regular vanilla SQL Server does not. It also should not be implemented on cheap/unsupported hardware/network. If your network/hardware is not well supported, don’t do clustering or AG.

    Log shipping is awesome. Great solution to most reporting/warm secondary needs. I’ve set up log shipping with an FTPS middle layer that at times I forgot was there because it was so reliable. If you have to use it for DR, it’s a good, workable solution.

    I would NEVER recommend replication to anyone unless there were absolutely no other way. Replication has a ton of moving parts. I can’t imagine there are many real cases to be made where replication is a better fit for HA/readable secondary than AG’s or log shipping. Write/Write copies is the only use case I can think of and it’s the worst one. Bleh, how much trouble do you want in life ?
    The world is hostile enough. Don’t recommend replication.

    1. I have plenty of real cases where replication is a better fit and does a much better job for reporting and readable workloads. Replication is no more or less complex than an AG secondary to support if you understand the feature an the requirements. The world is hostile enough, that’s why I don’t push Availability Groups where they are a poor fit or something else does a much better job for the task at hand, like Replication.

      1. Sorry bro – Andy’s right – AGs beat repl hollow. Been doing this (DBA and fullstack dev) work for decades including remote DBA work for 100s of customers – replication as a preferred solution is very rare. AGs rock the house but you do need skillz. Good on yer for mentioning 14-byte opinters tho.

    2. “I can’t imagine there are many real cases to be made where replication is a better fit for HA/readable secondary than AG’s or log shipping.”

      There are many real cases (Some of it already mentioned in Jonathan table.)
      – where you do not need hundreds or thousands tables on your readable copy. Just specific ones.
      – where cost is important, Subscriber can be on Standard Edition
      – you can replicate to old versions of SQL server
      – indexes can be created only subscriber (in AG need to be created on Primary as well)
      – very flexible (if you know how to configure it- including horizontal and vertical filtering and many many more)

      In our env we are using both AG and transactional replication but for very different reasons. AG for HA, replication for data movement and reporting.

      Botch AG and Replication has it place and both you need fully understand to use it properly

  4. Thanks for the interesting post. My understanding is, on sync AlwaysOn Availability Groups, SQL Server won’t commit the transaction on primary until it commits it on all Synced replicas.
    We may get latency and slowness, but data will always be in sync since it will be committed on secondary first. We will see data on primary first only if we use “Read Uncommitted”
    What you mentioned in the comparison table, it says we are not getting real-time data on the primary replica of a synced AlwaysOn availability group.
    Can you explain it a little?

    1. I explain this in the post above. Your only guarantee with a synchronous commit secondary is that the log records are hardened in the transaction log file of the secondary on disk. It DOES NOT guarantee that they have been redone by the redo threads on the secondary, or that querying data on the secondary will return the same results as on the primary.

  5. It can be complicated to determine index usage statistics on AGs. Before determining if an index is underused, we have to run sp_blitzindex on all of the replicas and try to integrate the output.

    1. That’s another great point and limitation that doesn’t exist with a Replication Subscriber. You only have to keep the indexes used on each of the systems being analyzed in isolation instead of across all of the AG Replica’s to know whether something is really used or not.

  6. I frequently encounter a serious problem in transactional replication, which is caused by the log reader agent doesn’t catch up with the data update on the publisher database. The error is “The Log Reader Agent is scanning the transaction log for commands to be replicated. Approximately XXXXXXXX log records have been scanned in pass # 3, 0 of which were marked for replication, elapsed time XXXXXX (ms).” Index rebuild will also lead to such problem too. I always need to drop the replication in order to fix it, otherwise the size of the transaction log will continue to expand.

    1. How many VLF’s are in the publisher database transaction log?

      https://www.sqlskills.com/blogs/kimberly/transaction-log-vlfs-too-many-or-too-few/

      The message that you are calling an “error” isn’t an error, it is an informational message telling you that the log reader has had to process large amounts of log records and 0 of the records being processed were actually marked for replication. Often where you see this and large latency for the log reader or where the log reader can’t catch up it is due to excessive VLF’s in the log file from a percent based autogrowth configuration. I would start by eliminating any excessive VLF’s in the log file and see if the issue goes away. If it doesn’t then latency in the log file I/O subsystem would be the next thing to look at, followed by how exactly are you doing index maintenance because you may be doing more than is actually beneficial and it’s just flooding the transaction log with unnecessary changes.

    2. I would also suggest looking into Log Reader Agent profile. In our case increasing REadBatchSize and LogScanTreshold helped (we did test it before).
      But still what Jonathan mentioned…VLFs ,how indexes are done and IO I would check first.

  7. Great Post Jonathan. I agree on most of them to choose wisely. I see benefit in my environment using AG as i can offload my backup and db maintenance tasks from primary replica to secondary replica. We were facing lots of issues with backups and maintenance like rebuild and update stats causing slowness for 30 TB database. However we do run under the risk of potential data loss using async but it meets RPO and RTO so we are good.

  8. Excellent article, Jonathan. I’d also add that building an AG is, in theory, ‘simple’, as long as some other team builds the cluster for you 🙂

  9. Hi Jonathan, I don’t think I like the title of this post, don’t get me wrong i agree with the content of the post but i feel the title gives the wrong impression that they should always be avoided, as pointed out by the content of the post, it depends on the requirements not just trying to even the flow of choices….

  10. Unfortunately, Query Store can’t be enabled on readable secondaries, because the database are readonly, and Query Store contains it’s repository data at the database level. This bad news if your use case for the secondary is to support most application queries.
    Also, some database tools by the most popular vendors won’t connect to a readable secondary, because there is simply no option in the connection dialog to supply the ApplicationIntent=ReadOnly property (only server name, database name, login). This is actually an issue of various SQL Server features and 3rd party tools needing to mature and be more AOG compatible, but it’s still something to keep in mind before implementing AOG.

    1. The table in the blog post reflects that you can’t enable Query Store except on the Primary Replica, just as it points out the fact that you can only do indexing of the primary database for AG readable workloads. You can connect directly to a readable secondary by not going through the listener, and by setting the option for readonly access to ALL with SECONDARY_ROLE(ALLOW_CONNECTIONS = ALL) and just point the application directly to the server name for that replica. Most applications do not follow readonly routing through the listener, and if you test that thoroughly you will find that it doesn’t always behave exactly how you would expect during failovers.

  11. Great post Jonathan!! Thank you for all the insights and key points to think about. Really helpful.

  12. Prerequisite for AGs is that the primary databases must be in full recovery model. That’s a pain for reindexing operations which then become fully logged and so take much longer to complete. For transactional replication the recovery model of the publisher can be simple, bulk logged or full. It doesn’t matter. For log shipping the recovery model of the primary db can be bulk logged or full recovery. Reindexing operations are much faster under bulk logged recovery model. Reindexing is a pain point for very large indexes for fully logged AG databases.

  13. Hi. Great article. I don’t understand this statement: “ and certain operations like dropping and recreating a table or truncating a table are not allowed when the table is being replicated (sorry, no more using SSMS designer to insert a column in the middle of a table). ”

    Does this mean you can’t drop or truncate a table WHILE it’s being replicated, or does it mean you can’t replicate a table if you plan on periodically dropping or truncating it?

    1. You have to drop the article from the publication before you could do either of those operations. It will raise an exception if you try to truncate a replicated table or drop it until it has been removed from replication. So you can technically remove the article do the change and add it back.

  14. Hi. Great article. Attended one of your sessions at a SQLIntersection BTW.

    I don’t understand this statement: “ and certain operations like dropping and recreating a table or truncating a table are not allowed when the table is being replicated (sorry, no more using SSMS designer to insert a column in the middle of a table). “

    Does this mean I can’t truncate or drop a table WHILE it’s being replicated, or that i can’t replicate a table if I plan on periodically truncating or dropping It?

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.