Yesterday on Twitter a few people expressed disgust at a conference slide listing transactional replication as a high-availability (HA) technology. I took exception to that and argued. I think the discussion merits a blog post so here it is. This point of view is controversial and I'm expecting some dissenting comments – bring them on!
I teach high-availability inside Microsoft and at the MCM level and I absolutely always teach that transactional replication, and it's more advanced brother – peer-to-peer replication, have their place as a high-availability technology. The two whitepapers on high availability I've written for Microsoft over the last two years also discuss transactional replication as a high-availability technology (links later).
To start with, let's define a high-availability technology as one which makes data more highly available in the event of a disaster. That definition does not imply any limits on the amount of data loss or downtime experienced – and to do so would be incorrect.
As some background, many people approach HA planning the wrong way – either by trying to shoe-horn a new strategy onto an incumbent and unsuitable technology or just picking the technology they've heard of – which is most commonly failover clustering. Planning an HA strategy means carefully laying out the business requirements, considering technical and non-technical limitations, reaching a compromise between the technologists and the business managers, and THEN starting to evaluate technologies. Once you start evaluating technologies you'll find that no single HA technology that SQL Server provides is a one-size-fits-all solution – they all have pros and cons. The resulting strategy will likely utilize multiple technologies to meet the strategy's needs.
You can read in more depth about my HA planning methodology in the whitepaper I wrote for Microsoft last year: High Availability with SQL Server 2008. This also has a good overview of transactional and peer-to-peer replication.
There are seven main arguments I hear against transactional replication as an HA technology – I'll take them on one at a time.
Argument 1: It's a scale-out technology, not an HA technology
So where's the actual argument here? Sure, transactional replication and it's more advanced brother, peer-to-peer replication, are primarily query scale-out technologies. But both provide extra copies of some data, so both can be used to provide higher availability of that data in the event of a disaster.
Argument 2: There's too much latency involved – how can that be HA?
High availability does not mean zero data loss and zero downtime. High availability means the data is more available in the event of a disaster than if you had no redundant copy. Sure, there is some latency involved with transaction replication but there's also latency involved with log shipping, and with asynchronous database mirroring.
The key is making sure that the availability provided by the technology fits within your desired downtime and data loss SLAs (service level agreements) – commonly known as RTO (recovery time objective) and RPO (recovery point objective), respectively. If you're fine with potentially losing 1/2 hour of data, and the transactional replication latency between the Publisher and Subscriber is only 10 minutes, then transactional replication can meet your data loss requirements.
Argument 3: It's really hard to know what you're missing if the Publisher crashes
Correct. But it's just the same as the other asynchronous technologies – log shipping and asynchronous database mirroring – so this is not a valid argument against transactional replication.
Argument 4: Replication is too complicated to set up
This argument is asinine IMHO. A technology may be complicated to use or troubleshoot, but that does not make it unsuitable – it just means you have to have someone who knows what they're doing. I've seen people mess up configuring log shipping – does that mean it's not an HA technology? No. Failover clustering can be fiendishly difficult to set up at the Windows level – does that mean it's not an HA technology? No. This argument is also invalid.
Argument 5: It only replicates some of the data
Correct. This means it's unsuitable if you'd like to protect an entire database or several databases – but if its restrictions work for you, no problem.
Argument 6: You can't make certain schema changes on a published table
Correct. But that doesn't mean it's not suitable for providing HA – as long as you can live with its restrictions.
Argument 7: There's no automatic failure detection or automatic failover
Correct. But neither does log shipping. This just means you have to provide the mechanisms for this. Transactional replication still means your data is available if the Publisher crashes.
[Edit: Argument 8: There's no easy way to fail back again
Correct – for transactional replication – that can be a PITA. With peer-to-peer replication that problem disappears because the changes that happened while one peer node was down are automatically pushed to it when it comes back online.]
Arguments FOR using transactional replication
So these arguments are really saying that transactional replication has some restrictions. Sure it does. As I said above, no single HA technology is a one-size-fits-all. Now, what about all the good things about transactional replication?
- It only replicates the data you're interested in. Unlike log shipping or database mirroring, transactional replication only ships changes to the data you're interested in, rather than the entire database. Yes, there are applications where this is good. And it can be especially useful if you have to perform regular index maintenance on large indexes and you don't have the network bandwidth to ship transaction log for these.
- It works in the SIMPLE recovery model, unlike log shipping or database mirroring, where you MUST be in the FULL recovery model (BULK_LOGGED is also permissible with log shipping). You don't have to perfom log management with log backups just because you're using an HA technology. Yes, there are strategies who don't want to be able to restore the entire database to a point in time.
- You can have as many Subscribers as you want. Well, to a point. You need to be careful of network bandwidth from the Distributor, the Distribution Agent location (especially if they're all push subscriptions), and some other things. Log shipping also allows as many secondaries as you want but database mirroring can only have one mirror.
- You can make use of republishing (having a Subscriber that is also a Publisher to down-stream Subscribers), which allows some interesting topologies and balancing of network traffic. No other SQL Server HA technology allows this.
- The replicated data is available for reading and updating on both sides (and changes can then be published again using updateable subscriptions, or more easily with peer-to-peer replication). No other SQL Server HA technology allows this and this is one of the primary reasons why transactional replication is sometimes the necessary choice.
- There's no wait time when a failover is necessary – the data is just there. With log shipping, recovery has to complete – which could take a while for any uncommitted transactions to roll back. With database mirroring, the REDO queue on the mirror must be recovered before the mirror database can come online as the principal.
It all comes down to making rational choices of technologies based on the HA strategy requirements and limitations. Transactional replication will work for some of you, and not for others – it has its limitations just like the other SQL Server HA technologies. But it should never be dismissed out-of-hand as not being a high availability technology.
Take a look at the whitepaper I wrote for Microsoft this year – Proven SQL Server Architectures for High Availability and Disaster Recovery – where one of the five real-world customer examples (with customer names, case studies, etc) is about using transactional and peer-to-peer replication. That example links to a very comprehensive document the SQL CAT team wrote about the decision process that led to using transactional replication and all the setup and tuning that was done.
[Edit: one more whitepaper I wrote in 2008 that you might be interested in: SQL Server Replication: Providing High Availability using Database Mirroring.]
Thanks for reading!
PS If you post a comment and it doesn't appear it's because I'm moderating comments against spam – not to filter out dissention. Be patient and it will appear!
18 thoughts on “In defense of transactional replication as an HA technology”
What is the state of affairs regarding the schema? Does xact repl include "everything"? Like triggers, indexes, constraints, data types. Will timestamp be timestamp etc? I honestly don’t know the current state of this, but if we don’t get "everything", then xact repl as HA is only for those who can understand and handle these things…
Nice to know this other side.
I Believe that Replication is the only kind of remotely keeping the data that allows filestream?
or am I wrong?
Is there any kind of P2P network load balancer?
I’m personally not fond with the fact that replication block me from doing some schema changes, like increasing the size of a column (sql 2000). I do prefer Mirroring. But Peer to Peer replication calls for a study, since I don’t know how it works. And to think about only replicating some data is really powerfull.
Do you happen to have any docs on where to start learning P2P replication?
100% agree!! Combine replication with a load balanced front-end that allows you to direct users to whichever server you want, and you’ve got a great HA solution!!
@Tibor Timestamp is just a number. Yes, you can replicate indexes. Data types – yes, even FILESTREAM. The others I’m not sure about – I’d have to check.
But yes, you need to know what you’re doing – just like with any other technology – and if it doesn’t meet your requirements, don’t use it. Doesn’t mean it’s not an HA technology though…
@Gabriel Log shipping allows FILESTREAM too, as does failover clustering. I don’t know of a specific NLB for P2P. Take a look at the first whitepaper I mentioned in the blog – it links out to more info. Also look at the last example in the 2nd whitepaper.
Another drawback for transactional replication being a true, flexible HA solution is that you have to have a schema control of the database, i.e. each table requires a primary key. So it impossible (difficult) to use it for a vendor-close databases.
One the other hand I saw a big multinational company successfully using a MERGE replication for its Web site database (main factor for merge being use of a custom conflict resolution procedures).
One thing I always discuss with clients when looking at DR is which is more important, having all the system available as soon as possible, having all the data available as soon as possible.
I’m sure that replication can be used for HA, if (and that is a big if) you know what you are doing. Most of us can handle this, given the time and understanding of the app. We can set up a test install, compare the schemas, check the data types, see whether the triggers are carried over. Users, permissions. Data types. Etc.
But this is over the head for most accidental DBAs. Just consider the fact that there is no checkbox when you setup replication (where you configure your articles) which states “best options for HA failover config”!
Imagine somebody who know little about TSQL, who has some ISV developed app, a black box to him/her, glancing over this list and try to determine what to select. I just counted for a snapshot replication setup and there are 36 options. There might be a few more or a few less for transactional, but that is beside the point. Somebody who isn’t TSQL fluent won’t even understand the meaning of half of these.
The further we go from a “binary copy”, the more effort it takes to get it right, and more responsibility for the DBA. Cluster is easiest, since this *is* the binary… original. Mirroring and log shipping is a bit more work since the DBA need to work out all dependencies of the “outside world”, which actually makes half my basic admins students scared.
So, sure, replication can be used for HA, but it is a lot more responsibility from the DBA to make it right. And if you don’t make it right, then you aren’t in a good spot (which applies of course for all technologies)…
(Aside: timestamp is not only a number. It carries functionality. It used to be that this is replicated as binary(8) (and there are still traces of this), meaning that your optimistic concurrency goes out the door, if you use timestamp for this. Apparently, default nowadays is to replicate timestamp as timestamp,, just thought I’d mention this…)
@Tibor Indeed – so you shouldn’t be setting up any technology if you don’t know what you’re doing. Replication can be pretty complex – but going back to my argument in the blog post, complexity does mean it should be disregarded and complexity does not mean it’s not an HA technology.
Ha Paul, your typo is showing. Your latest reply indicates replication should be disregarded because it’s complex. If you can, you might fix that post.
One of the things I didn’t like about transactional replication is that the identity property isn’t retained in the schema. Merge replication was more attractive to me because it made fail-over easier. If you are looking for a data repository, transactional replication is a great solution. Like you said, one size does not fit all.
Reminds me of one time I was asked to investigate a log shipping problem. Turns out the log files were stored on the same drive as the database and that drive was RAID 5 because it was a backup system. Gee, wonder why there was 48 hour latency?
I attended a pre-SQL Saturday event on Database Design and Implementation where Louis Davidson talked about how in gathering requirements you may say "Tree" your customer may hear "Tree" but they think "Sheep". And it is those miscommunications that cause a lot of problems.
Under High Availability I think Warm Standby and Hot Standby don’t often get flushed out or people say HA and assume that they are always talking about Hot Standby systems.
If your system can deal with the data properly, and you have P2P replication set up it could be a Hot Standby as long as your network is configured correctly.
Transactional replication, like Log shipping, could be viewed as more of a Warm Standby technology, as you would need some manual intervention to reverse the subscription in a possible DR/fail over scenario.
Like you said though it depends on SAL’s, RTO, RPO, and the clear communication with your customer so that when you say "Tree" they see "Tree".
Obviously a bit slow on getting to read this one but thought I would just say that I agree totally with Paul on this. One of the points made here is that replication will only replicate the data you are interested in. That is a big point when looking at VLDB’s with high activity and when a lot of that activity is not required for HA. With that there is the necessity to understand your data and data requirements but as data storage continues to increase this is a necessary trend rather than just something that is required when considering HA.
Thanks for the thoughts.
Bravo! Replication is just one of the techniques we might use to solve a problem and to be successful we have both under and over bias. I’ve used it a bit and if you take time to explore, it’s – in my view – fairly elegant for what it does. Kudos for taking on the discussion.
I see that you have said about a point on index rebuild.
Quick question – Does index rebuild logs propagate to the subscriber ?
If not, do we need a separate index optimization strategy on the subscriber?
How does this work in log shipping/mirroring ?
No – as replication just propagates logical operations. Yes. Log shipping/mirroring/AG are sending physical log records, not logical operations.
Thanks Paul. Good to know.