SQL Server Replication

New Course: “SQL Server: Transactional Replication Fundamentals”

Joseph Sack — Fri, 21 Jun 2013 17:08:12 +0000

Today Pluralsight published my new course, “SQL Server: Transactional Replication Fundamentals.”

This course provides a fundamental overview of how to configure, monitor, tune and troubleshoot a SQL Server transactional replication topology. Transactional replication meets a few specific data distribution requirements that no other native SQL Server feature does out-of-the-box. Even if you’re not a fan of this feature (and I’ve met quite a few), if you are responsible for architecting SQL Server data-distribution solutions it is helpful to understand how transactional replication fits into the overall scalability, high availability and disaster recovery feature landscape.

The post New Course: “SQL Server: Transactional Replication Fundamentals” appeared first on Joe Sack.

Transactional Replication Publications and Availability Groups

Joseph Sack — Wed, 22 Aug 2012 06:50:44 +0000

Books Online documents a few scenarios regarding Replication and Availability Group interoperability. Today I tested out the process detailed here:

Configure Replication for AlwaysOn Availability Groups (SQL Server)

It worked as advertised. I tested this on a five-replica AG topology with three synchronous replicas (including the primary) and two asynchronous replicas. I won’t rehash the BOL steps – but I did want to mention a few observations about the process:

· One of my AG replicas was hosted on the same SQL Server instance as my subscription database (a non-AG database), so I skipped the sys.sp_addlinkedserver step for that particular SQL Server instance. Collocation of the primary replica and subscriber worked fine.
· While it is possible to make one of your participating replica SQL Server instances the distributor, it doesn’t make sense to do so from an HA/DR perspective. But if your distributor is indeed remote and not collocated with the AG replicas, think about FCIs for providing HA.
· The publications show up in SQL Server Management Studio under the Replication\Local Publications folder. Hovering over the publication from a secondary replica will still show a yellow (tooltip-like) dialog box showing the original SQL Server instance where you created the publication – even if that replica is currently a secondary.
· The New Publication Wizard doesn’t stop you from creating a Peer-to-Peer publication for an availability database, even though this combo is not supported by Microsoft. I didn’t finish P2P configuration – but now I’m curious if it actually works (even though it wouldn’t have support).
· Deleting a publication for an availability database raises the error 18752 “Only one Log Reader Agent or log-related procedure (sp_repldone, sp_replcmds, and sp_replshowcmds) can connect to a database at a time”. This error was repeatable with or without existing subscribers. The error is also followed by a “change database context to” message. Even after the message, the publication does indeed get removed. This message is seen both with the GUI and with sp_droppublication. I’ll likely put out a Connect item on this one (I didn’t see one that matched my scenario).

Why consider replication when you have AG readable secondaries? There are several use cases that I could think of – for example, if you want to have a subset of the overall data and use customized indexing on the subscriber. Another case would be to have access to replicated data if there is an outage of the AG.

I’m going to write about testing the AG subscriber scenario in another post.


Replication Extended Events, Not a Tool in your Toolbox (Yet)

Joseph Sack — Tue, 21 Aug 2012 03:42:22 +0000

There are already a number of data sources you can reference when investigating replication issues. One data source on my wish list was to have a one-stop shop in Extended Events similar to the AlwaysOn Health Session.

It turns out that SQL Server 2012 does have a few new replication related events, but don’t get too excited… Books Online manages our expectations in the following text (underlined text added by me):

“Replication supports Extended Events, however, this feature is for internal use only at this time. Replication extended events were added to help customer support engineers collect information to troubleshoot replication problems. The information collected is not useful for replication performance tuning or monitoring.”

There was a dash of hope in the “at this time” qualifier, but that was the only good news I could get from this. But even then, I wanted to be absolutely sure that there were truly no hidden diagnostic data sources that could be leveraged for replication issues.

I found the following potentially promising events in sys.dm_xe_objects:

repl_event
logreader_process_text_info
logreader_process_text_ptr
logreader_process_filestream_info
logreader_add_compensation_range
logreader_add_eor
logreader_apply_filter_proc
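
A minimal sketch of how a list like this could be pulled from the Extended Events metadata. The LIKE patterns are my assumption about which names count as replication related, so other events may exist that this filter does not catch:

```sql
-- List Extended Events whose names suggest replication or the log reader.
-- The name patterns below are an assumption, not an official category.
SELECT p.name        AS package_name,
       o.name        AS event_name,
       o.description AS event_description
FROM sys.dm_xe_objects AS o
JOIN sys.dm_xe_packages AS p
    ON p.guid = o.package_guid
WHERE o.object_type = N'event'
  AND (o.name LIKE N'repl%' OR o.name LIKE N'logreader%')
ORDER BY o.name;
```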

The first one, “repl_event”, is the one I decided to investigate today. It is described in sys.dm_xe_objects as “Occurs when sp_repl_generateevent is called. this event is an internal repl event for tracing repl stored procedures. The data that is returned from user_event includes the event_id that was specified in the call to sp_repl_generateevent. This can be a value between x and y.” I created the following event session for it:

CREATE EVENT SESSION [repl_event] ON SERVER 
ADD EVENT sqlserver.repl_event 
ADD TARGET package0.ring_buffer
WITH (MAX_MEMORY=4096 KB,EVENT_RETENTION_MODE=ALLOW_SINGLE_EVENT_LOSS,MAX_DISPATCH_LATENCY=30 SECONDS,MAX_EVENT_SIZE=0 KB,MEMORY_PARTITION_MODE=NONE,TRACK_CAUSALITY=OFF,STARTUP_STATE=ON);
GO

I started this event session and set up transactional replication (configured the distributor, simple publication, one subscriber). No events were triggered when I did this.

I then investigated the sp_repl_generateevent procedure itself. Trying to look at its definition just returned the message “replgenerateevent extended procedure”.

Which objects reference sp_repl_generateevent? All I could find was sys.sp_MSaddmergetriggers_internal. The referencing section of the system stored procedure was as follows:

select @command2 = '
       -- update any ppm row that already exist with this gen
       update ppm set ppm.generation = case when @is_mergeagent = 1 then 0 else @newgen end
       from ' + @quoted_past_mappings_viewname + ' ppm with (rowlock) inner join deleted v
       on ppm.tablenick =@tablenick and ppm.rowguid = v.' + @quoted_rgcol + '
       -- insert the past partition mapping into gen 0 if this is the merge agent
      insert into ' + @quoted_past_mappings_viewname + ' with (rowlock) (publication_number, tablenick, rowguid, partition_id, generation,reason)
      select distinct ' + convert(nvarchar(100), @publication_number) + ', @tablenick, v.' + @quoted_rgcol + ', v.partition_id, case when @is_mergeagent = 1 then 0 else @newgen end, 1
      from ( ' + @partition_deleted_view_rule + ' ) as v
      if (@@ROWCOUNT <= 0)
      begin
           select @xe_message = CAST(''replica_id: '' + convert(nvarchar(100), @replnick) + '', article_id: '' + convert(nvarchar(100), @tablenick) + '', rowguid: '' + case when @article_rows_deleted = 1 then convert(nvarchar(100), @rowguid) else N''0'' end + '', generation: '' + case when @is_mergeagent = 1 then N''0'' else convert(nvarchar(100), @newgen) end + '', Reason: -1'' AS varbinary(1000));
        exec master..sp_repl_generateevent 1, N''Event : ppm_insert'', @xe_message
      end
      '
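
For anyone who wants to repeat the search for referencing objects, here is one possible approach. It is a sketch that assumes a plain text search of module definitions is good enough; it is not necessarily how the search above was done, and it will miss references built up through dynamic string concatenation:

```sql
-- Search system and user module definitions for a reference to the procedure.
SELECT o.name      AS module_name,
       o.type_desc AS module_type
FROM sys.all_sql_modules AS m
JOIN sys.all_objects AS o
    ON o.object_id = m.object_id
WHERE m.definition LIKE N'%sp_repl_generateevent%'
ORDER BY o.name;
```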

Since I was on a test SQL Server instance, I thought I would try a direct call to this procedure, just to confirm that it was indeed hooked to repl_event:

DECLARE @xe_message varbinary(1000) = 
    CAST('Event payload' AS varbinary(1000));

EXEC sp_repl_generateevent 1, N'Event: Am I captured?', @xe_message;

Sure enough – the repl_event was fired:
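
One way to check this for yourself is to read the session's ring buffer target. This is a sketch that assumes the session is still running and keeps the name repl_event used above:

```sql
-- Return the ring buffer contents for the repl_event session as XML.
-- Each captured event appears as an <event> element inside the document.
SELECT CAST(t.target_data AS xml) AS ring_buffer_xml
FROM sys.dm_xe_sessions AS s
JOIN sys.dm_xe_session_targets AS t
    ON t.event_session_address = s.address
WHERE s.name = N'repl_event'
  AND t.target_name = N'ring_buffer';
```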

Whether repl_event gets leveraged in the future, we’ll see. If Microsoft implements this in the future, my wish list would include the following events (and knowing that we can capture these in other areas – but again I’m interested in a consolidated session):

Replication configuration events (creation, dropping, changes)
Agent statistics, like periodic reader/writer thread latency statistics
Subscription expirations
Conflicts
Failed replication jobs and retries
Data sync warnings
Interoperability events (for example – database mirroring failovers of the publication database)

I may investigate the logreader_* related events at some point, but based on the naming and descriptions of these events I don’t see significant use cases at this time.

If you run across any other replication related events that you find useful, please share your comments on this post. Thanks!


SQL Server Pro article–“Getting Started with Transactional Replication”

Joseph Sack — Thu, 19 Jul 2012 00:33:33 +0000

I wrote a beginner’s article for Transactional Replication which was published in the July 2012 edition of SQL Server Pro:

“Getting Started with Transactional Replication”

My last article with them (at the time SQL Server Magazine) was published way back in September 2002!

It was my very first published item – called “Put the Hammer Down.” This was around the time that I caught the writing bug and realized that it was a complementary activity to SQL Server consulting. It was also interesting to look at that old article and see where my opinions have shifted over time (for example – I don’t pay attention to average disk queue length anymore and the whole separation of data from log files is a much more nuanced discussion).

But that one article back in 2002 got me started on the authoring and editing path – and I was thankful for that first opportunity.


The Transactional Replication Multiplier Effect

Joseph Sack — Thu, 23 Feb 2012 08:49:00 +0000

This post idea was prompted by a discussion I had this week with Jonathan Kehayias about an environment that had multiple transactional replication publications defined with overlapping table articles.

In other words, a table was defined as an article in more than one publication.

While I can think of some cases where you would want to leverage different article options or filters, in this particular case the articles had no differences in how they were defined. I’ve seen this in other environments in the past – and as I recall it wasn’t a conscious decision, but rather a lack of coordination across application teams and projects.

For small databases with lower volumes of modifications, this overlap could likely go unnoticed. For larger tables with high amounts of data modifications, well, consider the following scenario:

· You have two transactional replication publications that each reference the same table as an article. No other article properties are changed between the two publications and articles.

· Each publication maps to a single subscriber.

· Your table article settings for this scenario use the defaults – propagating INSERTs, UPDATEs and DELETEs via the default statement delivery method (the sp_MSins_ / sp_MSupd_ / sp_MSdel_ procedures). (And while we are propagating changes made directly to the table, we’re not using stored procedure execution articles.)

So let’s say we execute the following single statement batch update against the redundantly published table. This is one statement that updates 3,120 rows:

UPDATE dbo.charge

SET charge_amt = charge_amt * .97

WHERE provider_no = 386;

If we used sp_replcmds in the publisher database (I had the log reader agent job stopped in order to step through the scenario), how many command transactions would you expect to see marked for replication?

The answer is – 6,240. One call per row updated, multiplied by two separate publications (and we’re still only in the publication database):

And as you may expect, those 6,240 rows move on to the distribution database (you can validate via sp_browsereplcmds or MSrepl_commands):
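
As a rough sketch of that validation, assuming the distribution database uses the default name distribution, a grouped count over MSrepl_commands shows how many commands are queued per article. Run it before the distribution cleanup job removes the rows:

```sql
-- Count pending commands per published database and article at the Distributor.
-- Assumes the default distribution database name.
USE distribution;

SELECT c.publisher_database_id,
       c.article_id,
       COUNT(*) AS command_count
FROM dbo.MSrepl_commands AS c
GROUP BY c.publisher_database_id,
         c.article_id
ORDER BY command_count DESC;
```

In the two-publication scenario, the same table shows up under two article_id values, each carrying its own copy of the 3,120 commands.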

Now had you instead just created ONE publication with that article sent to the two different subscribers, you would see just 3,120 in the publication database for the original update – and 3,120 as well at the distributor prior to multicasting the update to the two subscribers.
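
The single-publication alternative might be scripted roughly as follows. This is only a sketch: the publication name, subscriber names and destination databases are hypothetical, and it assumes the database is already enabled for publishing with a distributor configured. The snapshot agent and push agent setup (sp_addpublication_snapshot, sp_addpushsubscription_agent) is left out for brevity:

```sql
-- Run at the Publisher, in the publication database.
-- All names below are hypothetical placeholders.
EXEC sp_addpublication
    @publication = N'Pub_Charge',
    @repl_freq   = N'continuous',
    @status      = N'active';

-- Publish the table once.
EXEC sp_addarticle
    @publication   = N'Pub_Charge',
    @article       = N'charge',
    @source_owner  = N'dbo',
    @source_object = N'charge';

-- Attach both subscribers to the same publication.
EXEC sp_addsubscription
    @publication       = N'Pub_Charge',
    @subscriber        = N'SUBSCRIBER_A',
    @destination_db    = N'ReportingA',
    @subscription_type = N'push';

EXEC sp_addsubscription
    @publication       = N'Pub_Charge',
    @subscriber        = N'SUBSCRIBER_B',
    @destination_db    = N'ReportingB',
    @subscription_type = N'push';
```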

Coupled with the already “chatty” nature of transactional replication – you can imagine scenarios where performance rapidly degrades for large batch updates, particularly on already-constrained topologies.


What are the Replication Agents waiting on? Accumulating wait stats by agent session

Joseph Sack — Wed, 15 Feb 2012 06:25:00 +0000

Consider the following scenario:

· You have Transactional Replication deployed

· Data is flowing, but just not as fast as you would like

· This scenario could apply to local/remote distributors and push/pull subscribers

There are several different techniques we can use to narrow down where the replication performance issue is happening. Perhaps you’ve already found that the performance issue is happening for log reader reads or distribution database writes. Or perhaps you suspect the issue is on the subscriber?

While the various replication techniques can help us narrow down the lagging member of the topology, I still would like more visibility into why a particular agent read or write process is performing more slowly. Fortunately, you can do this in SQL Server 2008+…

In the following example, I’ll start by retrieving the session IDs of the log reader and distribution agents (and as an aside my replication topology is SQL Server instance version 10.50.2500):

-- Log Reader
SELECT session_id, program_name,
       reads,
       writes,
       logical_reads
FROM sys.dm_exec_sessions
WHERE original_login_name =
      'SQLSKILLS\SQLskillsLogReaderAG';

-- Distribution Agent
SELECT session_id, program_name,
       reads,
       writes,
       logical_reads
FROM sys.dm_exec_sessions
WHERE original_login_name =
      'SQLSKILLS\SQLskillsDistAGT';

In this example I’m using separate accounts to run the agent executables; however, I could have also added a predicate on program_name based on the publication I was interested in evaluating. For example, for the Log Reader agent I could have used program_name = 'Repl-LogReader-0-AdventureWorks2008R2-6'. The Distribution agent is more interesting, as we have program_name = 'CAESAR-AdventureWorks2008R2-Pub_AW_2008R2-AUGUSTUS-1' (subscriber is AUGUSTUS, publisher is CAESAR). But if you used just that program name, you wouldn’t get the Replication Distribution History session, which has program_name = 'Replication Distribution History' and may also be interesting.
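
A sketch of that program_name variant, using the three program names quoted above. Your own agent names will differ, so treat these literals as placeholders taken from my topology:

```sql
-- Find agent sessions by program_name instead of by login.
SELECT session_id, program_name,
       reads,
       writes,
       logical_reads
FROM sys.dm_exec_sessions
WHERE program_name IN
      (N'Repl-LogReader-0-AdventureWorks2008R2-6',
       N'CAESAR-AdventureWorks2008R2-Pub_AW_2008R2-AUGUSTUS-1',
       N'Replication Distribution History');
```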

So in my example, I have 5 different sessions I’m interested in (and yours will vary based on the number of published databases, independent agents, server role, etc):

· The log reader agent was using sessions 55, 57, 59

· The distribution agent had two sessions (61 for history and 62 for the executable)

Now that I have my session ids, I’m going to create an extended events session that I can run during the “slow performing” period to help illuminate where to investigate next (and for more general discussion on this technique, see Paul Randal’s post “Capturing wait stats for a single operation”):

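CREATE EVENT SESSION Replication_AGT_Waits
ON SERVER
ADD EVENT sqlos.wait_info(
    ACTION (sqlserver.session_id)
    WHERE ([package0].[equal_uint64]([sqlserver].[session_id],(55)) OR [package0].[equal_uint64]([sqlserver].[session_id],(57)) OR [package0].[equal_uint64]([sqlserver].[session_id],(59)) OR [package0].[equal_uint64]([sqlserver].[session_id],(61)) OR [package0].[equal_uint64]([sqlserver].[session_id],(62)))),
ADD EVENT sqlos.wait_info_external(
    ACTION (sqlserver.session_id)
    -- Apply the same session_id filter to external (preemptive) waits
    WHERE ([package0].[equal_uint64]([sqlserver].[session_id],(55)) OR [package0].[equal_uint64]([sqlserver].[session_id],(57)) OR [package0].[equal_uint64]([sqlserver].[session_id],(59)) OR [package0].[equal_uint64]([sqlserver].[session_id],(61)) OR [package0].[equal_uint64]([sqlserver].[session_id],(62))))
ADD TARGET package0.asynchronous_file_target
    (SET FILENAME = N'C:\temp\ReplAGTStats.xel',
     METADATAFILE = N'C:\temp\ReplAGTStats.xem');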

After creating the session, I’ll start it. In my test, I just ran the slow performing workload against one of the published tables, launched Replication Monitor, waited for the rows to arrive at the subscriber and then stopped the event session:

ALTER EVENT SESSION Replication_AGT_Waits
ON SERVER STATE = START;

-- Run representative replication workload against publisher
-- Launch Monitor and wait for all trans to be fully distributed

ALTER EVENT SESSION Replication_AGT_Waits
ON SERVER STATE = STOP;

Next, I created two intermediate temp tables to start going through the collected data:

-- Raw data into intermediate table
SELECT CAST(event_data as XML) event_data
INTO #ReplicationAgentWaits_Stage_1
FROM sys.fn_xe_file_target_read_file
('C:\temp\ReplAGTStats*.xel',
 'C:\temp\ReplAGTStats*.xem',
 NULL, NULL);

-- Shredded data into intermediate table
-- #ReplicationAgentWaits_Stage_2
SELECT
event_data.value
('(/event/action[@name=''session_id'']/value)[1]', 'smallint') as session_id,
event_data.value
('(/event/data[@name=''wait_type'']/text)[1]', 'varchar(100)') as wait_type,
event_data.value
('(/event/data[@name=''duration'']/value)[1]', 'bigint') as duration,
event_data.value
('(/event/data[@name=''signal_duration'']/value)[1]', 'bigint') as signal_duration,
event_data.value
('(/event/data[@name=''completed_count'']/value)[1]', 'bigint') as completed_count
INTO #ReplicationAgentWaits_Stage_2
FROM #ReplicationAgentWaits_Stage_1;

Then I took a look at how things broke out by session_id:

SELECT session_id,

wait_type,

SUM(duration) total_duration,

SUM(signal_duration) total_signal_duration,

SUM(completed_count) total_wait_count

FROM #ReplicationAgentWaits_Stage_2

GROUP BY session_id,

wait_type

ORDER BY session_id,

SUM(duration) DESC;

Here were the results:

Sessions 55, 57 and 59 were my log reader agent sessions. Just looking at session 57 (highlighted in purple), we see that IO_COMPLETION had the highest wait duration. If I check out the accumulated reads from sys.dm_exec_sessions for that session, I see it is doing all reads, whereas session 59 was doing all writes (so we can start mapping to the agent thread roles).

Session id 61 (in yellow) represented the Replication Distribution History process and session id 62 (in green) represented the distribution agent process. As we can see for 62 – the longest duration was due to NETWORK_IO. We also see a similar value from PREEMPTIVE_OS_WAITFORSINGLEOBJECT (and if you think that these seem correlated, indeed this preemptive wait type is seen in conjunction with the network wait type).

So what would we see at the subscriber side? For this specific scenario, I saw the following (using the session of my distribution agent account) which was session id 55:

In this case, the top wait (by duration) was WRITELOG on the subscriber for the CAESAR_AdventureWorks2008R2_Pub_AW_2008R2 distribution agent process – although the number was not very high.

So if you’re experiencing slow replication, you may consider this additional technique in order to help further identify where the bottlenecks may be in the topology and also get initial ideas on why this may be.


When is the Publication Access List required?

Joseph Sack — Wed, 25 Jan 2012 01:36:00 +0000

Update: ** Make sure to check out the comments at the end of this post. There are some interesting differences in behavior between transactional replication (pull/push subscribers) versus merge replication's behavior. **

Yesterday I was working on implementing transactional replication with the goal of limiting the permissions each replication account ran under. I created three separate domain accounts for the snapshot, log reader and distribution agents. These accounts had no other permissions before I began:

· I created logins on the publisher and distributor (in this case, the same SQL Server instance) and I added the snapshot and log reader agent accounts to the db_owner role of the distribution and publisher databases.

· This was a push subscription, so I also added the distribution agent to the db_owner role for the distribution database, but I did not grant it access to the publication database. I did make the distribution agent a member of db_owner for the subscription database (which was located on a separate server and default instance).

· I gave the snapshot agent “write” permissions and the distribution agent “read” permissions to the snapshot share.
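
The database-level part of the setup above could be scripted along these lines. Treat it as a sketch under assumptions: only the distribution agent account name appears in this post, so the snapshot and log reader account names are placeholders, as are the default distribution database name and the AdventureWorks2008R2 publication database. sp_addrolemember is used because it works on SQL Server 2008 R2:

```sql
-- At the publisher/distributor instance. Account and database names are assumptions.
CREATE LOGIN [SQLskills\SQLskillsSnapshotAGT] FROM WINDOWS;  -- hypothetical name
CREATE LOGIN [SQLskills\SQLskillsLogReaderAG] FROM WINDOWS;  -- hypothetical name
CREATE LOGIN [SQLskills\SQLskillsDistAGT] FROM WINDOWS;
GO

USE distribution;  -- assumed default distribution database name
CREATE USER [SQLskills\SQLskillsSnapshotAGT] FOR LOGIN [SQLskills\SQLskillsSnapshotAGT];
CREATE USER [SQLskills\SQLskillsLogReaderAG] FOR LOGIN [SQLskills\SQLskillsLogReaderAG];
CREATE USER [SQLskills\SQLskillsDistAGT] FOR LOGIN [SQLskills\SQLskillsDistAGT];
EXEC sp_addrolemember N'db_owner', N'SQLskills\SQLskillsSnapshotAGT';
EXEC sp_addrolemember N'db_owner', N'SQLskills\SQLskillsLogReaderAG';
EXEC sp_addrolemember N'db_owner', N'SQLskills\SQLskillsDistAGT';
GO

USE AdventureWorks2008R2;  -- assumed publication database
CREATE USER [SQLskills\SQLskillsSnapshotAGT] FOR LOGIN [SQLskills\SQLskillsSnapshotAGT];
CREATE USER [SQLskills\SQLskillsLogReaderAG] FOR LOGIN [SQLskills\SQLskillsLogReaderAG];
EXEC sp_addrolemember N'db_owner', N'SQLskills\SQLskillsSnapshotAGT';
EXEC sp_addrolemember N'db_owner', N'SQLskills\SQLskillsLogReaderAG';
GO
```

The distribution agent account is deliberately not added to the publication database here, matching the push subscription setup described above. Its db_owner membership in the subscription database would be granted the same way on the subscriber instance.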

By the way, all this talk of db_owner makes it sound like I wasn’t limiting permissions all that much; however this fixed database role membership is indeed a minimum requirement in this implementation. It’s also typically more restrained than what I’ve seen out in the field. Usually I’ll see the use of domain accounts with sysadmin used to manage everything in the replication topology. I don’t usually see a separate set of accounts configured for each agent role, nor do I see them set up for each unique topology (for very large environments, the administrative overhead may not make this a practical choice – but that’s another discussion altogether).

I did leave out one step though – and I’ll get to that in a moment. After applying the permissions I described, I set up the new publication and new subscription, and the data flowed correctly with no issues and no sysadmin permissions required.

The step I specifically left out was adding the distribution agent to the Publication Access List (PAL). According to Books Online, “Access to a publication is controlled by the publication access list (PAL).” Also according to Books Online, the distribution agent for a push subscription must “Be a member of the PAL.” I wondered why. And if this is such a key area – why don’t we hear much discussion of the PAL? If you search the replication forums, you’ll find very few questions about it (searching today, I found 26 loosely related threads). Either this means that most shops use high privilege accounts and haven’t pushed further to find out the role of the PAL – or the PAL’s role isn’t entirely what it seems to be (as it’s described, it seems to suggest that the distribution agent account needs membership in order to synchronize).

Now if it must be a member, why was transactional replication working properly (rows were moving fine from publisher to distributor and distributor to subscriber)?

My first assumption was that I had missed something, or that the distribution agent account was somehow getting implied permissions through group membership.

The first thing I validated was the current PAL list of accounts (looking explicitly for my distribution agent account – called SQLskills\SQLskillsDistAGT). Looking at the PAL, this account had NOT been explicitly granted membership somehow through other activities:
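
One way to list the PAL without the GUI is sp_help_publication_access, run at the publisher in the publication database. In this sketch the publication name is a placeholder, since the post does not name it:

```sql
-- Run at the Publisher, in the publication database.
-- N'Pub_AW_2008R2' is an assumed publication name; substitute your own.
EXEC sp_help_publication_access
    @publication = N'Pub_AW_2008R2';
```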

Perhaps SQLskills\SQLskillsDistAGT was gaining access through group membership? Seemed unlikely to me, but I checked nonetheless by using EXECUTE AS LOGIN and querying the sys.login_token to see the groups associated with that account:

I didn’t see any connections or group memberships that would map to the PAL.
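
The group membership check described above can be sketched like this. It assumes you are connected with enough privilege to impersonate the login:

```sql
-- Impersonate the distribution agent login and list its login tokens,
-- which include any Windows groups and server roles it maps to.
EXECUTE AS LOGIN = N'SQLskills\SQLskillsDistAGT';

SELECT name, type, usage
FROM sys.login_token;

REVERT;
```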

My next thought was to examine the SQL Server Agent job and ensure it really was running under the context of SQLskills\SQLskillsDistAGT. The SQL Server Agent Job for the distribution agent was owned by the Administrator account, but the job step itself was running as the SQLskills\SQLskillsDistAGT proxy:

The proxy maps to a security credential, which in this case was my SQLskills\SQLskillsDistAGT account. I validated the mapping by querying sys.credentials (checking the credential_identity column):

So the mapping was what I expected.
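
A sketch of that proxy-to-credential check, joining the SQL Server Agent proxy list in msdb to sys.credentials. It returns every proxy on the instance, so look for the row used by the distribution agent job step:

```sql
-- Show which Windows identity each SQL Server Agent proxy runs as.
SELECT p.name AS proxy_name,
       c.name AS credential_name,
       c.credential_identity
FROM msdb.dbo.sysproxies AS p
JOIN sys.credentials AS c
    ON c.credential_id = p.credential_id;
```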

But was the job really connected as that account? I ran a few test transactions at the publisher and again confirmed that rows were flowing to the subscriber. I then queried sys.dm_exec_sessions for the distribution agent session, checking the login name and running a few times to ensure it was incrementing the logical reads:

Logical reads were incrementing and the job was indeed running under the account.
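
The session check might look like the following sketch. Run it several times while transactions are flowing and compare logical_reads between runs:

```sql
-- Confirm the distribution agent is connected under the expected account
-- and that its logical reads keep climbing between executions.
SELECT session_id,
       login_name,
       original_login_name,
       program_name,
       logical_reads
FROM sys.dm_exec_sessions
WHERE login_name = N'SQLskills\SQLskillsDistAGT';
```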

So where are we? Basically, I could find no connection whatsoever between the PAL membership and my distribution agent account.

So because I wanted to be absolutely sure (and because this was a test environment) I removed all accounts from the PAL (including “sa”). I did so one-by-one, testing to see if it broke replication. And guess what? Replication just kept on working. I even restarted the agents to see if it would initiate some kind of challenge-response, and it did not.

So is PAL access required? And if so, what is the boundary of that requirement?

I logged off of my Administrator account and logged in to the publisher/distributor SQL Server instance as the SQLskills\SQLskillsDistAGT. I then opened up SSMS and looked to see if I could view the publication:

No publications to be seen, even though this account is actually responsible for running the distribution agent and is doing so successfully.

I then jumped back on my Administrator account and first added SQLskills\SQLskillsDistAGT to the public role of the publication database (required in order to be seen in PAL) and then I added SQLskills\SQLskillsDistAGT to the PAL:

After doing this, I logged back in as the distribution agent account, and sure enough, I could now “see” the publication (and also launch a new subscription, more importantly).
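
In script form, the two steps above might look like this sketch. The database and publication names are placeholders, since the post does not name them:

```sql
-- Run at the Publisher. Database and publication names are assumptions.
USE AdventureWorks2008R2;

-- A database user (implicitly a member of public) is needed
-- before the login can be added to the PAL.
CREATE USER [SQLskills\SQLskillsDistAGT]
    FOR LOGIN [SQLskills\SQLskillsDistAGT];

-- Add the login to the publication access list.
EXEC sp_grant_publication_access
    @publication = N'Pub_AW_2008R2',
    @login       = N'SQLskills\SQLskillsDistAGT';
```

The reverse operation, used when I removed accounts from the PAL one by one, is sp_revoke_publication_access with the same two parameters.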

So it now made sense why the PAL wasn’t the talk of the town. Most DBAs I’ve worked with set up replication with their own high privilege credentials – even when designating other credentials for the replication agents. Once they do, the agents work as advertised. It’s when the agent account wishes to participate independently of the DBA that the PAL helps restrict the visibility of available publications.

If you’ve seen other variations or even contradictions related to the PAL – I’d love to hear about it. We can help flesh out some of the ambiguities around this feature on this post.
