sqlskills-logo-2015-white.png

What are the Replication Agents waiting on? Accumulating wait stats by agent session

Consider the following scenario:

·        You have Transactional Replication deployed

·        Data is flowing, but just not as fast as you would like

·        This scenario could apply to local/remote distributors and push/pull subscribers

There are several different techniques we can use to narrow down where the replication performance issue is happening.  Perhaps you’ve already found that the performance issue is happening for log reader reads or distribution database writes.  Or perhaps you suspect the issue is on the subscriber? 

While the various replication techniques can help us narrow down the lagging member of the topology, I still would like more visibility into why a particular agent read or write process is performing more slowly.  Fortunately, you can do this in SQL Server 2008+…

In the following example, I’ll start by retrieving the session IDs of the log reader and distribution agents (and as an aside my replication topology is SQL Server instance version 10.50.2500):

— Log Reader

SELECT session_id, program_name,

             reads,

             writes,

             logical_reads

FROM sys.dm_exec_sessions

WHERE original_login_name =

       'SQLSKILLS\SQLskillsLogReaderAG';

— Distribution Agent

SELECT session_id, program_name,

             reads,

             writes,

             logical_reads

FROM sys.dm_exec_sessions

WHERE original_login_name =

       'SQLSKILLS\SQLskillsDistAGT';

In this example I’m using separate accounts to run the agent executables, however I could have also added a predicate on program_name based on the publication I was interested in evaluating.  For example, I could have said for the Log Reader agent – program_name = ‘Repl-LogReader-0-AdventureWorks2008R2-6’ and for the Distribution agent – that’s more interesting, as we have program_name = ‘CAESAR-AdventureWorks2008R2-Pub_AW_2008R2-AUGUSTUS-1’ (subscriber is AUGUSTUS, publisher is CAESAR).  But if you just used that program name, you won’t get Replication Distribution History session, which would be program_name = ‘Replication Distribution History’ and may also be interesting.

So in my example, I have 5 different sessions I’m interested in (and yours will vary based on the number of published databases, independent agents, server role, etc):

·        The log reader agent was using sessions 55, 57, 59

·        The distribution agent had two sessions (61 for history and 62 for the executable)

Now that I have my session ids, I’m going to create an extended events session that I can run during the “slow performing” period to help illuminate where to investigate next (and for more general discussion on this technique, see Paul Randal’s post “Capturing wait stats for a single operation”):

CREATE EVENT SESSION Replication_AGT_Waits

ON SERVER

ADD EVENT sqlos.wait_info(

       ACTION (sqlserver.session_id)

    WHERE ([package0].[equal_uint64]([sqlserver].[session_id],(55)) OR [package0].[equal_uint64]([sqlserver].[session_id],(57)) OR [package0].[equal_uint64]([sqlserver].[session_id],(59)) OR [package0].[equal_uint64]([sqlserver].[session_id],(61)) OR [package0].[equal_uint64]([sqlserver].[session_id],(62)))),

ADD EVENT sqlos.wait_info_external(

       ACTION (sqlserver.session_id)

    WHERE ([package0].[equal_uint64]([sqlserver].[session_id],(55)) OR [package0].[equal_uint64]([sqlserver].[session_id],(57)) OR [package0].[equal_uint64]([sqlserver].[session_id],(59)) OR [package0].[equal_uint64]([sqlserver].[session_id],(61)) OR [package0].[equal_uint64]([sqlserver].[session_id],(62))))

ADD TARGET package0.asynchronous_file_target

       (SET FILENAME = N'C:\temp\ReplAGTStats.xel',

        METADATAFILE = N'C:\temp\ReplAGTStats.xem')

GO

After creating the session, I’ll start.  In my test, I just ran the slow performing workload against one of the published tables, launched Replication Monitor, waited for the rows to arrive at the subscriber and then stopped the event session:

ALTER EVENT SESSION Replication_AGT_Waits

ON SERVER STATE = START;

— Run representative replication workload against publisher

— Launch Monitor and wait for all trans to be fully distributed

ALTER EVENT SESSION Replication_AGT_Waits

ON SERVER STATE = STOP;

Next, I created two intermediate temp tables to start going through the collected data:

— Raw data into intermediate table

SELECT CAST(event_data as XML) event_data

INTO #ReplicationAgentWaits_Stage_1

FROM sys.fn_xe_file_target_read_file

             ('C:\temp\ReplAGTStats*.xel',

              'C:\temp\ReplAGTStats*.xem',

              NULL, NULL)

— Aggregated data into intermediate table

— #ReplicationAgentWaits       

SELECT

       event_data.value

       ('(/event/action[@name=''session_id'']/value)[1]', 'smallint') as session_id,

       event_data.value

       ('(/event/data[@name=''wait_type'']/text)[1]', 'varchar(100)') as wait_type,

       event_data.value

       ('(/event/data[@name=''duration'']/value)[1]', 'bigint') as duration,

       event_data.value

       ('(/event/data[@name=''signal_duration'']/value)[1]', 'bigint') as signal_duration,

       event_data.value

       ('(/event/data[@name=''completed_count'']/value)[1]', 'bigint') as completed_count

INTO #ReplicationAgentWaits_Stage_2

FROM #ReplicationAgentWaits_Stage_1;

Then I took a look at how things broke out by session_id:

SELECT session_id,

             wait_type,

             SUM(duration) total_duration,

             SUM(signal_duration) total_signal_duration,

             SUM(completed_count) total_wait_count

FROM #ReplicationAgentWaits_Stage_2

GROUP BY session_id,

             wait_type

ORDER BY session_id,

             SUM(duration) DESC;

Here were the results:

clip_image002

Session 55, 57 and 59 were my log agent sessions.  Just looking at session 57 (highlighted in purple), we see that IO_COMPLETION had the highest wait duration.  If I check out the accumulated reads from sys.dm_exec_sessions for that session, I see it is doing all reads, whereas session 59 was doing all writes (so we can start mapping to the agent thread roles). 

Session id 61 (in yellow) represented the Replication Distribution History process and session id 62 (in green) represented the distribution agent process.  As we can see for 62 – the longest duration was due to NETWORK_IO.  We also see a similar value from PREEMPTIVE_OS_WAITFORSINGLEOBJECT (and if you think that these seem correlated, indeed this preemptive wait type is seen in conjunction with the network waittype wait).

So what would we see at the subscriber side?  For this specific scenario, I saw the following (using the session of my distribution agent account) which was session id 55:

clip_image003

In this case, the top wait (by duration) was WRITELOG on the subscriber for the CAESAR_AdventureWorks2008R2_Pub_AW_2008R2 distribution agent process – although the number was not very high.

So if you’re experiencing slow replication, you may consider this additional technique in order to help further identify where the bottlenecks may be in the topology and also get initial ideas on why this may be. 

7 thoughts on “What are the Replication Agents waiting on? Accumulating wait stats by agent session

  1. Hi Joe,

    Great information.

    Based on the results, where is the bottleneck and how do you fix it?

    Thanks.

    Tim

  2. Thanks Al and Tim!

    Tim, regarding the bottleneck, in this scenario I would prioritize the IO_COMPLETION of session 57 (look at I/O path related considerations). So the key thing is to use the Waits and Queues methodology as you would with any other session – only in this case, we’ve narrowed it down to our agent sessions. Make sense?

Comments are closed.

Other articles

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.