Calling all user group leaders! We want to present for you again in 2018!

So far this year, we at SQLskills have presented remotely (and a few in-person) to 59 user groups and PASS virtual chapters around the world, and we have 17 more scheduled!

Now that we’re into the second half of 2018, we’d like to schedule more presentations through the rest of the year.

We’d love to present remotely for your user group in 2018, anywhere in the world. It’s not feasible for us to travel to user groups or SQL Saturdays unless we’re already in that particular city, but remote presentations are easy to do and are becoming more and more popular. We haven’t had any bandwidth problems doing remote presentations to groups as far away as South Africa, Australia, and New Zealand, plus Norway, Canada, UK, Poland, Belgium, Brazil, Czech Republic, and Ireland. This way we can spread the community love around user groups everywhere that we wouldn’t usually get to in person.

Note: we have our own Webex accounts which we generally use, or we can use your GoToMeeting or Webex, but you must use computer audio – we won’t call in by phone as the sound quality is too poor. We also will not use Skype/Lync as we’ve had too many problems with it around user group laptops and sound.

So, calling all user group leaders! If you’d like one of us (me, Kimberly, Jon, Erin, Glenn, Tim) to present remotely for you in 2018 (or maybe even multiple times), send me an email and be sure to include:

  • Details of which user group you represent (and if sending from a shared user group account, your name)
  • The usual day of the month, meeting time, and timezone of the user group
  • Which months you have available, starting in August 2018 (a list of available dates would be ideal)
  • Whether you’d like just one presentation or multiple

And I’ll let you know who’s available with what topics so you can pick. We have around 20 topics across the team that we can present on.

What’s the catch? There is no catch. We’re just continuing our community involvement this year and we all love presenting :-)

There’s no deadline for this – send me an email at any time and we’ll see what we can do.

We’re really looking forward to engaging with you all!

Cheers

What is the FCB_REPLICA_SYNC spinlock?

A question came up on the Data Platform MVP email list last night asking what the FCB_REPLICA_SYNC spinlock is. I answered the question and then promised to do a quick blog post, as there’s no information online about it that I could find.

Explanation

In a nutshell, this spinlock is used to synchronize access to the list of pages that are present in a database snapshot, as follows:

  • If a page in a database with one or more database snapshots is being updated, check each snapshot’s list to see if the page is already in the snapshot. If yes, nothing to do. If no, copy the pre-change image of the page into the snapshot.
  • If a query is reading a page in the context of a database snapshot, check the list of pages to see whether to read from the snapshot or the source database.

This synchronization ensures that the correct copy of a page is read by a query using the snapshot, and that updated pages aren’t copied to the snapshot more than once.
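
For context, here’s a minimal sketch of creating a database snapshot and querying it. The database name, logical file name, and path are just placeholders (the NAME must match the source database’s logical data file name):

-- Create a database snapshot of a source database (placeholder names and path)
CREATE DATABASE [SourceDB_Snapshot] ON
    (NAME = N'SourceDB_Data', FILENAME = N'D:\Data\SourceDB_Snapshot.ss')
AS SNAPSHOT OF [SourceDB];
GO

-- Queries against the snapshot read unchanged pages from the source database,
-- and pre-change copies of updated pages from the snapshot's sparse file
SELECT COUNT (*) FROM [SourceDB_Snapshot].[dbo].[SomeTable];
GO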

The original question arose because the person was seeing trillions of spins for the FCB_REPLICA_SYNC spinlock. That’s perfectly normal if there’s at least one database snapshot, a read workload on the snapshot, and a concurrent heavy update workload on the source database.

Example

For example, using our sample SalesDB database (zip file here), I created this query and set it running:

WHILE (1=1)
BEGIN
    UPDATE [SalesDB].[dbo].[Sales] SET [Quantity] = [Quantity] + 1;
END;
GO

Then I took my script to capture spinlock metrics over a period of time (see this post), changed it to capture over 20 seconds, and then ran a DBCC CHECKDB on the SalesDB database, which took 18 seconds.
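
If you don’t have that script to hand, here’s a simplified sketch of the general idea, which is to snapshot sys.dm_os_spinlock_stats, wait 20 seconds, and then diff the counters (my full script does more than this):

-- Capture spinlock stats, wait 20 seconds, then show the delta (simplified sketch)
IF OBJECT_ID (N'tempdb..#SpinlockStats') IS NOT NULL
    DROP TABLE [#SpinlockStats];

SELECT [name], [collisions], [spins], [sleep_time], [backoffs]
INTO [#SpinlockStats]
FROM sys.dm_os_spinlock_stats;

WAITFOR DELAY '00:00:20';

SELECT
    [ss2].[name] AS [Spinlock],
    [ss2].[collisions] - [ss1].[collisions] AS [DiffCollisions],
    [ss2].[spins] - [ss1].[spins] AS [DiffSpins],
    [ss2].[sleep_time] - [ss1].[sleep_time] AS [DiffSleepTime],
    [ss2].[backoffs] - [ss1].[backoffs] AS [DiffBackoffs]
FROM [#SpinlockStats] [ss1]
JOIN sys.dm_os_spinlock_stats [ss2]
    ON [ss2].[name] = [ss1].[name]
WHERE [ss2].[spins] > [ss1].[spins]
ORDER BY [DiffSpins] DESC;
GO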

The spinlock metrics returned were:

Spinlock                  DiffCollisions DiffSpins  SpinsPerCollision DiffSleepTime DiffBackoffs
------------------------- -------------- ---------  ----------------- ------------- ------------
BUF_HASH                  2              500        250               0             0
DBTABLE                   5              1250       250               0             0
FCB_REPLICA_SYNC          5716270        1513329500 264               0             154380
LOCK_HASH                 12             3500       291               0             1
LOGCACHE_ACCESS           6              387        64                0             3
LOGFLUSHQ                 4              75840      18960             0             3
LOGPOOL_HASHBUCKET        15             3750       250               0             0
LOGPOOL_SHAREDCACHEBUFFER 32             8000       250               0             0
LSLIST                    8              2000       250               0             0
SOS_SCHEDULER             3              1114       371               0             0
SOS_TASK                  1              356        356               0             0

You can see that even for a 20-second test, a single DBCC CHECKDB produced 1.5 billion spins on the FCB_REPLICA_SYNC spinlock.

This is perfectly normal.

One of the dangers of looking at spinlock metrics is that the numbers involved can be so high that it’s easy to convince yourself that there’s some kind of problem, especially as there’s so little information available online about what the spinlocks actually mean. The vast majority of the time, there’s no problem, but it takes a lot of internals knowledge to know what’s going on.

About the only thing I’d be concerned about is if there are multiple concurrent snapshots on a database with a heavy update workload, as that will cause synchronous writes to all the snapshots when a page in the source database is first updated, slowing down the workload.

Investigation

One thing you can always do if you’re interested in what a specific spinlock means is to investigate with Extended Events. There’s a whitepaper I helped review called Diagnosing and Resolving Spinlock Contention on SQL Server that you can download here. In it there’s an Extended Event session that I use to see where spinlock backoffs occur.

Here’s the session I used for FCB_REPLICA_SYNC (which maps to the type value of 136 in sys.dm_xe_map_values):

-- Drop the session if it exists. 
IF EXISTS (SELECT * FROM sys.server_event_sessions WHERE [name] = N'WatchSpinlocks')
    DROP EVENT SESSION [WatchSpinlocks] ON SERVER
GO

CREATE EVENT SESSION [WatchSpinlocks] ON SERVER
ADD EVENT [sqlos].[spinlock_backoff]
    (ACTION ([package0].[callstack])
    WHERE [type] = 136) -- FCB_REPLICA_SYNC only
ADD TARGET [package0].[asynchronous_bucketizer] (
    SET filtering_event_name = N'sqlos.spinlock_backoff',
    source_type = 1, -- source_type = 1 is an action
    source = N'package0.callstack') -- bucketize on the callstack
WITH (MAX_MEMORY = 50MB, MAX_DISPATCH_LATENCY = 5 seconds)
GO

-- Start the session
ALTER EVENT SESSION [WatchSpinlocks] ON SERVER STATE = START;
GO

-- TF to allow call stack resolution
DBCC TRACEON (3656, -1);
GO

-- Cause some spinlock backoffs

-- Get the callstacks from the bucketizer target
SELECT
    [event_session_address],
    [target_name],
    [execution_count],
    CAST ([target_data] AS XML)
FROM sys.dm_xe_session_targets [xst]
INNER JOIN sys.dm_xe_sessions [xs]
    ON ([xst].[event_session_address] = [xs].[address])
WHERE [xs].[name] = N'WatchSpinlocks';
GO

-- Stop the event session
ALTER EVENT SESSION [WatchSpinlocks] ON SERVER
STATE = STOP;
GO

You’ll need to download the debug symbols for the build you’re using – see here for my instructions on how to do this.

I started the event session and re-ran the test. A sampling of the call stacks is below, with matching explanations.

Explanation:

  1. Pushing a page into a snapshot just before it gets modified in the source database
  2. Reading a page from a snapshot (in this case, from one of DBCC CHECKDB‘s parallel threads performing readahead)
  3. Pulling a page into a snapshot while crash recovery is running on the new snapshot to make it a transactionally-consistent view of the source database (in this case, it’s a ‘transient’ database snapshot that DBCC CHECKDB has created)

And plenty more similar call stacks.

Call stack:

  1. XeSosPkg::spinlock_backoff::Publish+0x138
    SpinlockBase::Sleep+0xc5
    SpinlockBase::Backoff+0x145
    Spinlock<136,4,1>::SpinToAcquireWithExponentialBackoff+0x169
    FCBReplicaSync::StartWrite+0x7f
    FCB::CopyPageToReplicas+0x212
    BUF::CopyOnWrite+0x60
    BPool::PrepareToDirty+0x180
    IndexPageRef::Modify+0x146
    BTreeRow::UpdateRecord+0x20ab
    IndexDataSetSession::SetDataInternal+0x9a03
    DatasetSession::SetData+0x16d
    RowsetNewSS::SetData+0x6a
    CValRow::SetDataX+0x63
    CEsExec::GeneralEval4+0xe7
    CQScanUpdateNew::GetRow+0x24b
    CQueryScan::GetRow+0x81
    CXStmtQuery::ErsqExecuteQuery+0x5be
    CXStmtDML::XretDMLExecute+0x31c
    CXStmtDML::XretExecute+0xad
    CMsqlExecContext::ExecuteStmts<0,1>+0x8bd
    CMsqlExecContext::FExecute+0xa68
    CSQLSource::Execute+0x86c
    CStmtPrepQuery::XretPrepQueryExecute+0x464
  2. XeSosPkg::spinlock_backoff::Publish+0x138
    SpinlockBase::Sleep+0xc5
    Spinlock<136,4,1>::SpinToAcquireWithExponentialBackoff+0x169
    FCBReplicaSync::StartRead+0x86
    FCB::ScatterRead+0x1b3
    RecoveryUnit::ScatterRead+0xa9
    BPool::GetFromDisk+0x719
    BPool::ReadAhead+0x7e
    MultiObjectScanner::GetNextPageAndReadAhead+0x38e
    MultiObjectScanner::GetNext+0x98
    MultiObjectScanner::GetNextPageAndBatch+0x2fc
    CheckTables::ProcessNextData+0x1bb
    CheckAggregateSingleInstance::GetNextFact+0x28e
    CTRowsetInstance::FGetNextRow+0x3c
    CUtRowset::GetNextRows+0xa0
    CQScanRmtScanNew::GetRowHelper+0x3b8
    CQScanXProducerNew::GetRowHelper+0x53
    CQScanXProducerNew::GetRow+0x15
    FnProducerOpen+0x57
    FnProducerThread+0x8c3
    SubprocEntrypoint+0xa7f
    SOS_Task::Param::Execute+0x21e
    SOS_Scheduler::RunTask+0xab
    SOS_Scheduler::ProcessTasks+0x279
  3. XeSosPkg::spinlock_backoff::Publish+0x138
    SpinlockBase::Sleep+0xc5
    Spinlock<136,4,1>::SpinToAcquireWithExponentialBackoff+0x169
    FCBReplicaSync::StartWrite+0x7f
    FCB::PullPageToReplica+0x35
    FCB::CopyPageToReplicas+0x12c
    BUF::CopyOnWrite+0x60
    BPool::PrepareToDirty+0x180
    PageRef::ModifyRow+0x24a
    IndexPageRef::Modify+0x19f2
    BTreeRow::UpdateRecord+0x20ab
    IndexDataSetSession::UndoSetData+0x4d9
    XdesRMReadWrite::IndexModify+0x61
    XdesRMReadWrite::UndoPageOperation+0x10da
    XdesRMReadWrite::RollbackToLsn+0x7d6
    RecoveryMgr::UndoRegularXacts+0xb09
    RecoveryMgr::RollbackRemaining+0x137
    RecoveryUnit::DoRollbackRecovery+0x19
    RecoveryUnit::CompleteRecovery+0x6b8
    RecoveryUnit::PhaseStart+0x87
    DBTABLE::StartupPostRecovery+0x4d
    DBTABLE::ReplicaCreateStartup+0x284
    DBMgr::SyncAndLinkReplicaRecoveryPhase+0x787
    DBMgr::CreatePhasedTransientReplica+0x717

Summary

So there you have it. The FCB_REPLICA_SYNC spinlock is to do with database snapshot reads and writes, and high numbers around it are expected with concurrent updates in the source database and reads in the snapshot.

New live online training class in October: Fixing Slow Queries, Inefficient Code, and Caching/Statistics Problems

Continuing our series of live, online classes, Erin, Jonathan, and Kimberly will be delivering their new IEQUERY: Immersion Event on Fixing Slow Queries, Inefficient Code, and Caching/Statistics Problems in October! The class will be delivered live via WebEx on October 23-25 (roughly 12-13 hours of content including Q&As; about the same as two full workshop days!) and the attendees will have lifetime access to the recordings following the end of the class.

Rather than have people try to watch a full day of training at their computer for one or more days, the class will run from 10am to 3pm PST each day, with two 90-minute teaching sessions, each followed by Q&A, and a lunch break. We chose to do this, and to spread the class over a few days, so the times work pretty well for those in the Americas, Africa, and Europe. We also realize that this is complex content, so we want to give attendees time to digest each day’s material, plus extensive Q&A.

Here are some select quotes from prior attendees of Erin’s/Jon’s/Kimberly’s online classes:

  • “Extremely pleased with the course. FAR exceeded my expectations.”
  • “Well worth the time and expense to attend. Would highly recommend this to others.”
  • “Great course – very informative – very great instructors – I am sure to be back!”
  • “Great course. Good new info for me, plus refresher on other info. Thanks!”
  • “Both Erin and Jon have a vast knowledge of not only SQL Server & tools, but also effective presentation.”
  • “Thanks for taking the time to better my knowledge of SQL and allow me to better my career.”
  • “Kimberly is incredibly knowledgeable and was able to adapt the techniques to all the different scenarios presented to her.”
  • “Great course. I could hear clearly, the content was relevant to current day problems, and provided clear instruction.”
  • “Loved the online aspect. It felt like I was there with the question ability and having the questions just answered right there. I felt I had a voice and could ask anything and the ability to watch it later made it totally worth the registration.”
  • “I really enjoyed the ability to ask questions as the course went along so that I didn’t forget what I wanted to ask while you were teaching. This allowed for questions to come through and class to continue until a good stopping point to answer the questions. Plus having the questions written from other attendees was nice for future reference instead of trying to remember from an in-person class discussion.”

The class is split into three parts, with each part taught by a different instructor:

  • Part 1/Day 1: Capturing Query Information and Analyzing Plans (presented by Erin Stellato)
    • Baselining options and considerations
    • Sources of query performance data (e.g. DMVs, Extended Events or Trace)
    • Capturing and comparing execution plans
    • Finding essential information in a plan
    • Misleading information in a plan
    • Common operators
    • Operators and memory use
    • Predicates and filters
    • Parallelism in plans
  • Part 2/Day 2: Removing Anti-Patterns in Transact-SQL (presented by Jonathan Kehayias)
    • Set based concepts for developers
    • Design considerations that affect performance
    • Reducing/eliminating row-by-row processing
      • CURSORs and WHILE Loops, scalar UDFs, TVFs
    • Understanding Sargability and eliminating index scans in code
    • Profiling during development and testing properly
  • Part 3/Day 3: How to Differentiate Caching / Statistics problems and SOLVE THEM! (presented by Kimberly L. Tripp)
    • Troubleshooting Statement Execution and Caching
      • Different ways to execute statements
      • Some statements can be cached for reuse
      • Statement auto-parameterization
      • Dynamic string execution
      • sp_executesql
      • Stored procedures
      • Literals, variables, and parameters
      • The life of a plan in cache
      • Plan cache limits
      • Bringing it all together
    • Troubleshooting Plan Problems Related to Statistics (not Caching)
      • Statement selectivity
      • What kinds of statistics exist
      • How does SQL Server use statistics
      • Creating additional statistics
      • Updating statistics

The price of the class is US$699 (or US$599 for prior live, online attendees) and there’s also a combo price for all three new classes announced today.

You can get all the details here.

The class was also announced in our newsletter today, with a US$100 discount for those people who received that newsletter, valid through the end of June. All future live, online classes will always feature a discount for newsletter subscribers.

We decided to start teaching some live, online classes as we recognize that not everyone can travel to our in-person classes, or take that time away from work or family, or simply have travel budget as well as training budget. People also have different ways they learn, some preferring in-person training, some preferring recorded, online training, and some preferring live, online training.

We’ll be doing more of these so stay tuned for updates (and discounts through the newsletter).

We hope you can join us!

New live online training class in October: Transactions, Locking, Blocking, Isolation, and Versioning

Continuing our series of live, online classes, Kimberly will be delivering her new IETLB: Immersion Event on Transactions, Locking, Blocking, Isolation, and Versioning in October! The class will be delivered live via WebEx on October 9-11 (roughly 12-13 hours of content including Q&As; about the same as two full workshop days!) and the attendees will have lifetime access to the recordings following the end of the class.

Rather than have people try to watch a full day of training at their computer for one or more days, the class will run from 10am to 3pm PST each day, with two 90-minute teaching sessions, each followed by Q&A, and a lunch break. We chose to do this, and to spread the class over a few days, so the times work pretty well for those in the Americas, Africa, and Europe. We also realize that this is complex content, so we want to give attendees time to digest each day’s material, plus extensive Q&A.

Here are some select quotes from prior attendees of Kimberly’s live, online classes:

  • “Kimberly is incredibly knowledgeable and was able to adapt the techniques to all the different scenarios presented to her.”
  • “The best educator I’ve ever seen. She makes complex concepts “magically” easy to grasp. Incred-amazing.”
  • “Great course. I could hear clearly, the content was relevant to current day problems, and provided clear instruction.”
  • “This was REALLY good. Getting to an IE is tough there are only a few a year and more importantly because there are only a few they cover a pretty broad range of information. Since I do mainly database design and query tuning, I can’t justify much beyond IE1 as I don’t do day to day DBA work. Seeing you were offering the online event focused on a specific topic – very large tables – was PERFECT. I know I really need to improve my knowledge of the options in that area. I recalled the PV/PT architecture from IE1 and knew a refresher, coupled with the new information would be perfect. The cost was BEYOND reasonable. The time frame, at only about 1/2 a day, was easy to justify and easy to manage keeping up with regular work while I did it. So this worked out to be a perfect event.” – Todd Everett, USA
  • “Loved the online aspect. It felt like I was there with the question ability and having the questions just answered right there. I felt I had a voice and could ask anything and the ability to watch it later made it totally worth the registration.”
  • “I really enjoyed the ability to ask questions as the course went along so that I didn’t forget what I wanted to ask while you were teaching. This allowed for questions to come through and class to continue until a good stopping point to answer the questions. Plus having the questions written from other attendees was nice for future reference instead of trying to remember from an in-person class discussion.”

The modules covered will be:

  • Batches, Transactions, and Error Handling
  • The Anatomy of a Data Modification
  • Locking / Isolation
  • Table Maintenance and Schema Locks
  • Locking, Blocking, and an Intro to Deadlocks
  • Versioning

The price of the class is US$699 (or US$599 for prior live, online attendees) and there’s also a combo price for all three new classes announced today.

You can get all the details here.

The class was also announced in our newsletter today, with a US$100 discount for those people who received that newsletter, valid through the end of June. All future live, online classes will always feature a discount for newsletter subscribers.

We decided to start teaching some live, online classes as we recognize that not everyone can travel to our in-person classes, or take that time away from work or family, or simply have travel budget as well as training budget. People also have different ways they learn, some preferring in-person training, some preferring recorded, online training, and some preferring live, online training.

We’ll be doing more of these so stay tuned for updates (and discounts through the newsletter).

We hope you can join us!

New live online training class in August: Planning and Implementing an Upgrade/Migration to SQL Server 2017

Continuing our series of live, online classes, Glenn will be delivering his new IEPUM2017: Immersion Event on Planning and Implementing an Upgrade/Migration to SQL Server 2017 in August! The class will be delivered live via WebEx on August 28-30 (roughly 12-13 hours of content including Q&As; about the same as two full workshop days!) and the attendees will have lifetime access to the recordings following the end of the class.

Rather than have people try to watch a full day of training at their computer for one or more days, the class will run from 10am to 3pm PST each day, with two 90-minute teaching sessions, each followed by Q&A, and a lunch break. We chose to do this, and to spread the class over a few days, so the times work pretty well for those in the Americas, Africa, and Europe. We also realize that this is complex content, so we want to give attendees time to digest each day’s material, plus extensive Q&A.

Here are some select quotes from prior attendees of Glenn’s in-person classes:

  • “Glenn is always so patient in answering my numerous questions.”
  • “Course information was very relevant since we are in the midst of migrating our on-premise production environment to the cloud and upgrading to SQL Server 2016 or 2017.”
  • “Customers stories were a nice complement to the materials.”
  • “Great info for installing from the ground up.”

The modules covered will be:

  • Upgrade Planning
  • Hardware and Storage Selection
  • SQL Server 2017 Installation and Configuration
  • Upgrade Testing
  • Migration Planning
  • Production Migration Methods

The price of the class is US$699 (or US$599 for prior live, online attendees) and there’s also a combo price for all three new classes announced today.

You can get all the details here.

The class was also announced in our newsletter today, with a US$100 discount for those people who received that newsletter, valid through the end of June. All future live, online classes will always feature a discount for newsletter subscribers.

We decided to start teaching some live, online classes as we recognize that not everyone can travel to our in-person classes, or take that time away from work or family, or simply have travel budget as well as training budget. People also have different ways they learn, some preferring in-person training, some preferring recorded, online training, and some preferring live, online training.

We’ll be doing more of these so stay tuned for updates (and discounts through the newsletter).

We hope you can join us!

CXCONSUMER wait type – history and what you need to know

I’m a little bit overdue to blog about this new wait type, but I wanted to wait until SQL Server 2016 SP2 had been released for a bit and people started seeing it show up.

History

Back in September 2016 I created a Connect item (3102145) to split the CXPACKET wait into benign and actionable waits, so that when a CXPACKET wait happens, there’s really something to investigate. Unfortunately the Connect website has been shut down by Microsoft so I can’t point you to the original request I made (I also tried in the Internet Archive but couldn’t find it).

The reason I wanted this change is that CXPACKET waits have always been registered by both producer and consumer threads for query plan operators where some threads produce data (i.e. producer threads) and some threads consume the produced data (i.e. consumer threads). However, the waits for the consumer threads are not actionable, because it’s the *producer* threads that are the cause of the consumer thread waits, and so it’s the producer thread waits that are actionable. By splitting the consumer waits out, the number of CXPACKET waits should be reduced, and those that are left should be investigated.

During the PASS Summit in 2017, my friend Pedro Lopes (b|t) on the Tiger Team announced that they’d made the change. The new wait type, CXCONSUMER, was initially added to SQL Server 2017 RTM CU3 and was added to SQL Server 2016 SP2 in April. You can read Pedro’s blog post here. Note that there was initially a bug with the implementation which was fixed in SQL Server 2017 RTM CU4.

What You Need To Know

CXCONSUMER waits can be ignored. And I’ve seen control threads (thread ID = 0) show CXPACKET or CXCONSUMER, depending on what’s happening in the query plan.
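
If you want a quick look at how the two wait types compare on your server, a simple query of the aggregate wait stats (cumulative since the instance started or the stats were cleared) is enough:

-- Compare accumulated CXPACKET vs. CXCONSUMER waits
SELECT
    [wait_type],
    [waiting_tasks_count],
    [wait_time_ms],
    [signal_wait_time_ms]
FROM sys.dm_os_wait_stats
WHERE [wait_type] IN (N'CXPACKET', N'CXCONSUMER')
ORDER BY [wait_time_ms] DESC;
GO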

I’ve added a CXCONSUMER page to my waits library, and added it to the ‘ignore’ list in all published locations of my wait stats script.

PS And if you do have CXPACKET waits, start here. Don’t just blindly follow random internet advice to set your sp_configure ‘max degree of parallelism’ or ‘cost threshold for parallelism’ options to a set value without testing how it affects your workload and whether a higher or lower number is better for your environment.
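
If you do want to review those settings, here’s a quick sketch that just displays the current values (they’re advanced options, so ‘show advanced options’ has to be enabled to see them):

-- Display the current parallelism-related settings before testing any changes
EXEC sp_configure N'show advanced options', 1;
RECONFIGURE;
EXEC sp_configure N'max degree of parallelism';
EXEC sp_configure N'cost threshold for parallelism';
GO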

SQLskills SQL101: Why do some wait types need to be ignored?

SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.

Wait statistics analysis is one of my favorite things to talk about because it’s so incredibly useful for performance tuning and can dramatically shorten the time it takes to zero in on the root cause of a performance problem. But you have to do it correctly. You can’t just do a SELECT * FROM sys.dm_os_wait_stats. Various people have published scripts online to aggregate and display wait statistics in an actionable way, and my script is one of the most popular (latest version is always in this post).

One question I’m often asked is why my script has a list of wait types that it specifically filters out. The answer is that those wait types are what I call ‘benign’ – they’re usually not a problem, but they happen frequently enough from regular SQL Server operations that they would show up as the top waits and so would obscure the waits that you can actually do something about.
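
To give a flavor of what the filtering looks like (my full script filters a much longer list, and calculates percentages and averages), here’s a simplified sketch that just excludes a handful of the benign wait types you’ll see in the unfiltered output below:

-- Simplified sketch: top waits, excluding a few common benign wait types
SELECT TOP (10)
    [wait_type],
    [wait_time_ms] / 1000.0 AS [Wait_S],
    [signal_wait_time_ms] / 1000.0 AS [Signal_S],
    [waiting_tasks_count] AS [WaitCount]
FROM sys.dm_os_wait_stats
WHERE [wait_type] NOT IN (
    N'SLEEP_TASK', N'DIRTY_PAGE_POLL', N'LOGMGR_QUEUE',
    N'REQUEST_FOR_DEADLOCK_SEARCH', N'LAZYWRITER_SLEEP',
    N'CHECKPOINT_QUEUE', N'XE_TIMER_EVENT', N'SQLTRACE_INCREMENTAL_FLUSH_SLEEP')
ORDER BY [wait_time_ms] DESC;
GO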

For instance, if I take my waits script and remove all the filtering of benign waits, the results on my laptop where I’m forcing a tempdb contention problem are as follows:

WaitType                            Wait_S    Resource_S  Signal_S WaitCount Percentage AvgWait_S AvgRes_S AvgSig_S Help/Info URL
----------------------------------- --------- ----------  -------- --------- ---------- --------- -------- -------- -----------------------------------------------------------------------
SLEEP_TASK                          123335.21 123326.43   8.77     5232828   10.68      0.0236    0.0236   0.0000   https://www.sqlskills.com/help/waits/SLEEP_TASK
DIRTY_PAGE_POLL                     82215.60  82214.61    0.98     808502    7.12       0.1017    0.1017   0.0000   https://www.sqlskills.com/help/waits/DIRTY_PAGE_POLL
HADR_FILESTREAM_IOMGR_IOCOMPLETION  82215.08  82214.43    0.65     163809    7.12       0.5019    0.5019   0.0000   https://www.sqlskills.com/help/waits/HADR_FILESTREAM_IOMGR_IOCOMPLETION
LOGMGR_QUEUE                        82213.89  82210.58    3.31     669980    7.12       0.1227    0.1227   0.0000   https://www.sqlskills.com/help/waits/LOGMGR_QUEUE
SQLTRACE_INCREMENTAL_FLUSH_SLEEP    82212.97  82212.94    0.03     20546     7.12       4.0014    4.0014   0.0000   https://www.sqlskills.com/help/waits/SQLTRACE_INCREMENTAL_FLUSH_SLEEP
REQUEST_FOR_DEADLOCK_SEARCH         82212.74  0.00        82212.74 16442     7.12       5.0002    0.0000   5.0002   https://www.sqlskills.com/help/waits/REQUEST_FOR_DEADLOCK_SEARCH
LAZYWRITER_SLEEP                    82210.41  82209.82    0.59     86524     7.12       0.9501    0.9501   0.0000   https://www.sqlskills.com/help/waits/LAZYWRITER_SLEEP
CHECKPOINT_QUEUE                    82204.96  82204.92    0.04     125       7.12       657.6396  657.6394 0.0003   https://www.sqlskills.com/help/waits/CHECKPOINT_QUEUE
XE_TIMER_EVENT                      82204.08  0.00        82204.08 37409     7.12       2.1974    0.0000   2.1974   https://www.sqlskills.com/help/waits/XE_TIMER_EVENT
QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP 82201.37  82201.36    0.01     1371      7.12       59.9572   59.9572  0.0000   https://www.sqlskills.com/help/waits/QDS_CLEANUP_STALE_QUERIES_TASK_MAIN_LOOP_SLEEP
QDS_PERSIST_TASK_MAIN_LOOP_SLEEP    82201.29  82201.28    0.01     1371      7.12       59.9572   59.9572  0.0000   https://www.sqlskills.com/help/waits/QDS_PERSIST_TASK_MAIN_LOOP_SLEEP
SP_SERVER_DIAGNOSTICS_SLEEP         82200.36  0.00        82200.36 299612    7.12       0.2744    0.0000   0.2744   https://www.sqlskills.com/help/waits/SP_SERVER_DIAGNOSTICS_SLEEP
XE_DISPATCHER_WAIT                  82198.10  82198.10    0.00     686       7.12       119.8223  119.8223 0.0000   https://www.sqlskills.com/help/waits/XE_DISPATCHER_WAIT

All of these are benign waits happening on system threads. For instance, you can see that the lazy writer thread on my instance is waking up every 1 second or so to check for memory pressure in the buffer pool, finding none, and then sleeping again (that’s the LAZYWRITER_SLEEP wait type, showing an average resource wait time of 0.95s in the AvgRes_S column). You can also see that nearly all of these waits have a total wait time of around 82,200 seconds, which is how long my laptop has been running since its last reboot.

The point is that these waits always occur and if you don’t filter them out, they will show up as the most prevalent wait types on your instance, and they’re not related to performance issues.

When I put the filters back in, and re-run the script, I get the following output:

WaitType                            Wait_S    Resource_S  Signal_S WaitCount Percentage AvgWait_S AvgRes_S AvgSig_S Help/Info URL
----------------------------------- --------- ----------  -------- --------- ---------- --------- -------- -------- -----------------------------------------------------------------------
PAGELATCH_UP                        3451.97   3312.34     139.63   502282    56.73      0.0069    0.0066   0.0003   https://www.sqlskills.com/help/waits/PAGELATCH_UP
PAGELATCH_SH                        2324.96   1449.37     875.60   2030686   38.21      0.0011    0.0007   0.0004   https://www.sqlskills.com/help/waits/PAGELATCH_SH
LATCH_EX                            217.89    214.96      2.94     7628      3.58       0.0286    0.0282   0.0004   https://www.sqlskills.com/help/waits/LATCH_EX

Now I can see an indication of a problem and know to go look at my sys.dm_os_waiting_tasks script (latest version always here) for further investigation (and note the automatically-generated URLs which will take you to the relevant page of my waits library for explanation of the wait types and troubleshooting advice).
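
As a simplified sketch of that next step, looking at what’s waiting right now (my full waiting tasks script adds a lot more context):

-- Simplified sketch: what's waiting right now, and on what resource
-- For tempdb allocation contention, a PAGELATCH resource_description looks like
-- 2:1:1 (PFS page) or 2:1:3 (SGAM page)
SELECT
    [owt].[session_id],
    [owt].[wait_type],
    [owt].[wait_duration_ms],
    [owt].[blocking_session_id],
    [owt].[resource_description]
FROM sys.dm_os_waiting_tasks [owt]
WHERE [owt].[session_id] > 50   -- crude filter for user sessions
ORDER BY [owt].[wait_duration_ms] DESC;
GO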

Bottom line: always make sure you’re filtering out benign wait types so you’re not trying to troubleshoot a problem that you can’t do anything about.

Read committed doesn’t guarantee much…

A while back I was involved in an email thread where people were wondering about some ‘weird’ behavior from SQL Server. The problem was occurring on SQL Server 2016 and used the default isolation level of read committed. The scenario was as follows:

  • Create a table with a few columns
  • Batch 1: In one SSMS window, do the following (which takes 10 seconds to run):
    • Start a transaction
    • Insert 1,000 rows into the table, with a 0.01 second WAITFOR DELAY between each insert
    • Commit the transaction
  • Batch 2: In a second SSMS window:
    • Select * from the table

The ‘weird’ behavior is that when the “Batch 2” select completes, after having been blocked by the “Batch 1” transaction, it doesn’t return all 1,000 rows (even though “Batch 1” has completed). Furthermore, depending on when the “Batch 2” select is started during the 10 seconds that “Batch 1” executes, “Batch 2” returns different numbers of rows. This behavior has been reported on earlier versions of SQL Server as well. It’s easy to reproduce on SQL Server 2016/2017 and can be reproduced in all earlier versions with a single configuration change (more details in a moment).
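
Here’s a minimal repro sketch of the two batches described above. The table name and columns are just placeholders for illustration, and note that the table is deliberately a heap with no indexes:

-- Setup: a simple heap with a few columns (placeholder table)
CREATE TABLE [dbo].[RCTest] ([c1] INT, [c2] CHAR (100));
GO

-- Batch 1: run in one SSMS window (takes roughly 10 seconds)
BEGIN TRANSACTION;
DECLARE @i INT = 1;
WHILE (@i <= 1000)
BEGIN
    INSERT INTO [dbo].[RCTest] VALUES (@i, 'a');
    WAITFOR DELAY '00:00:00.010';
    SET @i += 1;
END;
COMMIT TRANSACTION;
GO

-- Batch 2: run in a second SSMS window while Batch 1 is executing
SELECT * FROM [dbo].[RCTest];
GO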

Additionally, if the table has a clustered index created, 1,000 rows are returned every time, on all versions of SQL Server.

So why is this weird? Many people expect that all 1,000 rows will be returned every time AND that the structure of the table or the version of SQL Server should not make any difference.

Unfortunately, that assumption is not correct when using read committed. The read committed isolation level guarantees that only committed data will be read; it does NOT guarantee how much of the committed data will be returned and is prone to inconsistent analysis problems. If you want to avoid inconsistent analysis, you’d need to increase your isolation level (if using locking) or change to versioning for read committed (or even snapshot isolation).

However, I do agree that the assumption is reasonable, even though it’s not correct.

The guarantee itself is straightforward, however: SQL Server guarantees that only committed data will be read, not how much of it will be read, and so the amount of committed data returned in this case is variable. Furthermore, depending on the table structure, the SQL Server version, and a configuration option (which I’ve mentioned twice now, but haven’t explained as I don’t want to spoil my story…), the number of rows returned will vary wildly.

So, what’s going on? Why the discrepancy in the number of rows?

It’s because the table is a heap.

The ‘weird’ behavior manifests itself when the heap has an extent allocated to it immediately, from which the first data pages are allocated. When the allocation-order scanner for the select starts, it looks at the PFS bytes for the 8 pages in the extent to see which ones are allocated. These will be scanned. Depending on when the select starts within the 10 seconds that “Batch 1” executes, there will be more rows or fewer rows read by the scanner (and returned in the “Batch 2” results) because more or fewer of the PFS byte 0x40 bits will have been set, indicating that the page in the extent is allocated.
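
If you want to see those PFS bytes for yourself, you can use the undocumented DBCC PAGE command on the PFS page (page 1 in each data file). A quick sketch, using a placeholder database name:

-- Dump the first PFS page of the database's primary data file
DBCC TRACEON (3604);    -- send DBCC PAGE output to the console
GO
DBCC PAGE (N'TestDB', 1, 1, 3);
GO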

For a table to have an extent allocated to it immediately, mixed pages have to be disabled. That’s the default from SQL Server 2016 onward, which is why the ‘weird’ behavior manifests with a small number of rows on SQL Server 2016. However, if you’ve enabled trace flag 1118, you’ll see this behavior in all versions. This is why some people report seeing the ‘weird’ behavior on versions earlier than SQL Server 2016.

When mixed extents are NOT disabled, i.e. in versions earlier than SQL Server 2016 where trace flag 1118 is not enabled, the first 8 pages allocated are mixed pages, and so the allocation-order scanner has to pick them up individually and sees them all (without going into details of the synchronization around the single-page slot array on the first IAM page in the IAM chain/allocation unit for the table). So as long as no more than 8 data pages are used by the insert, all the rows on them will be returned by the select in this example.

So there you have it, the behavior is “by design” and hasn’t changed at all, but is understandably confusing unless you know exactly how the Engine is performing the select under the covers.

And if you create a clustered index, the ‘weird’ behavior doesn’t occur for this example in any version. This is because the select gets hung up on the S(hare) row lock for the first row in the clustered index, and by the time the select is granted the S lock, the insert has completed and the index leaf scan picks up all 1,000 rows.

Don’t think that using a clustered index stops ‘weird’ behavior under read committed; it doesn’t. However, it can reduce some forms of inconsistencies. In this old blog post, I demonstrate a scenario where a scan of a clustered index returns four rows when only three rows exist, using read committed. Kimberly also describes and shows “inconsistent analysis” in her blog posts on the topic.

So to summarize: the read committed isolation level guarantees that only committed data is read; however, there are many forms of potential concerns (known as “inconsistent analysis”). It really pays to understand what these are and what the various isolation levels that SQL Server provides do and do not allow. In addition to the blog posts above, check out the MSDN lesson titled Understanding the Available Transaction Isolation Levels here.

SQLskills SQL101: Why does repair invalidate replication subscriptions?

SQLskills has an ongoing initiative to blog about basic topics, which we’re calling SQL101. We’re all blogging about things that we often see done incorrectly, technologies used the wrong way, or where there are many misunderstandings that lead to serious problems. If you want to find all of our SQLskills SQL101 blog posts, check out SQLskills.com/help/SQL101.

Whenever I’m teaching about recovering from corruption, I always stress that if the REPAIR_ALLOW_DATA_LOSS option of DBCC CHECKDB (which I’ll just call ‘repair’ from now on) has to be used, you should do the following (there’s a minimal T-SQL sketch after the list):

  • Run another DBCC CHECKDB to make sure that repair fixed everything and no new problems were exposed (by the repair fixing something that was blocking DBCC CHECKDB‘s access to a portion of the database)
  • Run DBCC CHECKCONSTRAINTS on any tables or indexes that were repaired and are involved in constraints (stated in Books Online too)
  • Reinitialize any replication subscriptions that are based on tables that were repaired (stated in Books Online too)
  • Reevaluate your HA/DR strategy so you don’t have to use repair again in future
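
As a minimal sketch of the first two steps (placeholder database and table names, and remember that repair requires single-user mode):

-- Run repair, then re-check the database and any repaired constraints
ALTER DATABASE [MyDB] SET SINGLE_USER WITH ROLLBACK IMMEDIATE;
DBCC CHECKDB (N'MyDB', REPAIR_ALLOW_DATA_LOSS);
GO
-- Make sure repair fixed everything and didn't expose new problems
DBCC CHECKDB (N'MyDB') WITH NO_INFOMSGS;
GO
-- Check constraints on tables that were repaired
DBCC CHECKCONSTRAINTS (N'MyRepairedTable');
GO
ALTER DATABASE [MyDB] SET MULTI_USER;
GO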

One question I’m often asked is why replication can be broken by running repair.

There are two ways that replication can be affected: repairs on replication metadata tables, and repairs on anything else to do with a subscription.

Repairs on replication metadata tables

This is the simplest case to explain. If the repair operation affects any of the replication metadata tables (i.e. deleted some data from them), the entire replication publication will be in an inconsistent state and you should remove replication completely from the database and reinitialize it. This isn’t limited to a single subscription – all replication should be reconfigured.

Repairs on anything else

Transactional replication captures changes to the publication database by analyzing the transaction log, looking for transactions that change data in any of the publications, and converting those operations into logical operations that can be applied to the subscribers. Merge replication captures changes to the publication database using DML triggers, and converts those operations into logical operations that can be applied to the subscribers.

Neither of these mechanisms can capture the operations performed by repair. Repair operations are always direct physical changes to the database structures to fix inconsistencies in the structures (e.g. a database page, table record, or a linkage between two pages), as opposed to physical changes resulting from queries performing inserts, updates, or deletes on tables.

These repair operations cannot be translated into logical operations that can be applied to replication subscribers, because there are no logical operations that can be expressed using T-SQL for the equivalent of the direct structural changes that repair is performing. Replication does not preserve the exact physical location of a particular record between the publication and subscription databases, so a direct change to record Y on page X in the publication database could not be replayed on the subscription database (remember, replication ships logical changes, not physical changes). This means that if any table that is part of a replication publication is changed by a repair operation, the replication subscription is no longer valid and must be reinitialized.

As an example, imagine a repair operation is forced to remove a data page from a unique clustered index (essentially deleting some table records), and the subscription is NOT reinitialized. Those records would still exist on the replicated copy of the table. If a subsequent insert operation inserted records with cluster key values corresponding to the records deleted by the repair operation, the Distribution Agent would fail when trying to apply the inserts to the replicated copy, because the repair operation was not applied to the subscription database and a duplicate key violation error would occur when attempting to apply the insert to the replicated table.

A replication subscription must always be reinitialized if any table in the publication is affected by a repair operation, or the replication metadata tables are repaired.
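
For transactional replication, reinitializing a subscription boils down to something like the following sketch, run at the Publisher on the publication database (the publication and subscriber names are placeholders):

-- Mark the subscription for reinitialization so a snapshot is (re)applied to the subscriber
EXEC sp_reinitsubscription
    @publication = N'MyPublication',
    @article = N'all',
    @subscriber = N'MySubscriberServer';
GO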

Thanks

New live online training class on using Query Store in May!

Continuing our series of live, online classes, Erin will be delivering her new IEQS: Immersion Event on Solving Common Performance Problems with Query Store in May! The class will be delivered live via WebEx on May 22-23 (roughly 12-13 hours of content including Q&As; about the same as two full workshop days!) and the recording will be available to attendees for six months following the end of the class.

Rather than have people try to watch a full day of training at their computer for one or more days, the class will run from 10am to 3pm PST each day, with two 90-minute teaching sessions, each followed by Q&A, and a lunch break. We chose to do this, and to spread the class over a few days, so the times work pretty well for those in the Americas, Africa, and Europe. We also realize that this is complex content, so we want to give attendees time to digest each day’s material, plus extensive Q&A.

Here are some select quotes from prior attendees of Erin’s in-person classes:

  • “Extremely pleased with the course. FAR exceeded my expectations.”
  • “Well worth the time and expense to attend. Would highly recommend this to others.”
  • “I thought this was extremely worthwhile. I have been a DBA for a few years and this was a great refresher for a lot of things I don’t do very often but should.”
  • “Found class extremely helpful and plan to take the next one.”
  • “Great course – very informative – very great instructors – I am sure to be back!”
  • “I was familiar with all of the topics in this course, but barely understood most. I feel I will be able to begin to do all of the things I need to take simpler tasks from our DBAs to allow them to focus on the more advanced tasks. That was exactly my goal for taking this class.”
  • “Great course. Good new info for me, plus refresher on other info. Thanks!”
  • “Both Erin and Jon have a vast knowledge of not only SQL Server & tools, but also effective presentation.”
  • “Thanks for taking the time to better my knowledge of SQL and allow me to better my career.”

The modules covered will be:

  • Query Store Fundamentals
  • Understanding the Query Store Data
  • Workload Characteristics and Query Store Performance
  • Finding Performance Issues
  • Forcing Plans
  • Automatic Tuning
  • Other Uses of Query Store
  • Visualizing Query Store Data

The price of the class is US$795 (or US$695 for prior live, online attendees) and you can get all the details here.

The class was also announced in our newsletter today, with a US$100 discount for those people who received that newsletter, valid for two weeks. All future live, online classes will always feature a two-week discount for newsletter subscribers.

We decided to start teaching some live, online classes as we recognize that not everyone can travel to our in-person classes, or take that time away from work or family, or simply have travel budget as well as training budget. People also have different ways they learn, some preferring in-person training, some preferring recorded, online training, and some preferring live, online training.

We’ll be doing more of these in 2018, on a variety of topics, so stay tuned for updates (and discounts through the newsletter).

We hope you can join us!