Performance bug: NOLOCK scans involving off-row LOB data

Here's an interesting bug that was just recently fixed, and someone actually hit it today, so it's time to blog it.

Bottom line is that if you're doing NOLOCK scans of tables involving LOB data, the performance might suck. 2005 SP3 CU3 has the bug fix, but unfortunately the KB article was written by someone who didn't seem to know what they were talking about, so I suggested it be rewritten, which it has been (but not republished yet). The original KB article is at http://support.microsoft.com/kb/961049/. It's also fixed in 2008, but I don't know in which build (I believe CU4 at least).
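If you want to check whether the instance you're looking at is on a build that includes the fix, the version properties are the quickest place to start (just a minimal sketch, nothing bug-specific about it):

```sql
-- Report the build and servicing level so it can be compared against the
-- builds that contain the fix (2005 SP3 CU3, or the relevant 2008 CU).
SELECT
    SERVERPROPERTY ('ProductVersion') AS ProductVersion, -- e.g. 9.00.xxxx / 10.0.xxxx
    SERVERPROPERTY ('ProductLevel')   AS ProductLevel,   -- RTM, SPn
    SERVERPROPERTY ('Edition')        AS Edition;
```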

Here's my explanation of the problem.

LOB data can be stored in-row or off-row (my previous LOB post Importance of choosing the right LOB storage technique has more details). When it's stored off-row, it must be accessed by first reading the pointer to the LOB data from the data/index record, and then following the pointer (remember that 'record' is synonymous with 'row'). When a LOB data value is updated, the off-row value is updated first, and then if the off-row link changed, the data/index record is updated with the new link. There's obviously a window here where someone reading the data/index record might see the wrong/non-existent off-row pointer. This is exactly what can happen when a NOLOCK scan occurs.
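To make that concrete, here's a minimal sketch (the table and column names are made up for illustration) of a table whose LOB values are forced off-row, plus the kind of NOLOCK scan that has to follow the off-row pointers:

```sql
-- Hypothetical demo table with a LOB column (names invented for illustration).
CREATE TABLE dbo.LobDemo
(
    LobDemoID INT IDENTITY PRIMARY KEY,
    Payload   VARCHAR (MAX)
);
GO

-- Force the LOB values to always be stored off-row, so every read of Payload
-- has to read the off-row pointer from the data record and then follow it.
EXEC sp_tableoption 'dbo.LobDemo', 'large value types out of row', 1;
GO

-- A ~20KB value, comfortably off-row.
INSERT INTO dbo.LobDemo (Payload)
VALUES (REPLICATE (CONVERT (VARCHAR (MAX), 'a'), 20000));
GO

-- A NOLOCK scan like this is the scenario described above: read the data
-- record, pick up the off-row pointer, follow it to the LOB pages.
SELECT LobDemoID, Payload
FROM dbo.LobDemo WITH (NOLOCK);
```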

To mitigate the possibility of a NOLOCK scan trying to follow a bad off-row link, the old behavior was to scan all the IAM pages (see Inside the Storage Engine: IAM pages, IAM chains, and allocation units) for the table/index to make sure that the off-row link actually pointed to a page allocated to the table/index. If there are lots of IAM pages, this means lots of logical IOs, and poorly performing SELECT queries. And it does the scan once for *every row*. The person who hit it today had a 500-row SELECT of ~20KB per row taking 20 seconds – 10MB of physical IOs and 30MB of logical IOs!
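If you suspect you're hitting this, SET STATISTICS IO is an easy way to see the extra reads (the table below is just the hypothetical demo table from the sketch above):

```sql
-- Compare the IO cost of the NOLOCK scan against the same scan without the
-- hint. On an affected build, the NOLOCK version can show far more logical
-- reads (including 'lob logical reads') because of the repeated IAM scans.
SET STATISTICS IO ON;

SELECT LobDemoID, Payload FROM dbo.LobDemo WITH (NOLOCK);
SELECT LobDemoID, Payload FROM dbo.LobDemo;

SET STATISTICS IO OFF;
```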

The fix is to make further use of an already existing in-memory cache of IAM pages to do a quick lookup of the right IAM page covering the GAM interval of the LOB page being read, without having to scan the whole IAM chain.

Hope this helps explain things, and track down perf problems for some of you.

Blog series on computer memory architectures

(Thanks to David Baffaleuf for pointing this out to me)

Ever wanted to know in a bit more depth how memory works in your laptop or servers? This comprehensive series of blog posts will give you a great overview. It's very clearly written and goes very deep. At times it talks about how things show up in the Linux OS, but all the hardware details and concepts (caches, NUMA, etc.) translate exactly into our world.

It covers:

  • Part 1 (Introduction to memory)
  • Part 2 (CPU caches)
  • Part 3 (Virtual memory)
  • Part 4 (NUMA systems)
  • Part 5 (What programmers can do – cache optimization)
  • Part 6 (What programmers can do – multi-threaded optimizations)
  • Part 7 (Memory performance tools)
  • Part 8 (Future technologies)
  • Part 9 (Appendices and bibliography)

[Edit: And as Kalen points out in her comment below, you can get the whole thing as a PDF here.]

Enjoy!

PS This is something I'll start posting about during the rest of the year too, as far as SQL Server is concerned.

Why isn't there automatic <insert feature here> in SQL Server?

I was in a discussion earlier today, and it's one I've had lots of times in the past – both inside and outside Microsoft and the SQL team: why doesn't SQL Server do automatic <defrag, index creation, refactoring, de/normalization, backups, CHECKDBs, etc>?

Some of these were considered while I was still on the team, and I was *very* cautious about them. Here's why, taking automatic table de/normalization as an example (this morning's discussion). Assuming the logic to do the de/normalization based on usage patterns can be worked out, here's why I don't think it will ever happen (just off the top of my head):

  • If it's wrong even once, no one will ever use it again.
  • When should SQL Server change the table definitions? What time of day is best?
  • Where should the new table be placed? In which filegroup?
  • What if there's no space to do it – should the database autogrow because of an automatic process? And then should it shrink afterwards?
  • How far should the plan for the change be parallelized?
  • Who manages the transaction log size if the table is really large? Automatically take log backups? What if there's no space for the log backups? What about space on the log shipping secondaries? And the time required to roll forward the log on the log shipping secondaries? (There's a monitoring sketch after this list.)
  • What about the impact of extra transaction log on database mirroring? Both in terms of the huge SEND queue preventing log truncation, and the huge REDO queue preventing the mirror coming online quickly?
  • What if a cluster failover occurs during the operation and the rollback prevents the database coming online within the company's RTO?
  • What if there are replication topologies set up? How do they get quiesced for the schema change?
  • How does the app get changed to handle the different schema, or do views get created automatically? Indexed views or not?
  • What if Change Data Capture is running? Should it automatically create a new capture instance? What if both allowed capture instances are already in use?
  • How do stored procedures get updated to the new schema? Build the datadude engine into SQL Server?
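Just to ground the log and mirroring bullets above, here's a minimal sketch of the kind of monitoring any such automatic operation would have to do for itself. DBCC SQLPERF (LOGSPACE) is standard; the mirroring counter names are what I'd expect from the Database Mirroring performance object, but verify them on your own build:

```sql
-- How full is each transaction log right now?
DBCC SQLPERF (LOGSPACE);
GO

-- Database mirroring SEND and REDO queue sizes, per mirrored database.
-- Counter names assumed from the SQLServer:Database Mirroring object -
-- check what sys.dm_os_performance_counters reports on your instance.
SELECT  instance_name AS database_name,
        counter_name,
        cntr_value    AS size_kb
FROM    sys.dm_os_performance_counters
WHERE   object_name LIKE '%Database Mirroring%'
    AND counter_name IN ('Log Send Queue KB', 'Redo Queue KB');
```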

My point: it's a lot harder than you think to just put some automatic behavior into SQL Server. 

Now, having said that, I hope that some of the others do happen at some point, but with lots of config parameters so I can control them.