Do changes to index keys really do in-place updates?

There was an interesting discussion on Twitter this morning (started by my good friend Erin Stellato (blog|twitter)) about in-place updates for index records where the key value changes and the record remains on the same page. Various sources including the SQL Server 2008 Internals book (pg 361 – I didn't write or review that bit :-) describe the process, but they describe it incorrectly unfortunately. They say that the record will remain in the exact same location on the same page and only the key storage bytes need to change.

This is not true. True in-place updates of index key values do not happen in 2005 onwards. Let's go down the rabbit hole…

Here's a simple example:

CREATE DATABASE KeyUpdateTest;
GO
USE KeyUpdateTest;
GO
ALTER DATABASE KeyUpdateTest SET RECOVERY SIMPLE;
ALTER DATABASE KeyUpdateTest SET AUTO_CREATE_STATISTICS OFF;
ALTER DATABASE KeyUpdateTest SET AUTO_UPDATE_STATISTICS OFF;
GO
CREATE TABLE test (c1 INT, c2 VARCHAR (2000));
GO
CREATE CLUSTERED INDEX test_cl ON test (c1);
GO
CHECKPOINT;
GO
INSERT INTO test VALUES (1, REPLICATE ('Paul', 500));
GO
CHECKPOINT;
GO

Now we look at the page itself (and you can get the stored-proc I use below from my old blog post Inside The Storage Engine: sp_AllocationMetadata – putting undocumented system catalog views to work):

EXEC sp_allocationMetadata 'test';
GO

Object Name Index ID Alloc Unit ID     Alloc Unit Type First Page Root Page First IAM Page
———– ——– —————– ————— ———- ——— ————–
test        1        72057594039762944 IN_ROW_DATA     (1:144)    (1:144)   (1:145)

DBCC TRACEON (3604);
GO
DBCC PAGE (keyupdatetest, 1, 144, 2);
GO

Memory Dump @0x0000000016A6A000

0000000016A6A000:   01010400 00c20001 00000000 00000800 †…..Â……….
0000000016A6A010:   00000000 00000100 1c000000 bd174108 †…………½.A.
0000000016A6A020:   90000000 01000000 1d000000 8a000000 †…………Š…
0000000016A6A030:   12000000 00000000 00000000 239dbd4a †…………#.½J
0000000016A6A040:   00000000 00000000 00000000 00000000 †…………….
0000000016A6A050:   00000000 00000000 00000000 00000000 †…………….
0000000016A6A060:   30000800 01000000 03000002 001100e1 †0…………..á
0000000016A6A070:   07506175 6c506175 6c506175 6c506175 †.PaulPaulPaulPau
0000000016A6A080:   6c506175 6c506175 6c506175 6c506175 †lPaulPaulPaulPau
0000000016A6A090:   6c506175 6c506175 6c506175 6c506175 †lPaulPaulPaulPau
0000000016A6A0A0:   6c506175 6c506175 6c506175 6c506175 †lPaulPaulPaulPau
0000000016A6A0B0:   6c506175 6c506175 6c506175 6c506175 †lPaulPaulPaulPau
<snip>
0000000016A6A820:   6c506175 6c506175 6c506175 6c506175 †lPaulPaulPaulPau
0000000016A6A830:   6c506175 6c506175 6c506175 6c506175 †lPaulPaulPaulPau
0000000016A6A840:   6c000021 21212121 21212121 21212121 †l..!!!!!!!!!!!!!
0000000016A6A850:   21212121 21212121 21212121 21212121 †!!!!!!!!!!!!!!!!
<snip>
0000000016A6BFE0:   21212121 21212121 21212121 21212121 †!!!!!!!!!!!!!!!!
0000000016A6BFF0:   21212121 21212121 21212121 21216000 †!!!!!!!!!!!!!!`.

OFFSET TABLE:

Row – Offset                        
0 (0x0) – 96 (0x60)

Now I'll update the record:

UPDATE test SET c1 = 2 WHERE c1 =1;
GO

And let's look at the log records generated (in the output below I've removed all the extraneous stuff from the log that occurs before my transaction):

SELECT [Current LSN], [Operation], [Context],
 [Log Record Length], [Page ID], [Slot ID] FROM fn_dblog (NULL, NULL);
GO

Current LSN             Operation           Context               Log Record Length Page ID        Slot ID
———————– ——————- ——————— —————– ————– ———–
0000001d:00000096:0011  LOP_BEGIN_XACT      LCX_NULL              112               NULL           NULL
0000001d:00000096:0012  LOP_DELETE_ROWS     LCX_MARK_AS_GHOST     2116              0001:00000090  0
0000001d:00000096:0013  LOP_MODIFY_HEADER   LCX_PFS               76                0001:00000001  0
0000001d:00000096:0014  LOP_SET_BITS        LCX_PFS               56                0001:00000001  0
0000001d:00000096:0015  LOP_INSERT_ROWS     LCX_CLUSTERED         2120              0001:00000090  1
0000001d:00000096:0016  LOP_COMMIT_XACT     LCX_NULL              52                NULL           NULL
0000001d:000000a2:0001  LOP_EXPUNGE_ROWS    LCX_CLUSTERED         64                0001:00000090  0
0000001d:000000a2:0002  LOP_SET_BITS        LCX_PFS               56                0001:00000001  0
0000001d:000000a2:0003  LOP_MODIFY_HEADER   LCX_PFS               76                0001:00000001  0

The row is deleted and then inserted. Look at the log record lengths! And look at the slot numbers. The new row is in slot 1 (the LOP_INSERT_ROWS record) and slot 0 is ghosted (the LOP_DELETE_ROWS record), and then we see ghost cleanup running (the LOP_EXPUNGE_ROWS record), which removes the old slot 0, and the slot array shifts down so the new record is slot 0 again. See my blog post Inside the Storage Engine: Ghost cleanup in depth for info on ghost cleanup internals.

But is it in the same physical place on the page? No – a whole new record was created.

DBCC PAGE (keyupdatetest, 1, 144, 2);
GO

Memory Dump @0x0000000013CAC000

0000000013CAC000:   01010400 00800001 00000000 00000800 †…………….
0000000013CAC010:   00000000 00000100 1c000000 bd172210 †…………½.".
0000000013CAC020:   90000000 01000000 1d000000 a2000000 †…………¢…
0000000013CAC030:   01000000 76020000 00000000 239dbd4a †….v…….#.½J
0000000013CAC040:   01000000 00000000 00000000 00000000 †…………….
0000000013CAC050:   00000000 00000000 00000000 00000000 †…………….
0000000013CAC060:   3c000800 01000000 03000002 001100e1 †<…………..á
0000000013CAC070:   07506175 6c506175 6c506175 6c506175 †.PaulPaulPaulPau
0000000013CAC080:   6c506175 6c506175 6c506175 6c506175 †lPaulPaulPaulPau
0000000013CAC090:   6c506175 6c506175 6c506175 6c506175 †lPaulPaulPaulPau
<snip>
0000000013CAC820:   6c506175 6c506175 6c506175 6c506175 †lPaulPaulPaulPau
0000000013CAC830:   6c506175 6c506175 6c506175 6c506175 †lPaulPaulPaulPau
0000000013CAC840:   6c300008 00020000 00030000 02001100 †l0…………..
0000000013CAC850:   e1075061 756c5061 756c5061 756c5061 †á.PaulPaulPaulPa
0000000013CAC860:   756c5061 756c5061 756c5061 756c5061 †ulPaulPaulPaulPa
<snip>
0000000013CADFE0:   21212121 21212121 21212121 21212121 †!!!!!!!!!!!!!!!!
0000000013CADFF0:   21212121 21212121 21212121 41084108 †!!!!!!!!!!!!A.A.

OFFSET TABLE:

Row – Offset                        
0 (0x0) – 2113 (0x841)              

Notice that where the original record ended (offset 0x830 into the page). In the first page dump, everything after that is empty space. In the second page dump, the new record starts there. And look at the slot array (called the offset table by DBCC PAGE). In the first page dump the original record started at offset 0x60, and in the second dump it starts at offset 0x841. Even though ghost cleanup ran, the record did NOT migrate to it's original location.

It's very clear that an in-place update did not happen, and that the entire record is recreated in it's entirety.

In this case, the record ended up being physical slot 0 again, but that's a pathological case because there's only one slot on the page. If you repeat the experiment with two records (e.g. c1=1 and c1=2) and then update c1 of the first record to 3, you'll see the new record being physical slot 2, then after ghost cleanup becoming physical slot 1. But again, it's deleted and reinserted in it's entirety.

Now I'll show you a case where the row has to move to a new page – definitely not an in-place update!

CREATE DATABASE KeyUpdateTest;
GO
USE KeyUpdateTest;
GO
ALTER DATABASE KeyUpdateTest SET RECOVERY SIMPLE;
ALTER DATABASE KeyUpdateTest SET AUTO_CREATE_STATISTICS OFF;
ALTER DATABASE KeyUpdateTest SET AUTO_UPDATE_STATISTICS OFF;
GO
CREATE TABLE test (c1 INT, c2 VARCHAR (4000));
GO
CREATE CLUSTERED INDEX test_cl ON test (c1);
GO
CHECKPOINT;
GO
INSERT INTO test VALUES (1, REPLICATE ('Paul', 1000));
GO
INSERT INTO test VALUES (2, REPLICATE ('Erin', 1000));
GO
CHECKPOINT;
GO

I've filled up the page with two rows.

DBCC PAGE (keyupdatetest, 1, 144, 2);
GO

<snip>
Metadata: ObjectId = 2105058535      m_prevPage = (0:0)                   m_nextPage = (0:0)
pminlen = 8                          m_slotCnt = 2                        m_freeCnt = 58
m_freeData = 8130                    m_reservedCnt = 0                    m_lsn = (29:150:2)
<snip>
0000000016A6BFA0:   696e4572 696e4572 696e4572 696e4572 †inErinErinErinEr
0000000016A6BFB0:   696e4572 696e4572 696e4572 696e4572 †inErinErinErinEr
0000000016A6BFC0:   696e0000 21212121 21212121 21212121 †in..!!!!!!!!!!!!
0000000016A6BFD0:   21212121 21212121 21212121 21212121 †!!!!!!!!!!!!!!!!
0000000016A6BFE0:   21212121 21212121 21212121 21212121 †!!!!!!!!!!!!!!!!
0000000016A6BFF0:   21212121 21212121 21212121 11106000 †!!!!!!!!!!!!..`.

OFFSET TABLE:

Row – Offset                        
1 (0x1) – 4113 (0x1011)             
0 (0x0) – 96 (0x60)

Now I'll update the first row to have c1 = 3 so my row moves away from Erin's :-)

UPDATE test SET c1 = 3 WHERE c1 =1;
GO

And look at the log:

SELECT [Current LSN], [Operation], [Context],
 [Log Record Length], [Page ID], [Slot ID] FROM fn_dblog (NULL, NULL);
GO

Current LSN             Operation           Context                Log Record Length Page ID        Slot ID
———————– ——————- ———————- —————– ————– ———–
0000001d:000000a3:0015  LOP_BEGIN_XACT      LCX_NULL               136               NULL           NULL
0000001d:000000a3:0016  LOP_MODIFY_ROW      LCX_PFS                80                0001:00000001  0
0000001d:000000a3:0017  LOP_MODIFY_ROW      LCX_IAM                88                0001:00000091  0
0000001d:000000a3:0018  LOP_HOBT_DELTA      LCX_NULL               64                NULL           NULL
0000001d:000000a3:0019  LOP_FORMAT_PAGE     LCX_INDEX_INTERIOR     84                0001:00000092  -1
0000001d:000000a3:001a  LOP_INSERT_ROWS     LCX_INDEX_INTERIOR     84                0001:00000092  0
0000001d:000000a3:001b  LOP_ROOT_CHANGE     LCX_CLUSTERED          96                0001:00000082  52
0000001d:000000a3:001c  LOP_COMMIT_XACT     LCX_NULL               52                NULL           NULL
0000001d:000000a3:001d  LOP_BEGIN_XACT      LCX_NULL               120               NULL           NULL
0000001d:000000a3:001e  LOP_MODIFY_ROW      LCX_PFS                80                0001:00000001  0
0000001d:000000a3:001f  LOP_MODIFY_ROW      LCX_IAM                88                0001:00000091  0
0000001d:000000a3:0020  LOP_HOBT_DELTA      LCX_NULL               64                NULL           NULL
0000001d:000000a3:0021  LOP_FORMAT_PAGE     LCX_HEAP               84                0001:00000093  -1
0000001d:000000a3:0022  LOP_INSERT_ROWS     LCX_CLUSTERED          4092              0001:00000093  0
0000001d:000000a3:0023  LOP_DELETE_SPLIT    LCX_CLUSTERED          60                0001:00000090  1
0000001d:000000a3:0024  LOP_MODIFY_HEADER   LCX_HEAP               84                0001:00000090  0
0000001d:000000a3:0025  LOP_INSERT_ROWS     LCX_INDEX_INTERIOR     88                0001:00000092  1
0000001d:000000a3:0026  LOP_COMMIT_XACT     LCX_NULL               52                NULL           NULL
0000001d:000000a3:0027  LOP_INSERT_ROWS     LCX_CLUSTERED          4120              0001:00000093  1
0000001d:000000a3:0028  LOP_COMMIT_XACT     LCX_NULL               52                NULL           NULL

This is really interesting. Here's what happens (missing a few things for clarity):

  • 0000001d:000000a3:0016: page (1:146) is allocated to be the new root page for the clustered index
  • 0000001d:000000a3:0019: it gets minimally formatted (just the page header – although the page header is 96 bytes, there is 12 bytes of empty space at the end)
  • 0000001d:000000a3:001a: the index record pointing at (1:144) is inserted in the root page
  • 0000001d:000000a3:001b: the index root os changedd from (1:144) to (1:146) in metadata
  • 0000001d:000000a3:001e: page (1:147) is allocated to be the second data page at the leaf of the clustered index
  • 0000001d:000000a3:0022: a page split of page (1:144) occurs, moving the Erin row to slot 0 of page (1:147)
  • 0000001d:000000a3:0023: the Erin row is removed from page (1:144) – not ghosted
  • 0000001d:000000a3:0025: the index record pointing at (1:147) is inserted in the root page
  • 0000001d:000000a3:0027: the updated Paul row is inserted as slot 1 of page (1:147)

As you can see, this is as far as you can get from doing an in-place update. Page (1:144) is left with a single ghost-record on (the Paul record – the Erin record isn't ghosted because it was moved because of a split). The next access to the page will cause ghost cleanup to remove it and deallocate the page.

So no, in-place updates of key values do not happen, even though they appear to at first glance. You can experiment to show the same thing happening in nonclustered indexes too.

Another misconception busted!

PS Some of you may remember when such in-place updates really *did* happen. That was before 2005 when the Storage Engine (specifically my old Access Methods team) did the index maintenance (changing clustered and nonclustered index structures based on key value changes). A whole bunch of features in 2005 meant that we switched it over to the Query Processor to drive the index maintenance, and nifty short-cuts like we used to do aren't possible any more.

Better together – the importance of great inter-team relationships

Back at the end of January I kicked off a survey about the relationship between your DBA and dev teams. See here for the survey.

Here are the results:

 

The 22 Other values are:

  • 5 x The devs are great – it's the apps and the process that suck.
  • 4 x It depends on the development team.
  • 3 x I'm the team lead and the DBA.
  • 2 x It's fine with the internal developers, but the consultants… not so much.
  • 2 x It's non-existent. Management does not have the DBA and dev team work together until there is an issue.
  • 1 x Essentially non-existant, between all groups: Infrastructure, DEV, DBA. Management has no interest in fixing it. Total crap gets moved to prod all the time, DBA's are the bad guys cause I've stopped the practice of moving db side "things" into prod without testing.
  • 1 x It is spread out across country, DEV team think they are gods, what's a DBA?
  • 1 x The DBA/Dev relationship is good, it's the DBA/Vendor relationship that sucks because vendor apps suck!
  • 1 x The dev team have no clue what Data Architecture is all about 1The dev team is learning. Getting better all the time.
  • 1 x We are a small Admin team supporting scores of development teams. The teams are very segregated so relationships are very weak.

A very interesting variety of situations and views. Good to see that the majority of people are experiencing good relations between teams (answers 1 + 2 + 7 + 8 = 59%).

Here's a comment from the original post that sums up what I see a lot of the time when first working with clients:

The developers think the DBAs are a bunch of idle boneheads, and the DBAs think the developers are a bunch of cowboys with little regard for the availability of the live servers.

Strong words, which demonstrate a fundamental lack of understanding about what the various teams do. That is very often the root cause of all the animosity between teams – not understanding the goals and motivations of the other team. For instance, a dev team is often under a lot of pressure to get the next release finished and put into production. A DBA team needs to preserve the availability and performance characteristics of the production system. These two goals are often at odds as the dev team wants to throw code over the wall and get it deployed, whereas the DBA team wants the code to be tested to ensure it will not compromise the production system in any way.

There's also the case of those who answered #6 – one guy who's the problem between the teams. This could be someone in an architect or lead developer role, who has a lot of experience, but maybe some misconceptions, who has a domineering personality and is always right – no matter who is trying to show them that they are in fact incorrect. We've never come across this person during our consulting work, oh no. :-)

How do you work towards solving these problems?

Education.

It's no good just complaining and blaming each other when things don't go well. You need to have the other team understand *why* your team does what it does, *why* it's concerned about certain things, and *why* it operates under the pressures it does. I have a lot of experience managing development and program management teams and managing the relationships between the dev teams and the test teams, the dev teams and the build teams, the dev teams and the program management teams – so I'm speaking from experience here.

Last year in London we did a 1-day workshop where the focus was bridging the gap between development and production. We highlighted a bunch of problems we've seen over the years and how some education could have helped. Here are some of the takeaways people had from the day:

  • Avoiding known design anti-patterns – don't repeat mistakes that others have made. Once you find a performance problem, give feedback on the design anti-pattern that caused it to everyone involved. A great example of this is thinking that single-column non-clustered indexes on every table column will be useful. There are some great posts and books about SQL anti-patterns, such as this one.
  • Prototype and test design strategies – it's imperative that the developers have a test system that reproduces production workloads to the extent that proper scalability and load testing can be performed; otherwise it's not going to work in production. I see this over and over. It's also very important that developers and architects understand the ramifications of data type and schema design choices on the operational aspects of SQL Server (for instance, a LOB column in a table prevents online index operations on the clustered index).
  • Workload analysis and performance base-lining – nothing allows you to identify that a new set of code has a problem faster than having a performance baseline, and monitoring so you can easily identify that the latest code change caused a performance change.
  • Troubleshooting to identify problems rather than throwing hardware at the problem – it's always better to find the root cause of a problem rather than trying to work around it with bigger iron or faster disks. You'll eventually run into the problem again. A good example is writing a query that is non-SARGable and requires an index or table scan. I blogged about this recently here and many others have too. Implicit conversions are another great example – Jonathan has a great blog post about finding them here.
  • Build trust and partnership between teams – success is not one-sided and both teams look bad when an application/project fails. A company with teams at odds will suffer.
  • Consistent post-mortems and “lessons-learned” from problems and mistakes rather than laying blame – being able to say mea culpa and proposing changes will elevate you above your peers whether you're a dev or a DBA. Just remember that if it's someone else's fault, constructive criticism needs to be well-phrased to be well-received. Don't make it personal.

And the one we stressed the most was:

  • Work to cross-train and educate both teams for best practices

This is the most powerful way to foster relationships between teams in my experience – offering to educate the other team on, say, one issue a week. This will show them that you're willing to put skin in the game (and hopefully they'll reciprocate) and in the long-term it will save you time. And it's a great way to show initiative and that you're a team player  and motivator to your management. Some ideas for you on this:

  • After a post-mortem, put together a list of links to blog posts and articles that explain facets of the problem and how to identify and or avoid it
  • Organize a weekly self-study group that involves both teams. Use online material like our MCM prep videos as a base. Management will love this, and people will almost feel compelled to be part of it, rather than showing management that they don't care about working better together. Even the nay-sayers will eventually see the benefit and come around. Persevere!
  • Get the team leads together to voice concerns and clear the air in a pre-agreed no-offense-will-be-taken closed-door meeting. You may be surprised at what lingering misconceptions, grudges, and misunderstandings can be cleared up by letting people vent and clearing the air.
  • Bring in an outside consultant who knows what they're talking about to be an independent voice of reason with no political baggage for either team. We get asked to do this every so often – it can be challenging but very effective, especially for the "there's this one guy" problems where no-one can convince them, or wants to publicly stand up to them. Often helping the person see the light and think they've made the leap themselves is the best way forward – which can be hard to do from within the team, but a lot easier as an outsider, where we don't care about a position in the company pecking-order.
  • Send one dev and one DBA to a training class where they can both learn and bring back the learning to the teams with the two perspectives. Of all the classes we teach, the Internals and Performance class is the best for this.

To summarize, a bad working environment is toxic. You can fix it. It'll take some effort on your part but the pay-off is really worthwhile. And there are lots of resources out there to help you out.

What are you waiting for?

SQLskills hires deep technical expert Jonathan Kehayias

Today is a very exciting day for us here at SQLskills.com as we take the first step at growing as a company instead of solely partnering with other consultants with their own companies. While we value our partnerships and plan to continue many, we also value the benefits brought by having a small and dedicated team of highly technical and skilled employees.

Specifically, we’ve asked Jonathan Kehayias to become employee #3 (with Kimberly being #1 and me being #2) and we’re extremely pleased that he accepted our offer.

 

We first met Jonathan four years ago when we crossed paths on the MSDN forums where he is a prolific provider of answers (actually the top answerer on the Database Engine forum) and we’ve been casually discussing eventually hooking up at some point for well over a year. This year the stars aligned and Jonathan is making the move, starting with us mid-March.

Jonathan is the world authority on Extended Events in SQL Server 2008—he wrote the definitive whitepaper on Extended Events for Microsoft and he authored the Extended Events Manager SSMS add-in, which won the PASS 2008 SQL Hero contest. He has a thirst for knowledge and digging into the guts of SQL Server that’s truly inspiring, and he constantly impresses us with how much he knows.

[Edit 02/24/11: Jonathan just passed the MCM certification – congratulations!!!]

Jonathan is a performance tuning expert, both SQL Server and hardware, and has architected complex systems as a developer, business analyst, and DBA. This breadth of role experience, along with extensive development (T-SQL, C#, and ASP.Net), hardware and virtualization design expertise, and Windows, Active Directory, IIS, and other component knowledge makes Jonathan an incredible addition to our team. He has also presented at multiple conferences including PASS, SQLBits, and VMware Open Forum, and will be presenting at PASS SQL Rally and SQL Connections this Spring.

Jonathan is also a Drill Sergeant in the U.S. Army Reserves, and did a tour of duty in Iraq in 2004. He’s married with two small children and describes himself as a ‘general goofball’—he’s going to fit right in :-) In all seriousness though, his dedication to his family, his country and his work is without doubt extraordinary.

You can follow Jonathan on twitter and he'll be moving his blog over to SQLskills.com very soon.

One of the most interesting aspects about Jonathan being a real employee rather than a partner consultant is that we’re not limited to only working on client projects. We’re going to be working on research, benchmarks, tools, and other really useful things for the SQL community, our classes, and our clients. We’re already planning resources that will be available exclusively to our SQLskills.com community. Join now!

As you can tell, we’re really looking forward to Jonathan starting to work at SQLskills.com and we’re planning great things to come plus more additions to the team. Watch this space…

Thanks as always,

Paul and Kimberly