5/11/09: A little bit of a rewrite today, as it seems some people are taking what I'm saying the wrong way.
Last week over on Ed Bott's blog, he showed some inside info about the notorious "SQL Server problem" that caused the Windows 7 RC downloads to be so slow. The reason given was that SQL Server fragmentation caused the server to spike CPU, and that an index rebuild fixed the problem. Then the SQL CAT team posted that the problem was (paraphrasing) that the download team had only planned for a 100% jump in traffic and got a 500% jump. The server (me: for a so-far undisclosed reason) couldn't handle the load, so a new server was dropped in and (me: this is interesting) operational practices were changed. Today (5/11/09) I found out that there's an ongoing investigation into just why the server couldn't handle the load. Notice I'm not saying the SQL Server was deficient in any way, but that the server as a whole couldn't handle the load.
Based on the initial information, the CAT team not stating what the actual problem was, and the fact that they had to change an operational practice (e.g. more frequent index rebuilds to remove fragmentation), we (Kimberly and I) came up with a hypothesis as to what might have happened. Remember that saying the server wasn't set up for a 500% jump in load could mean a very large number of things. It could mean that the raw number of inserts overloaded the hardware on the server. It could mean that the application logic didn't scale to the extra workload. It could mean that the database schema didn't scale to the extra workload. The latter is the one we're hypothesizing.
Whatever the actual technical issue was, there are two real problems: 1) the lack of planning, whether in terms of hardware capacity, application logic, schema design, whatever; and 2) the way the problem was handled by the MS PR folks, which not only made it sound like a SQL problem, but also made people like me think that of course it wasn't a SQL problem – it must have been a design/schema problem.
I discussed this with Kimberly and here's what we think happened (from her initial idea). There's been some discussion about how simply downloading a package could cause fragmentation – it's just a SELECT, right? Wrong. If you look at what happens when you download, a GUID (Globally Unique IDentifier) gets generated for your download, and there must be a table that tracks downloads, so the GUID is inserted into that table. The table most likely has a clustered index, with the random GUID value as the primary key (or the high-order key in a composite key).
Every download thus produces an insert into the clustered index at a random location in the index. As the index pages get full, they eventually need to split so that more inserts can occur at the points defined by the random key. These 'page splits' are very expensive, causing lots of I/O, log record generation, and fragmentation (both logical fragmentation, which hurts range-scan performance, and reduced page density from pages becoming less full). All of this takes CPU, which explains some of the spiked CPU from all the downloads. If there are queries scanning this data too, the fragmentation would cause terrible performance issues, resulting in more, smaller I/Os – and also adding to the CPU load. Rebuilding the index (as was explained in the original blog post) would remove the existing fragmentation, increasing performance again, but wouldn't remove the page-split issue – which is why (apparently) they're going to rebuild that index every night now.
(This is the 100-level explanation for non-SQL geeks, a more in-depth explanation can be found on Kimberly's blog at GUIDs as PRIMARY KEYs and/or the clustering key.)
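To make the hypothesis concrete, here's a rough sketch of the pattern we're describing. The table and column names are purely made up for illustration, but the shape (a random GUID as the clustered primary key) is the point, along with how you'd see the resulting fragmentation and the nightly rebuild that was described:

-- Hypothetical table shape only – names invented for illustration
CREATE TABLE DownloadTracking (
    DownloadID   UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),  -- random value, so inserts land all over the index
    DownloadDate DATETIME         NOT NULL DEFAULT GETDATE(),
    PackageID    INT              NOT NULL,
    CONSTRAINT PK_DownloadTracking PRIMARY KEY CLUSTERED (DownloadID)
);

-- After a load spike, check logical fragmentation and page density
SELECT index_id, avg_fragmentation_in_percent, avg_page_space_used_in_percent, page_count
FROM sys.dm_db_index_physical_stats (DB_ID(), OBJECT_ID('DownloadTracking'), NULL, NULL, 'DETAILED');

-- The nightly 'fix' described in the original post: rebuild to remove the existing fragmentation
ALTER INDEX PK_DownloadTracking ON DownloadTracking REBUILD;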
One of the comments below, from Andrew Kelly, suggests it could also be an ad-hoc plan problem – again, not a SQL Server issue, but a problem with the application. That could equally be explained by the unplanned workload blowing through capacity.
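If that's part of it, it would show up as lots of single-use plans in the plan cache. A quick sketch of how you'd check (standard DMVs, nothing specific to the download system):

-- Single-use ad-hoc plans bloating the plan cache – lots of these, differing only
-- by a GUID literal, would point at the application not parameterizing its queries
SELECT cp.usecounts, cp.size_in_bytes, st.text
FROM sys.dm_exec_cached_plans AS cp
CROSS APPLY sys.dm_exec_sql_text (cp.plan_handle) AS st
WHERE cp.objtype = 'Adhoc' AND cp.usecounts = 1
ORDER BY cp.size_in_bytes DESC;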
So was this a SQL Server issue? No. SQL Server did exactly as it was told, and has no choice about where in the index to insert the new records – the GUIDs provide the insertion point. The problem was with whoever created that schema without testing it under a very high workload – and with the lack of a database maintenance plan to proactively find and resolve fragmentation issues, etc., etc.
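For comparison, here's one way the same hypothetical table could have been designed to avoid the random insert pattern – again just a sketch: keep the GUID for the application, but cluster on an ever-increasing key (which is also what one of the commenters below suggests):

-- Same hypothetical table, redesigned: the GUID stays as the (nonclustered) primary key,
-- but the clustered index is on an ever-increasing identity, so new rows go to the end
CREATE TABLE DownloadTrackingAlt (
    DownloadSeq  BIGINT           NOT NULL IDENTITY (1, 1),
    DownloadID   UNIQUEIDENTIFIER NOT NULL DEFAULT NEWID(),
    DownloadDate DATETIME         NOT NULL DEFAULT GETDATE(),
    PackageID    INT              NOT NULL,
    CONSTRAINT PK_DownloadTrackingAlt PRIMARY KEY NONCLUSTERED (DownloadID)
);
CREATE UNIQUE CLUSTERED INDEX CIX_DownloadTrackingAlt ON DownloadTrackingAlt (DownloadSeq);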
Now, this is pure conjecture, based on the facts that have come out so far. However, nothing that's been said since the issue (by the SQL team or anyone else) has made me question the hypothesis that some design/app issue led to the problem. As I said, capacity planning can mean a huge number of things – but saying that it's a capacity issue certainly doesn't mean it's NOT a schema design/app logic issue.
Once we hear what the root cause of the capacity overload was, I'll post more details. In the meantime, of course I'm not making out that it's a SQL Server problem – if you tell SQL Server to (for instance) use random GUIDs as the clustered primary key, then under high load it's not going to perform well. Inherent SQL Server problem? No. Human problem? Yes.
I hope this rewritten version is clearer now.
10 thoughts on “Why did the Windows 7 RC download failure happen?”
Interesting idea, not sure I can disagree. However, while the developer should have been smarter, a DBA could have pointed out that one could leave the GUID as the PK, but add a dummy auto-number column as the clustering key. The QA person could have developed/executed a better test plan. And the project manager could have made sure it all happened. So I wouldn’t lay it at the app-dev’s feet. But I agree w/"human error"
In addition to fragmentation, I bet the queries were ad hoc and they were also slammed by a lack of plan reuse due to the GUIDs. I saw this happen several years ago, where a GUID with a clustered PK on an ad-hoc query caused an 8-proc server to choke and die from a simple login page. The page splitting was certainly part of the trouble, but the cache got choked on all the queries that were very similar except for the GUID. Caused massive CPU from all the compiling and searching of the proc cache.
Agreed – but assuming there was a DBA, tester, project manager… :-)
Hi Paul,
Excellent assumption – but you must have access to find out what really happened at MSFT… don't you?
In this case I don’t have any inside info.
Just some thoughts:
If this is true, why would they use a random GUID as the clustered index key for a table that would be primarily write-once? Why would someone be afraid of a sequential key?
If a non-sequential key were being used for some reason, wouldn't it be best as a nonclustered index?
I'm not sure why Microsoft hasn't thought of this route to generate sequential IDs on the fly rather than as default values on table entries. I can't lay claim to this technique though. It was provided to me by the good folks at sqldev.net:
-- NEWSEQUENTIALID() can only be used in a DEFAULT constraint, hence the temp-table trick
create procedure getSeqGUID
as
    create table #tempGUID (myguid uniqueidentifier default newsequentialid());
    insert into #tempGUID default values;   -- fires the default, generating the next sequential GUID
    select myguid from #tempGUID;           -- return it to the caller as a result set
    drop table #tempGUID;
GO
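(And a minimal way to call it from T-SQL and capture the value, assuming the procedure above:)

-- Capture the result set returned by the procedure into a table variable
declare @guids table (seqguid uniqueidentifier);
insert into @guids exec getSeqGUID;
select seqguid from @guids;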
Because using the client-side GUID as the server-side primary key is a *very* common bad practice. The app dev doesn't realize the effect that has on the underlying storage structure. SharePoint does the exact same thing, as do many, many others.
Do you have any more in-depth details on this issue now?
Nope – I don’t imagine any more info will be revealed.