5/11/09: Little bit of a rewrite today as it seems some people are taking what I'm saying the wrong way.
Over on Ed Bott's blog last week, he posted some inside info about the notorious "SQL Server problem" that caused the Windows 7 RC downloads to be so slow. The reason given was that SQL Server fragmentation caused the server to spike CPU, and an index rebuild fixed the problem. Then the SQL CAT team posted that the problem was (paraphrasing) that the download team had only planned for a 100% jump in traffic, and got a 500% jump. The server (me: for a so-far undisclosed reason) couldn't handle the load, so a new server was dropped in and (me: this is interesting) operational practices were changed. Today (5/11/09) I find out that there's an ongoing investigation into just why the server couldn't handle the load. Notice I'm not saying the SQL Server was deficient in any way, but that the server as a whole couldn't handle the load.
Based on the initial information, the CAT team not stating what the actual problem was, and the fact that they had to change an operational practice (e.g. more frequent index rebuilds to remove fragmentation), we (Kimberly and I) came up with a hypothesis as to what might have happened. Remember that saying the server wasn't set up for a 500% jump in load could mean a very large number of things. It could mean that the raw number of inserts overloaded the hardware on the server. It could mean that the application logic didn't scale to the extra workload. It could mean that the database schema didn't scale to the extra workload. The latter is the one we're hypothesizing.
Whatever the actual technical issue was, there are two real problems: 1) the lack of planning – whether in terms of hardware capacity, application logic, schema design, whatever – and 2) the way the problem was handled by the MS PR folks, which not only made out that it was a SQL problem, but made people like me think that of-course-it-wasn't-a-SQL-problem, but it must have been a design/schema problem.
I discussed this with Kimberly and here's what we think happened (from her initial idea). There's been some discussion on how simply downloading a package could cause fragmentation – it's just a SELECT, right? Wrong. If you look at what happens when you download, there's a GUID (Globally Unique IDentifier) that gets generated for your download, and there must be a table that tracks downloads, so the GUID is entered into that table. The table most likely has a clustered index, with the random GUID value as the primary key (or the high-order key in a composite key).
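To see why a random GUID key matters, here's a minimal Python sketch (the actual schema hasn't been disclosed, so this is purely illustrative): it generates random GUIDs, keeps them in sorted order to stand in for clustered-index key order, and counts how many new keys land somewhere in the middle of the existing key order rather than at the end.

```python
import uuid
from bisect import bisect_left, insort

# Sorted list stands in for the clustered index's key order.
keys = []
midpoint_inserts = 0
for _ in range(1000):
    k = uuid.uuid4().bytes          # a random GUID, like SQL Server's NEWID()
    pos = bisect_left(keys, k)      # where this key sorts among existing keys
    if pos < len(keys):             # anywhere but the very end
        midpoint_inserts += 1
    insort(keys, k)

# With random GUIDs, almost every insert lands mid-index; an
# ever-increasing key (e.g. an IDENTITY column) would always append.
print(f"{midpoint_inserts} of 1000 inserts landed mid-index")
```

Run it and you'll see the overwhelming majority of inserts land in the middle of the key order – which is exactly what forces SQL Server to make room inside already-populated index pages.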
Every download thus produces an insert into the clustered index at a random location in the index. As the index pages get full, they eventually need to split so that more inserts can occur at the points defined by the random key. These 'page splits' are very expensive, causing lots of IO, log record generation, and fragmentation (both logical fragmentation, which hurts range-scan performance, and reduced page density, from pages becoming less full). All of this takes CPU, which explains some of the spiked CPU from all the downloads. If there are queries scanning this data too, the fragmentation would cause terrible performance issues, resulting in more, smaller IOs – and also adding to the CPU load. Rebuilding the index (as was explained in the original blog post) would remove the existing fragmentation, increasing performance again, but wouldn't remove the page-split issue – which is why (apparently) they're going to rebuild that index every night now.
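The page-split effect can be simulated too. This toy model (a sketch, not how SQL Server is implemented – real pages are 8KB and the storage engine is far more sophisticated) inserts keys into small fixed-capacity "pages" kept in key order, splitting a full page 50/50 on a mid-index insert and just allocating a fresh page when inserts always land at the right-hand edge:

```python
import uuid
from bisect import bisect_right

PAGE_CAPACITY = 8  # rows per toy page; a real 8KB page holds far more rows


def load(keys):
    """Insert keys in arrival order into sorted, fixed-capacity pages.
    Returns (page_count, split_count, average_page_fullness)."""
    pages, splits = [[]], 0
    for k in keys:
        # find the page whose key range covers k (default to the first page)
        i = next((j for j in range(len(pages) - 1, -1, -1)
                  if not pages[j] or pages[j][0] <= k), 0)
        page = pages[i]
        pos = bisect_right(page, k)
        if len(page) == PAGE_CAPACITY:
            if pos == PAGE_CAPACITY and i == len(pages) - 1:
                pages.append([k])            # cheap append at the end, no split
                continue
            half = PAGE_CAPACITY // 2        # expensive 50/50 page split
            pages.insert(i + 1, page[half:])
            del page[half:]
            splits += 1
            if pos > half:                   # key belongs in the new right half
                i += 1
            page = pages[i]
            pos = bisect_right(page, k)
        page.insert(pos, k)
    fullness = sum(map(len, pages)) / (len(pages) * PAGE_CAPACITY)
    return len(pages), splits, fullness

print("random GUID keys:   ", load([uuid.uuid4().bytes for _ in range(2000)]))
print("ever-increasing key:", load([i.to_bytes(16, "big") for i in range(2000)]))
```

The ever-increasing key produces zero splits and completely full pages; the random GUID key produces hundreds of splits, more pages, and pages that are only around two-thirds full – the reduced page density and extra work described above.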
(This is the 100-level explanation for non-SQL geeks, a more in-depth explanation can be found on Kimberly's blog at GUIDs as PRIMARY KEYs and/or the clustering key.)
One of the comments below, from Andrew Kelly, suggests it could also be an ad-hoc plan problem – again, not a SQL Server issue, but a problem with the application. That could equally be explained by the unplanned workload blowing past capacity.
So was this a SQL Server issue? No. SQL Server did exactly as it was told, and has no choice about where in the index to insert the new records – the GUIDs provide the insertion point. The problem lay with the developer who created that schema without testing it under a very high workload – and with the lack of a database maintenance plan to proactively find and resolve fragmentation issues, etc, etc.
Now, this is pure conjecture, based on the facts that have come out so far. However, nothing that's been said since the issue (by the SQL team or anyone else) has made me question the hypothesis that some design/app issue led to the problem. As I said, capacity planning can mean a huge number of things – but saying that it's a capacity issue certainly doesn't mean it's NOT a schema design/app logic issue.
Once we hear what the root cause of the capacity overload problem was, I'll post more details. In the meantime, of course I'm not making out that it's a SQL Server problem – if you tell SQL Server to (for instance) use random GUIDs as the clustered primary key, under high load it's not going to perform well. Inherent SQL Server problem? No. Human problem? Yes.
I hope this rewritten version is clearer now.