Thursday, January 31, 2008

In a previous post (see here) I debunked a myth about how much transaction log a full backup would include. I had a question in the blog post comments that asked (paraphrasing):

The full backup has to include all the transaction log from the begin LSN of the oldest active transaction at the time the data read portion of the backup ends, until the LSN at which the data read portion ends. If that begin LSN is later in time than the LSN of the checkpoint that backup does initially, why does the full backup need to include all thr transaction log between the checkpoint and the begin LSN? What is it used for?

I replied in the comments with a quip that it would be easier to reply with a whiteboard and a timeline - so I got all enthusiastic and created a picture in Powerpoint to help explain better.

Consider the timeline in the picture above for a full backup (the red numbers match the list below):

  1. The backup operation take a checkpoint to force all dirty pages in the buffer pool to disk - both those containing changes from transactions that have committed and those containing changes from transactions that are still in-flight. The backup operation then starts reading the allocated pages in the database.
  2. The read operation reads page X
  3. Transaction A starts
  4. Transaction A makes a change to page X. The copy in the backup is now out-of-date. Note that the backup will not read page X again - it's already passed that point in the database.
  5. Transaction B starts. It won't complete before the data read operation completes so it's begin LSN is the oldest active transaction begin LSN.
  6. Transaction A commits. This commits the changes to page X.
  7. The backup data read operation completes and transaction log reading starts.

Now, the reason that the transaction log is read is so that the restore operation can recover the database so it is transactionally consistent as of the point in time when the read data operation completed.

If the transaction log was only included from the oldest active transaction begin LSN (point 5), then the copy of page X that was restored from the backup (read at point 2) would not be updated with the changes from transaction A (that happened at point 4). This means that it would not be transactionally consistent with the rest of the database as of the time the read data operation completed (point 7).

So, (ignoring replication) the minimum LSN of the transaction log that's included in the full backup is MIN (LSN of last checkpoint, LSN of oldest active transaction). This ensures that recovery can REDO log records to bring pages up-to-date and UNDO log records for transactions that had not committed.

Much easier to explain with aid of a picture than without! :-)

Thursday, January 31, 2008 12:01:58 AM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 
Tuesday, January 29, 2008

Well, I bet that title grabbed a bunch of people's attention :-)

In the Beijing Advanced Research Centre offices of Microsoft where we're teaching this week, the bathroom stalls have print-outs of jokes on them. The ones in the women's bathroom are pretty good - here are a couple of them that made us laugh (thanks to Kimberly for doing the research!)

#1: A man has a girlfriend, Lorraine, who's he's been dating for a while. He's pretty happy. Into his life suddenly comes a new woman, called Clearly, who is better than Lorraine in many ways - she knows the best place in town to visit, to eat, and to dance. Unfortunately he doesn't want to give up on Lorraine though. One day they're walking along the banks of a river when Lorraine falls in and is swept away. The man starts singing 'I can see Clearly now Lorraine has gone...'

#2: A man and his wife are lying in bed, just about to fall asleep. The conversation goes like this:

Wife: 'If I were to die, would you re-marry?'

Husband: 'Of course not!'

Wife: 'Why? Don't you like being married?'

Husband: 'Well, yes - I suppose I would re-marry'

Wife: 'Would you take down all the pictures of me and replace them with her?'

Husband: 'Well yes, that would be the proper thing to do'

Wife: 'Would you let her use my golf-clubs?'

Husband: 'No, she's left handed... oh shit!'

:-)

Tuesday, January 29, 2008 9:17:51 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 

This is a post I've been meaning to do for a while - detailing some of the problems you can run into when using database snapshots.

Reverting to a snapshot

Many people use snapshots as a way of protecting against mistakes during a complicated set of changes to a database - you can just revert to the snapshot if something goes wrong. However, reverting to a snapshot has a little-known (but documented) problem. Let's see. I'm going to create a database plus a snapshot of it and then start taking backups.

CREATE DATABASE SnapshotTest;
GO
ALTER DATABASE SnapshotTest SET RECOVERY FULL;
GO
BACKUP DATABASE SnapshotTest TO DISK = 'C:\SQLskills\SnapshotTest.bak' WITH INIT;
GO

CREATE TABLE SnapshotTest.dbo.MyTable (c1 INT);
GO

BACKUP LOG SnapshotTest TO DISK = 'C:\SQLskills\SnapshotTest_log.bak' WITH INIT;
GO

-- Imagine a bunch of things happen here

-- Create the snapshot database, by first checking which files exist
SELECT * FROM SnapshotTest.sys.database_files;
GO

CREATE DATABASE ST_Snap ON
(NAME = SnapshotTest, FILENAME = 'C:\SQLskills\SnapshotTest_snap.snp')
AS SNAPSHOT OF SnapshotTest;
GO

Now I'm going to do some stuff that's a mistake:

DROP TABLE SnapshotTest.dbo.MyTable;
GO

Not a problem as I can revert to my database snapshot:

RESTORE DATABASE SnapshotTest FROM DATABASE_SNAPSHOT = 'ST_Snap';
GO

Oops! I forgot to take a log backup to capture everything that happened since the last log backup...

BACKUP LOG SnapshotTest TO DISK = 'C:\SQLskills\SnapshotTest_log.bak';
GO
Msg 4214, Level 16, State 1, Line 1
BACKUP LOG cannot be performed because there is no current database backup.
Msg 3013, Level 16, State 1, Line 1
BACKUP LOG is terminating abnormally.

Reverting to a snapshot rebuilds the transaction log and breaks the log backup chain. The only thing I can do is take a full or differential backup and then start taking log backups again. So - I lost the ability to do point-in-time restores in the period from the last log backup to the time when I reverted to the snapshot.

So how can I tell when the database was reverted? Books Online documents that the restorehistory table in msdb should have an entry with restore_type = 'R'. Let's try:

SELECT * FROM msdb.dbo.restorehistory WHERE [destination_database_name] = 'SnapshotTest';
GO

Nope - nothing. That functionality seems to be broken. The only way I could find to trace reverting was in the error log:

2008-01-30 11:09:21.73 spid53      Reverting database 'SnapshotTest' to the point in time of database snapshot 'ST_Snap' with split point LSN 26000000013800001 (0x0000001a:0000008a:0001). This is an informational message only. No user action is required.
2008-01-30 11:09:21.74 spid53      Starting up database 'SnapshotTest'.
2008-01-30 11:09:21.74 spid53      The database 'SnapshotTest' is marked RESTORING and is in a state that does not allow recovery to be run.
2008-01-30 11:09:21.85 spid53      Starting up database 'SnapshotTest'.
2008-01-30 11:09:21.87 spid53      Starting up database 'ST_Snap'.

Note that the snapshot database is still there and continues to work.

Running out of space in a snapshot

The next problem that can happen is if you create a database snapshot on a volume that doesn't have much disk space. In that case you can actually run out of space and the snapshot will go suspect and be unusable until it's dropped and recreated. Let's see what happens:

-- Create snapshot of large database on a volume with not much space
SELECT * FROM SalesDB.sys.database_files;
GO

CREATE DATABASE SalesDB_Snap ON
(NAME = SalesDBData, FILENAME = 'D:\sqlskills\SalesDB_snap.snp')
AS SNAPSHOT OF SalesDB;
GO

ALTER INDEX ALL ON SalesDB.dbo.Sales REBUILD;
GO
Msg 5128, Level 17, State 2, Line 1
Write to sparse file 'D:\sqlskills\SalesDB_snap.snp' failed due to lack of disk space.

The query that causes the sparse file to run out of space doesn't fail, but the user running the query will see this error. Note that it doesn't say the snapshot has gone suspect. If you try to use it then it will tell you, plus there info written to the error log:

2008-01-30 11:50:29.14 spid53      Error: 17053, Severity: 16, State: 1.
2008-01-30 11:50:29.14 spid53      D:\sqlskills\SalesDB_snap.snp: Operating system error 112(There is not enough space on the disk.) encountered.
2008-01-30 11:50:29.15 spid53      Error: 3420, Severity: 21, State: 1.
2008-01-30 11:50:29.15 spid53      Database snapshot 'SalesDB_Snap' has failed an IO operation and is marked suspect.  It must be dropped and recreated.

Running out of space while running DBCC CHECKDB

One thing that isn't common knowledge is that DBCC CHECKDB uses database snapshots as it's mechanism for running online in SQL Server 2005. You can't control where the snapshot is created (it's actually created in the same location as the files comprising the database being checked - see my previous post on the DBCC CHECKDB stages here for more info) but you can create your own snapshot and check that. Anyway - it's possible to run out of space in the snapshot while DBCC CHECKDB is running. I setup a situation with a large database on a drive with only 1MB of space left, started a large index rebuild in the database and then tried to run DBCC CHECKDB. Let's see what happened:

The index rebuild statement:

ALTER INDEX ALL ON SalesDB2.dbo.Sales REBUILD;
GO
Msg 5128, Level 17, State 2, Line 1
Write to sparse file 'd:\sqlskills\salesdbdata.mdf:MSSQL_DBCC20' failed due to lack of disk space.

So some poor unsuspecting user gets this error and has no idea why. For DBAs its a bit obscure too. The syntax is describing an alternate stream on the existing data file and the snapshot in question can't be accessed at all.

The DBCC CHECKDB statement:

DBCC CHECKDB (salesdb2) WITH NO_INFOMSGS;
GO
Msg 1823, Level 16, State 1, Line 1
A database snapshot cannot be created because it failed to start.
Msg 1823, Level 16, State 2, Line 1
A database snapshot cannot be created because it failed to start.
Msg 7928, Level 16, State 1, Line 1
The database snapshot for online checks could not be created. Either the reason is given in a previous error or one of the underlying volumes does not support sparse files or alternate streams. Attempting to get exclusive access to run checks offline.
Msg 5128, Level 17, State 2, Line 1
Write to sparse file 'd:\sqlskills\salesdbdata.mdf:MSSQL_DBCC20' failed due to lack of disk space.
Msg 3313, Level 21, State 2, Line 1
During redoing of a logged operation in database 'salesdb2', an error occurred at log record ID (1628:252:1). Typically, the specific failure is previously logged as an error in the Windows Event Log service. Restore the database from a full backup, or repair the database.
Msg 0, Level 20, State 0, Line 0
A severe error occurred on the current command.  The results, if any, should be discarded.

Wow - major amounts of errors - and it all boils down to running out of space.

Tuesday, January 29, 2008 8:26:11 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2]  | 

This has been causing some problems on the various groups and forums over the last few days so I thought I'd repost this from my old Storage Engine blog. The questions have been around attaching 2005 databases to 2000 servers - even databases that are in 80 compat mode - and it doesn't work. Why?

The confusion is between database compatibility level and database version. Here's a quick explanation of the difference.

Database version

The database version is a number stamped in the boot page of a database that indicates the SQL Server version of the most recent SQL Server instance the database was attached to. The database version number does not equal the SQL Server version. For example, doing the following:

SELECT @@version;
GO

on one SQL Server instance on my laptop returns:

Microsoft SQL Server 2005 - 9.00.3054.00 (Intel X86)   Feb 13 2007 23:02:48   Copyright (c) 1988-2005 Microsoft Corporation  Developer Edition on Windows NT 5.1 (Build 2600: Service Pack 2)

However, the database version is 611. You can see the database version but if you attach a database from an earlier version of SQL Server, you'll see these numbers in the error log as SQL Server reports what upgrade steps its doing. You can also see by doing the following:

USE master;
GO

SELECT DatabaseProperty ('dbccpagetest', 'version');
GO

Some things to note about database version:

  • SQL Server is not up-level compatible. You cannot attach a database that was created on (or has been upgraded to) SQL Server 2005 to any earlier version of SQL Server (also true for trying to attach a 2000 database to 7.0, and so on).
  • You cannot attach a database that was created on an earlier version without going through the proper upgrade procedures. Forcibly attaching a database using various hacky methods will result in all kinds of weird errors, and possibly crashes.

Database compatibility level

The database compatibility level determines how certain database behaviors work. For instance, in 90 compatibility, you need to use the OUTER JOIN syntax to do an outer join, whereas in earlier compatibility levels, you can use '*=' and '=*'. Contrary to popular myth, all of the behavioral differences ARE documented - in the Books Online section for sp_dbcmptlevel - the SP used to set the compatibility level.

There are 5 supported compatibility levels support by SQL Server 2005:

60 = SQL Server 6.0

65 = SQL Server 6.5

70 = SQL Server 7.0

80 = SQL Server 2000

90 = SQL Server 2005

You can see the compatibility level of all databases by doing:

SELECT name AS 'DB Name', compatibility_level AS 'Compatibility Level'
FROM master.sys.databases;
GO

Some things to note about compatibility levels:

  • A database created on SQL Server 2005 will have a default compatibility level of 90, unless the model database has a different compatibility level, in which case the new database inherits the compatibility level of model.
  • New features may work under older compatibility levels but beware of SET options.
  • An upgraded database retains its compatibility level. For example, a database that was created on SQL Server 2000, and didn't have its compatibility level altered, will stay in 80 compatibility level when its upgraded to SQL Server 2005.

Summary

This was just a quick - and by no means comprehensive - explanation of the difference between the two terms. Basically, there's no relationship between them.

Tuesday, January 29, 2008 3:51:14 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Monday, January 28, 2008

A bit more traffic on the thread (see previous post here) prompted me to give my thoughts on the many sweeping generalizations that plague the computer industry and make it difficult sometimes to give advice in forums and blogs. I'd like to repost here (with a few tweaks for clarity).

Some examples of questions that breed sweeping generalizations:

  • Should you have clustered indexes on all tables? The well-known clustered-index debate as Kimberly likes to call it.
  • Should you rebuild or reorganize indexes to remove fragmentation?
  • Which high-availabilty solution should you use?

The problem - as with most advice - is that it's extremely hard to make generalizations. This is both because:

  1. without lots of evidence many people (quite rightly) don't believe sweeping generalizations as they may have been bitten by one in the past
  2. nearly every situation is different so many generalizations are useless

What I'd love to see, (and I tried to do this when at MS, and like to think I do it here or when teaching classes or conferences) is for people to provide the justification for generalizations, plus some idea of the exceptions and the circumstances under which they arise.

As for this case (whether to create multiple files because there are multiple cores/CPUs), I think we've about done this one to death. The sweeping generalizations here are:

  1. for non-tempdb you usually don't need multiple files, unless you have a very high-end workload of the specific nature I described in my first post (rare)
  2. for tempdb you usually do, as long as your workload merits it on a multi-core/cpu box
  3. IO vendors may recommend it for increased IO throughput *on their specific hardware*
  4. there exist sweeping generalizations from various sources that dispute all of the above

Unfortunately, you're not going to get a definitive, authoritative answer to a design/strategy question such as this and you'll continue to find contradictions to anything anyone says on the forums, and even MS contradicting itself (sigh).

What I would suggest is the following:
1) go with the majority opinion of responses to questions asked, based on the respondents collective experience with many customers, databases, and workloads
2) do your own testing, on your own hardware, with your own workload and see what works for you (but beware that testing in a vacuum can prove or disprove anything you want - which is why you see so many contradictory statements)

One last thing on MS - it's a very big company, with lots of groups. Anyone can sponsor a whitepaper, write a blog post/MSDN article/technet article and publish it, or reply on a forum as a visible MS person and it has the 'official stamp' of coming from MS. When I was in the product group I was continually dismayed by the misinformation, bad advice, contradictions, and baseless assertions that I saw coming from MS employees who were just trying to be helpful.

Once something's published on the internet, it's *incredibly* hard to undo the damage done. There's a fundamental element of mistrust sometimes on forums and newsgroups which can be wearying when you're trying to help people out. It can be very hard to convince people that someone else's advice isn't the best to follow - I remember several times arguing with people about how CHECKDB works or what a corruption error message means and finally having to resort to 'I wrote that code - I'm afraid you *are* wrong' - which I really hate doing.

Anyway - rant over :-)

Monday, January 28, 2008 9:35:24 PM (Pacific Standard Time, UTC-08:00)  #    Comments [5]  | 

There's been a very interesting discussion going on over at SQLServerCentral.com about whether to create multiple files for a user database because the server has multiple CPUs - see the thread here. I wrote a couple of long responses during the thread that I wanted to duplicate here as I think it's of broad interest.

My first response was:

Doesn't make sense to split any database into multiple files for per-cpu performance, except tempdb which can suffer from multiple-CPUs all trying to modify the same allocation bitmaps under high load with small temp tables being created/deleted (see http://www.microsoft.com/technet/prodtechnol/sql/2005/workingwithtempdb.mspx for details).

Now, saying that, there's an exception - and that's when a non-tempdb database has the same allocation bitmap contention issues, but that only happens on the very high-end with thousands of single-row inserts per second happening on each CPU. It's pretty rare. I've never seen it but Kimberly has.

What you should be considering is splitting a database into filegroups to be able to isolate critical tables within separate filegroups, or splitting a partitioned table across multiple filegroups. For SS2005, these aren't for performance but instead for manageability (e.g. being able to restore a single filegroup offline while the rest of the database remains online, or setting several partitions of a table to be read-only). For SS2008, there are improvements to query plan generation over multiple partitions that could allow for better parallelism, but its still not optimal compared to non-partitioned data.

So - complex topic and these are simple guidelines. Hope they help.

This was followed by a discussion pointing out that various hardware vendors imply the opposite, as do several MS sources. So here was my latest response:

Hi folks,

What's really not made clear in the MS literature is that the multiple-files in a database really only applies to tempdb. Even on tempdb, on 2005 you don't need one file per core - more like 1/4 -1/2 the number of files as there are cores.

The tempdb problem is this - common workloads create and drop many worktables per second. The allocations that happen for a table are initially single-pages (as opposed to whole extents). This means that a search of the SGAM page for the 4GB interval needs to be done to find a mixed-extent with a free page to allocate. Multiple CPUs all hammering this page cause contention on it and performance problems. Then, a page needs to be allocated for the first IAM page - same thing happens. Then these pages need to be marked allocated in a PFS page - same thing happens. And then these pages need to inserted into the sysindexes row for the tabel - more contention. On 2000 this was particularly bad - so T1118 plus multiple files was the solution, where SQL Server would round-robin the single page allocations in the files in tempdb, alleviating the contention somewhat.

In SQL Server 2005, we changed the temp table mechanism so that whenever a temp table is dropped, one data page, one IAM page, and the system table entries (no longer sysindexes, but instead is a 'hidden' table called sys.allocation_units) are cached. When a new temp table is allocated, if there's a cached 'template temp table' it is picked up and used without so much contention on the various allocation bitmaps. On a heavily loaded system there can still be contention and so you still need multiple files for an SMP box, but just not so many. And you don't need T1118 any more.

So - this is more prevalent on tempdb, but CAN happen on a user database under extreme load on monster hardware. Testing should show whether this is happening to you - if not, don't create multiple files for performance.

Now, in terms of what works for your particular vendors IO scalability - that's beyond me and you may need to think about that if its something they recommend. However - I'd still take it with a pinch of salt and do your own testing. See the whitepaper at http://www.microsoft.com/technet/prodtechnol/sql/bestpractice/pdpliobp.mspx for some testing/banchmarking info.

Hope this helps clarify a bit more - great discussion!

I'm interested to hear any other opinions on this - especially cases where you've found it necessary to create multiple files for performance.

Thanks!

Monday, January 28, 2008 6:10:21 PM (Pacific Standard Time, UTC-08:00)  #    Comments [3]  | 

This kind of follows on from my previous post about making sure you have character column lengths that can handle data from different countries (e.g. city names that may be longer in one country than another). A question on the forums today asked what info there is available to help in designing a global-ready database.

It turns out that there's a wealth of information right there under your nose - type in 'globalization' in the Index of Books Online. It'll get you to a section titled 'International Considerations for SQL Server' that has a link to a sub-section for every component of SQL Server! Very impressive. For instance, the one for the Database Engine has everything you need (I've made these links to the latest online BOL entries on MSDN):

Check it out and save yourself some pain when your database/application suddenly needs to support customers outside your home country.

Monday, January 28, 2008 12:33:25 AM (Pacific Standard Time, UTC-08:00)  #    Comments [2]  | 
Sunday, January 27, 2008

Microsoft has announced that the RTM date of SQL Server 2008 has slipped out to Q3 of this year. See this official blog post here. The official launch will still go ahead as planned on February 27th.

While some people may see this as disappointing, I don't see many large customers going straight into SS2008 when it comes out so I, for one, am glad they're taking the time to ensure a high-quality release.

Sunday, January 27, 2008 5:59:02 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 

Start of the second week in China. We flew up from Shanghai to Beijing yesterday, and just in time it seems. As we flew out of Shanghai it was being enveloped in a snowstorm. Beijing's much colder than Shanghai though - being further north and inland quite a way compared to Shanghai.

As I said a few days ago, the last time I was here I stayed in the Jiu Hua International Conference & Exhibition Center Hotel, which has to be one of the worst hotel experiences ever. I remember going to the buffet breakfast one day and picking up a pork bun to find some dead moths stuck to the bottom - that was my breakfast appetite gone for the day. Whenever I had laundry done they demanded cash before they'd give it back to me - no such thing as charging to the room. This time we're in the Grand Hyatt Beijing which is right in the middle of the city and is far more pleasant. Our room's not so high as in Shanghai as the height of buildings in the old part of Beijing is limited. Here's the view from our room this afternoon. (Click on it for a bigger version)

It's looking out on the Peking Union Medical College Hospital, formerly the Yuwang Residence (what our guide said was a house/palace for a prince). The hotel's just a few blocks from the Forbidden City and Tiananmen Square, and we can see into the Forbidden City (to the left of the view above). We're going to do a tour of these on Friday, along with the stunning Temple of Heaven, before catching the late flight back to Vancouver and on to Seattle.

Today was a free day for us so we decided to take a tour out to the Great Wall - something we were both extremely excited to do and somewhere I've wanted to go since I was a child. We decided to go to the Mutianyu section about an hour outside Beijing as we'd been told it's the best section to see - and boy, where they right! The Wall is just *stunning*. Pictures or words don't do it justice. If you're ever in the area, I urge you to make time to go if it's the only thing you see in Beijing. I really think it has to be one of the most (if not *the* most) incredible things I've ever seen. Wow.

Well, now it's dinner time so I'll leave you with a shot of Kimberly and I on the Great Wall. (Again, click for a larger version.) The wall actually continues on over to the mountains at the top left of the picture - apparently some of the sections of wall are so steep and remote that they haven't been visited for a *long* time. It's amazing to think that all the rocks - some weighing up to a ton - were carried up the mountains by hand...

 

Sunday, January 27, 2008 1:36:01 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Saturday, January 26, 2008

It seems like all I've been talking about on the forums the last couple of days is the correct order of operations in a maintenance plan. The biggest confusion seems to be about the effect of rebuilding an index on statistics, with some incorrect advice being given out on what to do.

Rebuilding an index will update statistics with the equivalent of a full scan - doesn't matter whether you use DBCC DBREINDEX or ALTER INDEX ... REBUILD, the effect is the same. It can do this because it sees a complete picture of the index while its doing the rebuild.

Reorganizing an index (using the old DBCC INDEXDEFRAG I wrote, or the new ALTER INDEX ... REORGANIZE) will NOT update statistics at all, because it only sees a few pages of the index at a time.

The problem I've been seeing is people rebuilding indexes and then updating statistics. So why is this a problem? Well, it depends :-)

  • If your default for updating statistics is to do a sampled scan, then having rebuild update the statistics with a full scan (as a side effect), and then proceeding to manually update them again with the default sampled scan, means that after both operations you're left with sampled statistics. You've wasted resources doing the sampled scan AND you've lost the 'free' full-scan statistics that the index rebuild did for you.
  • If your default is to do a full scan, then you don't lose out on having the best statistics, but you do waste resources by unnecessarily updating statistics a second time.

So what's the solution?

The simple answer is not to update statistics on indexes that have just been rebuilt.

The more complicated answer is to:

  1. Have a list of indexes (F) that you know will cause workload performance problems if they get fragmented
  2. Have a list of indexes (S) that you know will cause workload performance problems if the statistics don't get regularly updated
  3. Evaluate fragmentation for list F, and choose to reorganize, rebuild, or do nothing
  4. For all indexes in list S that were not rebuilt in step 3, update statistics

Hope this helps.

Saturday, January 26, 2008 4:01:41 PM (Pacific Standard Time, UTC-08:00)  #    Comments [8]  | 
Friday, January 25, 2008

Well it's the end of our first week in China - tomorrow we fly to Beijing for another week of teaching. Hopefully we'll get to do some sightseeing on Sunday - the last time I was in Beijing (for TechEd '06) the conference hotel was about 25 miles outside the city centre - so I haven't really been to Beijing yet.

During this week I've been playing with FILESTREAM for some demos I'm writing for a Microsoft class I'll be teaching when I get back to Redmond. The class is about SQL Server 2008 for DBAs and the attendees will be a bunch of Microsoft Field personnel and SQL Server MVPs. Anyway, as I was playing, I realized that I hadn't blogged anything about FILESTREAM yet, so this is the start of a series of posts about the feature. First up - what is it?

One problem that SQL Server users face is how to store related structured and unstructured data (BLOBs) together while:

  • Providing a way to keep the data in sync (transactionally consistent)
  • Providing fast streaming access to the BLOBs
  • Keeping costs low
  • Enabling scalability
  • Providing ease of management

There's also been a strong desire for a data type that supports BLOB values greate than 2GB for many years.

Before SQL Server 2008, the solutions centered around:

  1. Storing the BLOBs in the file system
    • Advantages: low cost per GB; great streaming performance
    • Disadvantages: difficult to integrate with structured data; complex app development/deployment
  2. Storing the BLOBs on a dedicated BLOB store
    • Advantages: good scale/expandability; cost decreases as scale increases
    • Disadvantages: difficult to integrate with structured data; complex app development/deployment; streaming support/performance is solution dependent
  3. Storing the BLOBs in a database
    • Advantages: integrated management; BLOBs integrated with structured data; easy app development/deployment
    • Disadvantages: high cost per GB; poor streaming performance; 2GB size limit per BLOB

Enter FILESTREAM. It provides the following:

  • BLOB data is stored in the file system but rooted in the database (in the table of which it is part)
  • BLOB data is kept transactionally consistent with structured data
  • BLOB data is accessible through T-SQL and the NTFS Streaming APIs - with great streaming performance
  • BLOB size is limited only by the NTFS volume size
  • Manageability is integrated into SQL Server

Sounds pretty good eh? Well, it is - mostly. There are a few drawbacks with v1 however:

  • Database mirroring cannot be configured on databases with FILESTEAM data
  • Database snapshots don't snapshot FILESTEAM data
  • FILESTREAM data can't be natively encrypted by SQL Server

Personally, I think the first of these could be a major barrier to adoption in the enterprise - hopefully this restriction will be lifted in v2.

So - there's a taster. In the next few posts I'll detail how to enable FILESTREAM and create FILESTREAM data.

Friday, January 25, 2008 1:51:30 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Monday, January 21, 2008

Nothing technical in this blog post for a change, just some info about the first trip of the year for us. We're in China for two weeks, teaching classes at the Microsoft campuses in Shanghai and Beijing. We flew in Sunday afternoon on a direct flight from Vancouver, Canada which was just wonderful. When I came to Shanghai in September 2006 for TechEd I flew to Tokyo first and then on to Shanghai but the direct flight is great. We flew Air Canada and their business class beats the crap out of United's, which we usually fly - much bigger seats with much more legroom.

Last time I was here I stayed at the Le Meridien She Shan which was about 20 miles outside the city center. This time we booked a room at the Grand Hyatt Shanghai which is right in the middle of the city. The hotel is amazing - it starts on the 53rd floor of the Jin Mao Tower in Pudong - the new side of Shanghai. Our room is on the 77th floor looking out over the Huangpu River and over towards The Bund. Here's the view from our room this morning - pretty stunning! (Click on it for a bigger version)

The space-age looking tower on the left is the Oriental Pearl TV Tower - now a famous Shanghai landmark.

Today was our only free day in Shanghai so we decided to sleep in, have breakfast, and then take a private half-day tour of the city. We had a great guide (Marco) who took us round a few cool places and was a fountain of knowledge about Shanghai and Chinese culture. First up was the 400 year old Yuyuan Gardens in the old part of the city (built by a government official as a present to his elderly parents) and then on to the Jade Buddha Temple (built in 1882 to house two solid jade statues brought back by a monk from Burma) - quite different to the Hindu temples we'd seen on our dive trip to Bali at the end of 2006.

Later in the week I've got some good blog posts lined up on new 2008 features. For now, it's dinner time so I'll leave you with a shot of Kimberly and I just before heading into the Yuyuan Garden. (Again, click for a larger version)

 

Monday, January 21, 2008 1:23:25 AM (Pacific Standard Time, UTC-08:00)  #    Comments [1]  | 
Thursday, January 17, 2008

One of the cool features of SQL Server 2008 for developers is spatial data support. There have been some great posts recently about using spatial (see Simon Sabin's blog here), which is all developer stuff - but what I'm interested in are the implications of spatial support for DBAs, and they are focused on spatial indexes

Note: all images in this post are taken from November CTP Books Online

There are two kinds of spatial data that 2008 supports - planar (i.e. points, lines, polygons on a single 2-D plane) and geodetic (i.e. points, lines, polygons on a geodetic ellipsoid - for example, the Earth). These are presented in SQL as geometry and geography data respectively. A common operation that's performed on spatial data is comparing two spatial values to see if they intersect at all. Now, this is a complicated calculation, which gets more computationally expensive as the complexity of the spatial values increase. Given a problem of 'which spatial values in this table does this spatial value X intersect with', it would be great to have some way of quickly pruning out spatial values in the table that cannot possibly intersect with X, and so avoid doing the expensive calculation for them. Enter spatial indexes.

Here's the basic idea behind a spatial index:

  • A plane is broken up into a grid of cells.
  • Each spatial value is evaluated to see which cells in the grid it intersects with
  • The list of cells is stored along with the primary key of the table row that the spatial value is part of
  • Comparing two spatial values for intersection is a matter of comparing the list of grid cells - if there are no matching cells, the spatial values do not intersect, and there's no need to do the expensive intersection calculation

In practice its a bit more complicated. For planar data (i.e. the geometry data type), you need to define a bounding box (i.e. 4 corner points that define a rectangle of space in which you're interested on the 2-D plane). That bounding box on the plane will be broken down into a grid of cells. The top-level grid can be up to 16x16, giving 256 cells. The next level of granularity breaks each of those top-level grid cells into a further grid, again up to 16x16. So now there could be (16x16) x (16x16) cells in the grid - or 65536 cells. This obviously allows a more exact description of a spatial value in the list of cells. And so on and so on. There are actually 4 levels of grid that the bounding box is broken up into - and each can be 16x16, for a possible total of 168 or 4 billion cells. The picture below illustrates this with a grid size of 4x4 at each level.

The bounding box and the size of the grid at each level are specified when the spatial index is created, as well as the maximum number of matching grid cells to store in the spatial index per spatial value - to a max of 8192. Once the bounding box has been decomposed into the various levels of grid, each value in the spatial column is evaluated against the grid. The value is first decomposed against the first level grid. If the number of cells it matches is less than the max per spatial value, the decomposition then moves to the second level grid. This decomposition continues until the maximum number of matching grid cells is reached. If the max is reached while processing a deeper level for a cell, (e.g. in the middle of processing the 2nd level grid of 4x4 for cell #13 in the 1st level), the deeper level matches are thrown away and only the coarser granularity matching cell is stored (e.g. continuing that example, the 2nd level grid matches are discarded and only cell #13 in the 1st level will be stored). The picture below helps to illustrate this.

So, each geometric spatial value is approximated in the spatial index by a list of matching cells in the defined bounding box. As there is a limit to the number of matching cells that can be stored in the approximation, it is an optimistic representation. This means that if two values are compared using the approximations, there will be no false negatives, as the approximations map a larger space than the actual spatial values. There can, however, be false positives. A false (or real) positive means the spatial values need to then be compared using the complex, computationally expensive intersection algorithm using the actual spatial values. So again, the spatial index serves as a way of pruning out the need to run the expensive algorithm.

The algorithm is very similar for geodetic data (i.e. the geography data type), however there's no bounding box. Instead, the entire geodetic ellipsoid is projected onto a 2-D plane and then the grid decomposition algorithm is applied to that plane in exactly the same way as for planar data. The picture below describes how the projection is done.

You may have already realized that the effectiveness of the spatial index in pruning is directly proportional to how exactly the approximations in the index actually describe the spatial values. In other words, the higher the number of grid cells at each level, and the higher the number of grid cell matches that are stored per spatial value in the index, the better the index is at pruning. More exact approximations require storing more matching grid cells at deeper granularities - i.e. taking MORE SPACE. Creating a more exact spatial index takes more space.

With all that in mind, the interesting thing for DBAs here is that there's a trade-off between CPU use to do the real intersection algorithm and spatial index size to use in pruning calls to the algorithm. It's too early to know what best practices there are - but I'll blog them as I here about them.

Thursday, January 17, 2008 11:59:06 AM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Wednesday, January 16, 2008

Despite the fact that I was in the Storage Engine, and there's always been humorous rivalry between the Storage Engine team and the Relational Engine (a.k.a. the Query Processor) team, I did actually get along with some of the QP guys :-)

One of my good friends, Conor Cunningham, has been wanting to get back into blogging and we're extremely pleased that he's now blogging on SQLskills.com - see his new blog at http://www.SQLskills.com/blogs/conor. Before Conor left Microsoft last year to head home to Texas (the Seattle rain gets the non-natives or rain-hardened Scots eventually), he was the Development Lead of the Query Optimizer team and influenced much of the Query Processor architecture. Conor's probably forgotten more about how the Optimizer works than I'll ever know! :-)

So - I for one will be following his blog avidly.

Welcome Conor!

Wednesday, January 16, 2008 9:48:14 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Monday, January 14, 2008

Back in October 2007 I blogged about partition-level lock escalation in SQL Server 2008 (see here) and I promised to do a follow-up once CTP-5 came out with syntax etc. So here it is.

A brief recap - lock escalation in SQL Server 2005 and before only allowed table-level lock escalation. If you have a partitioned table with queries going against different partitions, then table-level escalation is a pain because the whole table is suddenly locked and concurrent queries against distinct partitions can't run. SQL Server 2008 gives the ability to escalate to a parttition lock, which won't affect the queries on the other partitions.

The lock escalation policy can only be set with ALTER TABLE after a table has been created, and the policy can only be set at the table level. The syntax is

ALTER TABLE TableName SET (LOCK_ESCALATION = TABLE | AUTO | DISABLE);

The options mean:

  • TABLE - escalation will always be to the table level. This is the default.
  • AUTO - escalation will be to the partition level if the table is partitioned; otherwise it will be to the table level
  • DISABLE - escalation will be disabled. This does not guarantee that it will NEVER occur - there are some cases where it is necessary (Books Online gives the example of scanning a heap in the SERIALIZABLE isolation level)

The only way I could find to check what the escalation policy for a table is set to is to use the sys.tables catalog view:

SELECT lock_escalation_desc FROM sys.mytables WHERE name = 'TableName';

Let's try it out. Here's a script that creates a database with an example table with 3 partitions. The partition ranges are negative infinity to 7999, 8000 to 15999, 16000 to positive infinity.

CREATE DATABASE LockEscalationTest;
GO

USE LockEscalationTest;
GO

-- Create three partitions: -7999, 8000-15999, 16000+
CREATE PARTITION FUNCTION MyPartitionFunction (INT) AS RANGE RIGHT FOR VALUES (8000, 16000);
GO

CREATE PARTITION SCHEME MyPartitionScheme AS PARTITION MyPartitionFunction
ALL TO ([PRIMARY]);
GO

-- Create a partitioned table
CREATE TABLE MyPartitionedTable (c1 INT);
GO

CREATE CLUSTERED INDEX MPT_Clust ON MyPartitionedTable (c1)
ON MyPartitionScheme (c1);
GO

-- Fill the table
SET NOCOUNT ON;
GO

DECLARE @a INT = 1;
WHILE (@a < 17000)
BEGIN
INSERT INTO MyPartitionedTable VALUES (@a);
SELECT @a = @a + 1;
END;
GO

Now I'm going to explicitly set the escalation to TABLE and start a transaction that should cause lock escalation.

ALTER TABLE MyPartitionedTable SET (LOCK_ESCALATION = TABLE);
GO

BEGIN TRAN
UPDATE
MyPartitionedTable SET c1 = c1 WHERE c1 < 7500;
GO

We should be able to see the locks being held:

SELECT [resource_type], [resource_associated_entity_id], [request_mode],
[request_type], [request_status] FROM sys.dm_tran_locks WHERE [resource_type] <> 'DATABASE';
GO

resource_type   resource_associated_entity_id request_mode   request_type   request_status
--------------- ----------------------------- -------------- -------------- ----------------
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
OBJECT          2105058535                    X              LOCK           GRANT

Just as we expected - an X table lock. Trying any query against the table fails now. Now I'll rollback that transaction, set the escalation to partition-level and try again.

ROLLBACK TRAN;
GO

ALTER TABLE MyPartitionedTable SET (LOCK_ESCALATION = AUTO);
GO

BEGIN TRAN
UPDATE
MyPartitionedTable SET c1 = c1 WHERE c1 < 7500;
GO

SELECT [partition_id], [object_id], [index_id], [partition_number]
FROM sys.partitions WHERE object_id = OBJECT_ID ('MyPartitionedTable');
GO

SELECT [resource_type], [resource_associated_entity_id], [request_mode],
[request_type], [request_status] FROM sys.dm_tran_locks WHERE [resource_type] <> 'DATABASE';
GO

partition_id         object_id   index_id    partition_number
-------------------- ----------- ----------- ----------------
72057594039042048    2105058535  1           1
72057594039107584    2105058535  1           2
72057594039173120    2105058535  1           3

resource_type   resource_associated_entity_id request_mode   request_type   request_status
--------------- ----------------------------- -------------- -------------- ----------------
HOBT            72057594039042048             X              LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
OBJECT          2105058535                    IX             LOCK           GRANT

Excellent - the object lock is now IX rather than X, and the X lock is at the partition (HOBT) level for partition 1 (see the bold highlighting to match the partition ID with the lock resource). (For an explanation of HOBTs, see my post Inside The Storage Engine: IAM pages, IAM chains, and allocation units.) So now we should be able to do something with another partition - let's see if we can cause another partition level X lock in another connection:

USE LockEscalationTest;
GO

BEGIN TRAN
UPDATE
MyPartitionedTable set c1 = c1 WHERE c1 > 8100 AND c1 < 15900
;
GO

SELECT [partition_id], [object_id], [index_id], [partition_number]
FROM sys.partitions WHERE object_id = OBJECT_ID ('MyPartitionedTable'
);
GO

SELECT [resource_type], [resource_associated_entity_id], [request_mode],
[request_type], [request_status] FROM sys.dm_tran_locks WHERE [resource_type] <> 'DATABASE'
;
GO

partition_id         object_id   index_id    partition_number
-------------------- ----------- ----------- ----------------
72057594039042048    2105058535  1           1
72057594039107584    2105058535  1           2
72057594039173120    2105058535  1           3

resource_type   resource_associated_entity_id request_mode   request_type   request_status
--------------- ----------------------------- -------------- -------------- ----------------
HOBT            72057594039107584             X              LOCK           GRANT
HOBT            72057594039042048             X              LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
METADATA        0                             Sch-S          LOCK           GRANT
OBJECT          2105058535                    IX             LOCK           GRANT
OBJECT          2105058535                    IX             LOCK           GRANT

Now we have two partition X locks, for partitions 1 and 2 (as expected - use the color coding above to match up the IDs), plus two table-level IX locks (one for each  connection, as expected). Very cool!

Now I'm going to force a deadlock - by having each connection try to read a row from the other locked partition:

Connection 1:

SELECT * FROM MyPartitionedTable WHERE c1 = 8500;
GO

Conneciton 2:

SELECT * FROM MyPartitionedTable WHERE c1 = 100;
GO

Connection 2 succeeds but on connection 1 we get (as expected):

(local)\SQLDEV01(SQLHAVPC\Administrator): Msg 1205, Level 13, State 18, Line 1
Transaction (Process ID 51) was deadlocked on lock resources with another process and has been chosen as the deadlock victim. Rerun the transaction.

This illustrates a potential problem with this new mechanism - applications that used to rely on the blocking nature of X table locks may now exhibit deadlocks if partition-level escalation is turned on in production without any testing. In fact, this mode was specifically chosen NOT to be the default setting for new tables because some trial workloads exhibited deadlocks during testing. Don't just turn it on in production without testing - as with any other option or feature.

Monday, January 14, 2008 2:10:24 PM (Pacific Standard Time, UTC-08:00)  #    Comments [2]  | 
Saturday, January 12, 2008

SQLskills.com has released its first ever utility! :-)

After several years of being asked to make available the Dual Database Monitor application (that's present inside the popular AlwaysOn Hands-On Labs we give out), I've spent a bunch of time making it configurable, work outside the VPC environment, and available in an easy kit form for people to use. It comes with easy-to-follow instructions and example SQL scripts to get you going. It supports SQL Server 2005 and SQL Server 2008.

This is an invaluable tool to use when setting up a Database Mirroring partnership to check that mirroring is working ok, without having to hack up your own application to do it. It can also be used to monitor two nodes of a peer-to-peer replication topology (or more using multiple instances of the monitor). It comes in two versions - single-user or unlimited use within a single company.

Check out the DDM webpage here for more details and ordering info.

Saturday, January 12, 2008 11:28:58 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 
Thursday, January 10, 2008

Just got an email notification of a new whitepaper from the SQL Customer Advisory Team on Database Mirroring and Log Shipping Working Together. It covers:

  • Converting a log shipping setup to a database mirroring partnership
  • Setting up log shipping to a 3rd destination (i.e. warm standby to go with the mirroring hot standby)
  • Swapping the roles of the mirroring server and the log shipping secondary server

It's short at 8 pages but has some good info in it. It's available to download here.

While I was poking about for a better download location, I discovered another new whitepaper (from last year) on database mirroring, this time on Implementing Application Failover with Database Mirroring. Its concerned with how make applications failover gracefully when a mirroring failover happens. Again, its not very log but there's some useful code examples for ADO.NET and JDBC. You can download it here.

I've added both of these to our whitepapers page too. Enjoy!

Thursday, January 10, 2008 6:02:51 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 

There's a new whitepaper on TechNet that I've just come across (even though it was published 6 months ago!) called Predeployment I/O Best Practices. It's r