Monday, May 19, 2008

My first magazine article is in print! I've taken over the bi-monthly SQL Q&A column for TechNet Magazine and I just received the June magazine in the mail today with my first column in it. Topics covered are:

  • Creating corruption and using page checksums
  • The shrink-grow-shrink-grow trap
  • How many databases can be mirrored per instance
  • A tip on changing the default server port, from Jens Suessmeyer

I've also just completed a feature article for either the July or August issue dealing with database maintenance for the 'involuntary' DBA - more details when it gets published.

If you don't get the print version of TechNet Magazine, you can get to this month's SQL Q&A column at http://technet.microsoft.com/en-us/magazine/cc510328.aspx. There may not be anything new if you've been following my blog for a while, but if you've just started, it's worth a quick look.

Enjoy!

PS Let me know if you've got any good questions - I've already completed the August column but I'd like to hear of any questions you may have for later columns.

Monday, May 19, 2008 11:06:59 AM (Pacific Standard Time, UTC-08:00)
Wednesday, March 12, 2008

A couple of weeks ago I blogged about the three tracks of the SQL Server 2008 JumpStart course that SQLskills.com taught internally for Microsoft and some MVPs - see here for details. Well, the content is now available to download! Note that this was based on CTP-5 (November 2007 CTP) and there have been *lots* of behavioral changes since then (with more planned for CTP-6 Refresh and RTM), but if you want a high-level overview of a bunch of the features (albeit in bullet-point summaries on slides) then this is a good place to start.

For me, what's *REALLY* cool is that the site also has a downloadable VPC plus lab manuals for all of the AlwaysOn High-Availability hands-on labs that SQLskills.com wrote. We originally wrote these labs for SQL Server 2005 and I updated them all for CTP-5. The VPC has a long lab on each of the following:

  • Database Snapshots
  • Data Recovery and Preventative Techniques
  • Instant Initialization
  • Peer-to-Peer Replication (including the new Topology Wizard I blogged about here)
  • Table and Index Partitioning
  • Snapshot Isolation
  • Online Operations
  • Database Mirroring (including a demo I wrote of Automatic Page Repair, described here)
  • Service Oriented Database Architecture

There is some great depth in each of these - Kimberly blogged more info about the exercises in each lab here. There's also another VPC image with some higher-level labs on a variety of 2008 features, written by a number of different people - including some labs on Policy-Based Management and Performance Data Collection that Kimberly wrote.

So - where can you get these from? Go to http://sqlserver2008jumpstart.microsofttraining.com/ and hit the Download link on the right-hand side. Register and then you can get to the materials. The AlwaysOn VPC image is Collection 2 at the bottom of the page, and you'll see all the slide decks as you scroll down the page.

Enjoy!

Wednesday, March 12, 2008 9:23:23 AM (Pacific Standard Time, UTC-08:00)
Tuesday, March 04, 2008

Wow - almost 10 days without a blog post - that must be a record for me! :-) Never fear - I'll be posting more over the next couple of weeks. Kimberly flew off to India yesterday to teach some Microsoft classes and unfortunately I couldn't join her this time as I'm teaching 3 classes myself:

  • an internal Microsoft class on Designing for High Availability
  • another internal Microsoft class on SQL Server 2008 for DBAs (similar to the JumpStart class I posted about here)
  • 3 days of content for the new Microsoft Certified Architect: Database qualification - see the Microsoft Learning site here for details

Anyway - the subject of this post is to let you know that last week, Kimberly and I did two interviews for TechNet Radio on SQL Server 2008 technologies. Part 1 has just been released where we discuss security and availability features. You can get to it by going to the March 4th 2008 show here. Tune in and find out how I lull myself to sleep when Kimberly's out of town...

Enjoy!

Tuesday, March 04, 2008 6:31:23 PM (Pacific Standard Time, UTC-08:00)
Thursday, February 14, 2008

Phew - last week Kimberly and I spent 3 days teaching the ins-and-outs of SQL Server 2008 for DBAs/IT-Pros to about 130 Microsoft SQL Server experts and MVPs (like Kalen Delaney, Adam Machanic and Ron Talmage). This was the (95% complete) Beta delivery of a course we've been developing for the last six months for Microsoft that they'll use to train their SQL experts around the world on the new release. It's been very interesting watching the features develop through the CTPs (especially since I left the fold last August) - and making demos work on pre-release builds of the CTPs.

Teaching the course was a *blast* - the thing I love about teaching a really geeky crowd is the plethora of great questions and opportunities for going deep with explanations. Our team actually wrote and delivered the concurrently presented Developer and BI tracks as well. As you can see from the list below (and this is just the features a DBA needs to use/know about), SQL Server 2008 isn't a dot release of Yukon at all, as some people have suggested. Over the three days we covered:

  • Database Mirroring (D)
  • Backup Compression
  • Peer-to-Peer Replication (D)
  • Transparent Data Encryption (D)
  • Extensible (Off-Box) Key Management
  • All Actions Audited (D)
  • Policy-Based Management
  • Resource Governor (D)
  • Extended Events (D)
  • Spatial Indexes
  • Integrated Full-Text Search
  • Sparse Columns (D)
  • Filtered Indexes
  • Change Tracking
  • Change Data Capture (D)
  • FILESTREAM (D)
  • Performance Data Collection
  • Query Optimizer Enhancements
  • Data Compression (D)
  • Service Broker
  • Partition-Level Lock Escalation (D)

The features marked with a (D) are ones I demo'd during the course (Kimberly demo'd a bunch of the others - especially the tools features). Some of the demos were challenging to make work in time as we only got a pre-CTP6 build mid-January just before we headed off to China.

So why am I posting this? Well, a bunch of these features are in CTP-6, which should be just around the corner, and I have some easy-to-understand demos of them that I'll be posting here over the next month or so. Also, if this course sounds interesting, Kimberly and I will be teaching it in various configurations over the next year - starting with SQL Connections in April, a soon-to-be-announced class in Iceland in March, and the ITPro portion of TechEd in June.

Watch this space starting next week (today's the last day of six straight weeks of teaching for us so this weekend's a break :-))

Thursday, February 14, 2008 4:38:30 PM (Pacific Standard Time, UTC-08:00)
Saturday, January 12, 2008

SQLskills.com has released its first ever utility! :-)

After several years of being asked to make available the Dual Database Monitor application (that's present inside the popular AlwaysOn Hands-On Labs we give out), I've spent a bunch of time making it configurable, work outside the VPC environment, and available in an easy kit form for people to use. It comes with easy-to-follow instructions and example SQL scripts to get you going. It supports SQL Server 2005 and SQL Server 2008.

This is an invaluable tool to use when setting up a Database Mirroring partnership to check that mirroring is working ok, without having to hack up your own application to do it. It can also be used to monitor two nodes of a peer-to-peer replication topology (or more using multiple instances of the monitor). It comes in two versions - single-user or unlimited use within a single company.

Check out the DDM webpage here for more details and ordering info.

Saturday, January 12, 2008 11:28:58 PM (Pacific Standard Time, UTC-08:00)
Thursday, January 10, 2008

Just got an email notification of a new whitepaper from the SQL Customer Advisory Team on Database Mirroring and Log Shipping Working Together. It covers:

  • Converting a log shipping setup to a database mirroring partnership
  • Setting up log shipping to a 3rd destination (i.e. warm standby to go with the mirroring hot standby)
  • Swapping the roles of the mirroring server and the log shipping secondary server

It's short at 8 pages but has some good info in it. It's available to download here.

While I was poking about for a better download location, I discovered another new whitepaper (from last year) on database mirroring, this time on Implementing Application Failover with Database Mirroring. It's concerned with how to make applications fail over gracefully when a mirroring failover happens. Again, it's not very long but there are some useful code examples for ADO.NET and JDBC. You can download it here.

I've added both of these to our whitepapers page too. Enjoy!

Thursday, January 10, 2008 6:02:51 PM (Pacific Standard Time, UTC-08:00)
Wednesday, December 12, 2007

While we were in Barcelona we sat down with Richard Campbell and Greg Hughes from RunAs Radio to record a 1/2 hour interview on SQL Server 2008. We touch on a ton of different features (look at the number of Categories I've tagged this with!) and have a bunch of laughs along the way - check it out here.

PS There's been a ton of interest in the slide deck idea I had so we'll be going ahead with that. Look for an announcement sometime in the first few months of next year about how to get them. Thanks to everyone that replied!

Wednesday, December 12, 2007 10:07:05 AM (Pacific Standard Time, UTC-08:00)
Thursday, November 15, 2007

Here's a question that came up yesterday in our chalk-talk on database mirroring at TechEd IT Forum that Kimberly and I talked about this morning (here in Barcelona).

Q) I have a database mirroring session where the witness and mirror servers are in one physical location, and the principal server is in another. The mirroring session is running synchronously with the witness to allow automatic failover. A disaster happens to the site where the mirror and witness are, so the principal database is unavailable. I can't seem to access the principal at all to bring it back online by removing the witness, and the mirror and witness won't be available for hours. What can I do?

A) The behavior you're seeing (the principal database becoming unavailable) is expected. In a mirroring configuration with a witness, the principal needs to have quorum with (i.e. be able to see) at least one of the other partners, either the mirror, the witness, or both. If it can't see either, it doesn't know whether the witness and mirror can still see each other and the mirror may have brought itself online as the new principal. (Kimberly likes to say that the principal thinks the witness and mirror are conspiring against it :-)) In this case though, the customer knows that the mirror and witness are actually down and so he wants to bring the principal database back online.

I repro'd this situation in a VPC with three SQL Server 2008 instances running mirroring between them (the behavior is exactly the same in 2008 and 2005). I did a net stop on the mirror and witness servers and the principal database went offline. Trying to get into the principal database results in the following error:

USE TicketSalesDB;
GO

Msg 955, Level 14, State 1, Line 1
Database TicketSalesDB is enabled for Database Mirroring, but the database lacks quorum: the database cannot be opened.  Check the partner and witness connections if configured.

This is what I'd expect. The customer tried to remove the witness so let's try that:

ALTER DATABASE TicketSalesDB SET WITNESS OFF;
GO

Msg 1431, Level 16, State 4, Line 1
Neither the partner nor the witness server instance for database "TicketSalesDB" is available. Reissue the command when at least one of the instances becomes available.

That doesn't work either because removing the witness needs to happen on one of the partners as well as the principal. The only way to get out of this situation is to break the mirroring partnership completely.

ALTER DATABASE TicketSalesDB SET PARTNER OFF;
GO
USE TicketSalesDB;
GO

Command(s) completed successfully.
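
Before breaking the partnership, it can be worth confirming what the principal instance itself thinks the state of the session is. The following is a minimal sketch, reusing the database name from the repro above. It assumes that sys.database_mirroring, being a server-wide catalog view, can be queried from master while the database itself cannot be opened. I haven't verified that in this exact no-quorum state, so treat it as something to try rather than a guarantee.

```sql
-- Run in master on the instance that was hosting the principal.
SELECT DB_NAME(database_id)          AS database_name,
       mirroring_role_desc,          -- PRINCIPAL or MIRROR
       mirroring_state_desc,         -- e.g. SYNCHRONIZED, DISCONNECTED
       mirroring_witness_state_desc, -- CONNECTED, DISCONNECTED or UNKNOWN
       mirroring_partner_instance,
       mirroring_witness_name
FROM sys.database_mirroring
WHERE database_id = DB_ID(N'TicketSalesDB');
GO
```

Bear in mind that once SET PARTNER OFF has been run, the mirroring session is gone and has to be set up again when the other site comes back.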

Thursday, November 15, 2007 7:31:48 PM (Pacific Standard Time, UTC-08:00)
Friday, November 09, 2007

This is a question I was asked multiple times over the last week: the Microsoft guidelines for database mirroring say not to mirror more than 10 databases per instance - why is that and is it true?

The answer is my favorite 'it depends!'. The number 10 is a rough guess at the sweet-spot for the majority of customers. The factors that need to be considered are:

  • How much memory do the principal and mirror instances have? (hopefully the same)
  • How much processing-power do the principal and mirror instances have? (hopefully the same)
  • How much bandwidth does the IO subsystem have on the mirror instance? (hopefully the same as on the principal)
  • How much transaction log does the workload on each database generate?
  • How much network bandwidth is available between the principal and the mirror instances?

The last two factors are the most critical. If the network bandwidth available between the two instances is not enough to handle the combined transaction log generation rate per second from all databases being mirrored then performance will drop on the principal databases. SQL Server 2008 does alleviate some of this with log stream compression - see here for details. The next most critical thing to consider is the memory and thread requirements for mirroring - each mirrored database takes one thread plus some memory, so on low-powered servers, lots of mirrored databases may be too much load on the server when combined with the regular workload.

Here are some examples that I've seen:

  • A customer with 150 databases, all of which have very small amounts of activity, and not all at the same time, has them all mirrored with no problem.
  • A customer with only 3 heavily-loaded databases, but without a great network connection, that can barely mirror one of the databases without the lack of network bandwidth causing workload degradation.

The key to success here is to do the log generation calculation and then if it seems that the available network bandwidth will support the number of databases you want to mirror, test it first before relying on it in production.

I guess the bottom-line here is that any broad guidance is only that - your mileage may (and probably will) vary. Always do your own calculations and testing.
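
As a starting point for the log generation calculation, here is a rough sketch that samples the 'Log Bytes Flushed/sec' counter for every database twice and reports the average rate in between. The 60-second window and the temp table name are arbitrary choices of mine, not anything from Microsoft guidance. A real measurement should cover your peak workload, including maintenance jobs.

```sql
-- Sketch: average transaction log generation rate per database.
-- The per-second counters in sys.dm_os_performance_counters hold
-- cumulative values, so two samples are needed to derive a rate.
DECLARE @start datetime;
SET @start = GETDATE();

SELECT RTRIM(instance_name) AS database_name,
       cntr_value           AS log_bytes_flushed
INTO #first_sample
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%:Databases%'
  AND counter_name LIKE 'Log Bytes Flushed/sec%'
  AND instance_name <> '_Total';

WAITFOR DELAY '00:01:00';   -- assumption: a one-minute sample window

SELECT f.database_name,
       (s.cntr_value - f.log_bytes_flushed)
           / DATEDIFF(second, @start, GETDATE()) AS avg_log_bytes_per_sec
FROM sys.dm_os_performance_counters AS s
JOIN #first_sample AS f
  ON f.database_name = RTRIM(s.instance_name)
WHERE s.object_name LIKE '%:Databases%'
  AND s.counter_name LIKE 'Log Bytes Flushed/sec%'
ORDER BY avg_log_bytes_per_sec DESC;

DROP TABLE #first_sample;
GO
```

Adding up the rates for the databases you plan to mirror gives a number to compare against the network bandwidth actually available between the two instances.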

Friday, November 09, 2007 3:58:53 PM (Pacific Standard Time, UTC-08:00)
Monday, November 05, 2007

Sitting here in our Disaster Recovery class at SQL Connections and Kimberly's on till lunch so I'm banging out a quick blog post covering the database mirroring (DBM) specific questions.

Q1) Can I use IP addresses instead of server names when using the DBM Monitor?

A1) Unfortunately not.

Q2) Is there any in-built throttling mechanism in DBM to allow the mirror to catch-up when synchronizing a synchronous mirroring session?

A2) Yes. If there's more than 1MB of transaction log on the principal that hasn't been sent to the mirror, the mirroring session state will be switched from SYNCHRONIZED to SYNCHRONIZING and the principal itself will start adding a few milliseconds of delay to transaction commits until the amount of unsent log drops below 1MB.

Q3) Are there any tips when setting up a mirroring session using backups?

A3) Yes, make sure that all the backups are restored WITH NORECOVERY on the mirror. The database has to be unrecovered otherwise the mirroring session cannot start and you're back to square one with restoring the mirror database.
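
As an illustration, the restore sequence might look something like the sketch below. The database name, backup paths and server addresses are all hypothetical, and it assumes the mirroring endpoints have already been created on both instances.

```sql
-- On the mirror instance: restore everything WITH NORECOVERY.
-- (Database name, paths and addresses are made up for illustration.)
RESTORE DATABASE SalesDB
    FROM DISK = N'D:\Backups\SalesDB_full.bak'
    WITH NORECOVERY;

RESTORE LOG SalesDB
    FROM DISK = N'D:\Backups\SalesDB_log.trn'
    WITH NORECOVERY;
GO

-- Still on the mirror: point it at the principal's mirroring endpoint.
ALTER DATABASE SalesDB SET PARTNER = 'TCP://principal.example.com:5022';
GO

-- Then, on the principal instance, point it at the mirror's endpoint:
-- ALTER DATABASE SalesDB SET PARTNER = 'TCP://mirror.example.com:5022';
```

If any restore in the chain is accidentally run WITH RECOVERY (the default), the mirror copy has to be restored again from the full backup.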

Q4) What are the performance considerations with DBM - both in terms of the impact on the application workload and on DBM itself?

A4) Here are some links to resources that discuss this:

Q5) Any other resources?

A5) Blog posts...

Enjoy!

Monday, November 05, 2007 11:45:16 AM (Pacific Standard Time, UTC-08:00)
Thursday, October 25, 2007

(Been a few days since I posted - had some real work to do :-) Today I'll post a few things from the queue that's been building up)

This is part Q&A and part follow-on from my last post about running index maintenance when a database is mirrored.

A customer has a maintenance plan that involves running regular ALTER INDEX ... REORGANIZE on a 100GB clustered index to remove fragmentation. Three weeks ago they added database mirroring, with the database set up for synchronous mirroring. Every so often, they see the state of the mirror change from SYNCHRONIZED to SYNCHRONIZING and then a bit later back to SYNCHRONIZED. What's going on? Once a synchronously-mirrored database is synchronized, it should never get out of sync, right?

Well, not quite - if the communication link between the principal and the mirror is broken, then the mirror becomes unsynchronized. The exact behavior in this situation depends on how mirroring is set up and what's failed:

  1. If there's no witness instance, then transactions will continue on the principal database but the transaction log starts to grow, because the transactions can't be cleared from the principal's log (even after a log backup) until they've been sent to the mirror. The database is running 'exposed'. Once the link is reestablished, the mirror will synchronize again.
  2. If there's a witness, and the witness can still talk to the principal, then everything continues as in #1.
  3. If there's a witness, and the communication link between it and the principal is also broken, then the principal will stop serving the database - transactions will stop. In this case, if the mirror and the witness can still see each other, then a failover will occur.

There are some great Books Online entries that describe all of this - see http://msdn2.microsoft.com/en-us/library/ms179344.aspx to start with.

The customer had situation #1. Every so often the mirror would change state and it seemed to coincide with the defrag job. Looking in the error log shows messages like:

2007-10-24 11:43:36.21 spid23s     Error: 1474, Severity: 16, State: 1.

2007-10-24 11:43:36.21 spid23s     Database mirroring connection error 2 'Connection attempt failed with error: '10060(A connection attempt failed because the connected party did not properly respond after a period of time, or established connection failed because connected host has failed to respond.)'.' for 'TCP://roadrunnerpr.sqlskills.com:5022'.

So the network link was dying sometimes when the defrag was running - that explains the switch between SYNCHRONIZED and SYNCHRONIZING. Why the network link was dying is still under investigation but it seems like the additional transaction log generated by the defrag job was causing the network to become overloaded and some component of it wasn't behaving correctly under load.

There are a few things to learn from this:

  1. Not only do you need to make sure that your IO subsystem can handle the load on it correctly, you also need to make sure your network can handle the load on it. There are a bunch of tools available to stress-test network paths - one simple one is TrafficEmulator.
  2. When you're running on your test system before going into production, make sure you test *everything* as if you were running in production - including maintenance jobs because they can add significant load to a production system.
  3. When you implement an HA solution such as mirroring, consider all the ways that transaction log will be generated when figuring out the required network bandwidth to support your HA configuration - something like a defrag or rebuild can cause an enormous spike in log generation.

Thursday, October 25, 2007 9:04:07 AM (Pacific Standard Time, UTC-08:00)
Thursday, October 18, 2007

There's been some discussion over on the SQL Server Central forums about database mirroring performance, and one question in particular on how to do index rebuilds on mirrored VLDBs.

Remember that in database mirroring, the database has to be in the full recovery mode, so all index rebuilds are fully logged. It all comes down to the amount of transaction log generated and whether this causes a problem.

  • In synchronous mirroring, the additional log being generated could overload the network link between the principal and mirror. This will slow down the hardening of regular transactions in the mirror's transaction log, and thus lead to a decrease in transaction throughput on the principal.
  • In asynchronous mirroring, the additional log being generated could again overload the network link - but this time there's no requirement for transactions to harden in the mirror before they can commit on the principal, so instead the SEND queue on the principal gets really large. Now, in asynchronous mirroring the SEND queue represents the amount of work that will be lost if a failover occurs, as it's all the transaction log that hasn't yet been sent to the mirror. So, in asynchronous mode, a large index rebuild operation could lead to increased exposure to data loss in the event of a failover.

In SS2008, the log stream compression I blogged about here should go a long way to alleviating this problem. There are three ways I can think of to reduce the amount of transaction log generated by an index maintenance plan in SS2005:

  1. Use a potentially less expensive (in terms of logging) solution for removing index fragmentation. Doing an index rebuild is guaranteed to generate an equivalent amount of transaction log to the size of the index being rebuilt, no matter how much fragmentation there is - because an index rebuild always rebuilds the entire index. The alternative is to do an index reorganize, either using my old DBCC INDEXDEFRAG or the new ALTER INDEX ... REORGANIZE. These only generate transaction log when index pages are compacted and reorganized - so they generate less log for less heavily fragmented indexes. There's no hard and fast rule here but I generally say to reorganize where Logical Scan Fragmentation/Average Fragmentation in Percent from DBCC SHOWCONTIG/sys.dm_db_index_physical_stats, respectively, is less than 30%. You need to consider page density too - but really this is a topic for a whole other post.
  2. Be very selective on which indexes you choose to rebuild/reorganize. A lot of people have a maintenance plan that does this for every index every week, without checking whether the index is even fragmented or whether removing fragmentation for an index improves workload performance. Again, this is a whole other topic but there is an old whitepaper for SS2000 I helped with that describes some of this - Microsoft SQL Server 2000 Index Defragmentation Best Practices.
  3. Partition the tables/indexes so that the changing portion of the data is the only portion that's affected by index maintenance. If most of your data is read-only, there's no point in having it included in reindex/reorganize operations, right? Kimberly recently wrote a blog post about such an architecture here.
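
To make the first two points a little more concrete, here's a minimal sketch that lists the fragmented indexes in the current database and suggests reorganize or rebuild using the rough 30% figure mentioned above. The 1000-page cut-off for ignoring small indexes is an assumption of mine, and the query ignores page density entirely, so treat the output as a list of candidates rather than a maintenance plan.

```sql
-- Sketch: candidate indexes for fragmentation removal in the current database.
SELECT OBJECT_NAME(ips.object_id)        AS table_name,
       i.name                            AS index_name,
       ips.avg_fragmentation_in_percent,
       ips.page_count,
       CASE
           WHEN ips.avg_fragmentation_in_percent < 30
               THEN 'ALTER INDEX ... REORGANIZE'
           ELSE 'ALTER INDEX ... REBUILD'
       END                               AS suggested_action
FROM sys.dm_db_index_physical_stats(DB_ID(), NULL, NULL, NULL, 'LIMITED') AS ips
JOIN sys.indexes AS i
  ON i.object_id = ips.object_id
 AND i.index_id  = ips.index_id
WHERE ips.index_id > 0            -- skip heaps
  AND ips.page_count > 1000       -- assumption: ignore small indexes
ORDER BY ips.avg_fragmentation_in_percent DESC;
GO
```

Even for indexes that show up here, it's still worth checking whether removing the fragmentation actually helps the workload before paying the logging cost on a mirrored database.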

For more info on database mirroring performance considerations, check out the whitepaper Database Mirroring: Best Practices and Performance Considerations. There's also a slide deck presentation based on this whitepaper that's been presented at various TechEds (I did it in China and Hong Kong last year and Kimberly did it in South Africa last year) - you can download it from the Hong Kong website here.

The bottom line is when mirroring is in the mix, you need to be more intentional with your database maintenance.

Thursday, October 18, 2007 4:37:43 PM (Pacific Standard Time, UTC-08:00)
Wednesday, October 10, 2007

As I mentioned in a previous post, one of the new features for database mirroring in SQL Server 2008 is log stream compression. There's a good explanation of what this is (along with some example workloads and performance graphs) on the SQL Customer Advisory Team blog so I'm not going to duplicate all that here.

Basically, the way mirroring works is by shipping all the transaction log from the principal to the mirror, and then replaying the log records on the mirror database. Network bandwidth can be a bottleneck for mirrored databases where the rate of transaction log generation is high enough to saturate the network. On synchronously mirrored databases (where a transaction cannot commit on the principal until all the transaction log for it has been hardened in the log of the mirror database), this can even lead to workload throughput degradation.

Log compression is a way to alleviate this problem. SS2008 Books Online states that compression rates of at least 12.5% are achieved - obviously the compression ratio could be much higher than that and is dependent on what's being compressed - i.e. the data that is being processed by the application. One of the tests they did for the post above shows a 5x compression ratio - that's pretty good.

The downside of this is that compression is not free - extra CPU must be used. The obvious CPU load increase comes from having to compress the transaction log on the principal and then decompress it again on the mirror before hardening it in the log. The non-obvious increase will come because the principal and mirror should end up being able to process more transactions/second, which means more CPU is needed.

This extra CPU could be a problem for some systems that are already pegged in terms of CPU load, and so they may actually see a drop in performance when log stream compression is enabled. Now, it's on by default when you upgrade to SS2008 but there is a way to turn it off. The blog post above divulges that there is a trace flag, 1462, that turns off log stream compression and effectively reverts the behavior back to SS2005.
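
For reference, turning a trace flag on globally from T-SQL looks like the sketch below. This only shows the generic DBCC syntax applied to trace flag 1462. Whether an existing mirroring session picks the change up immediately or needs to be restarted is something I'd check against the SQLCAT post and test first. A flag set this way also doesn't survive an instance restart (the usual alternative is the -T startup parameter).

```sql
-- Turn off log stream compression for the whole instance (SS2008).
DBCC TRACEON (1462, -1);
GO

-- Check whether the trace flag is currently enabled.
DBCC TRACESTATUS (1462);
GO

-- To go back to the default (compression on):
-- DBCC TRACEOFF (1462, -1);
```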

Apart from the obvious benefits of this feature for systems that can afford the extra CPU, I'm excited because it may allow more mirrored databases to perform effective database maintenance - in the form of fragmentation control and index maintenance. When database mirroring is enabled, the database has to be running in the full recovery mode. This means that operations such as index rebuilds are fully logged - that's a lot of extra log being generated if you're used to switching to bulk-logged recovery mode to perform index operations. Of course, this isn't an issue for reorganizing an index (with ALTER INDEX ... REORGANIZE) as this has been fully logged (and has to be that way because of how it works) since I wrote the old DBCC INDEXDEFRAG for SS2000.

Anyway, for some customers the fact that these two operations are fully logged means that running them produces so much additional log that, in conjunction with the log from the workload itself, the network becomes a bottleneck for mirroring and it both slows down the maintenance and reduces the workload throughput. The addition of log stream compression means that for these systems, the log may be compressed enough so that the network is no longer a bottleneck and more regular fragmentation control can take place. This in turn will increase workload throughput as well - a double benefit!

Wednesday, October 10, 2007 4:06:51 PM (Pacific Standard Time, UTC-08:00)
Monday, October 01, 2007

Ok - I'm on a roll today so to finish off I'd like to repost some info about database mirroring failover types (including how you may not actually get a failover when you expect it) and a tip for how to avoid unwanted failovers when combining clustering and mirroring. Mirroring failovers have cropped up several times in my blog's search engine logs so this is an easy one to answer. This is based on a TechEd post from June on the Storage Engine blog.

Clustering, mirroring and altering the partner timeout

The first question is about combining failover clustering and database mirroring. A customer wanted to put the instance hosting the principal database on a failover cluster and have the mirror database on an unclustered server. The issue he saw with the setup was when something happened on the active cluster server and it failed over. The failover would take 60-90 seconds, which means the principal database would not be available during that time.

The problem stems from the way mirroring figures out whether to initiate a failover. The mechanism it uses is to have all SQL Server instances within a mirroring session send out 'pings' on each of their open mirroring connections every second. If an instance does not receive a ping from one of the other instances in the session within a defined time-out period (called the partner time-out), it considers the other instance as unavailable and takes the appropriate action. The default time-out period is ten seconds.

In the case above, when the cluster is failing over and the principal instance is unavailable for 60-90 seconds, the other instances in the mirroring session will time-out after ten seconds and initiate a mirroring failover. Although this is the correct behavior of mirroring, it is undesirable behavior for this customer. The solution to the problem is to change the partner time-out value - a facility that isn't well known.

To change the partner time-out value for a mirroring session, use the following code:

ALTER DATABASE mydatabase SET PARTNER TIMEOUT 90;
GO

The value at the end of the statement is the new partner timeout value in seconds. There are a couple of things to be aware of if you're going to do this:

  • You can only issue this statement on the principal server.
  • Be very careful not to set the time-out value too low, otherwise you run the risk of triggering failovers because of false failures - especially on heavily-loaded systems that may not be able to respond within the time-out period. In fact, if you specify a time-out value of 4 seconds or lower, SQL Server will automatically set the time-out to 5 seconds to help avoid such problems.
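
To check what the partner time-out is actually set to after changing it, you can look at the sys.database_mirroring catalog view. A minimal sketch:

```sql
-- Show the current partner time-out for every mirrored database on this instance.
SELECT DB_NAME(database_id)         AS database_name,
       mirroring_role_desc,
       mirroring_connection_timeout -- partner time-out, in seconds
FROM sys.database_mirroring
WHERE mirroring_guid IS NOT NULL;   -- only rows for databases that are mirrored
GO
```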

Database mirroring failover types

What are the different kinds of failures that can trigger mirroring failovers, and how quickly does the failover happen after the problem occurs? As with most questions I get, I can use my favorite answer of "It depends!" :-) Let's look at some examples of failures and see how quickly the failover occurs, in decreasing order of speed.

  • Fastest: The fastest possible failover occurs when the SQL Server instance crashes (so the mirroring connection endpoint no longer exists) but the operating system is still running. When the next ping comes from a partner instance, the OS knows that the network port (that was being used by the mirroring connection endpoint) is no longer being listened to and returns a failure. This immediately triggers a failover.
  • Fast: The next fastest failover occurs when the machine hosting the SQL Server instance crashes or shuts down (e.g. power supply cord pulled out of the back). In this case, nothing happens until the partner time-out period has expired (as there's no OS running to return the immediate failure) and then a failover will occur. By default this will be in ten seconds, but you can change this as I explained above.
  • Slow: A slow failover is when something happens to the server but it takes a while for the database to go offline. An example of this would be someone pulling out the transaction log drive on the principal server. Writes to the transaction log will start to queue up. After 20 seconds SQL Server will issue an IO warning but it isn't until 40 seconds have passed that SQL Server issues an IO failure and the database goes suspect. It's not until this point that the failover occurs - even though the partner timeout value is 10 seconds!
  • Maybe fast or not at all: This discussion all started by someone asking about how quickly a failover happens if a page checksum failure is detected. My answer was - it depends! If the page checksum failure happens during a query, then all that happens is that the checksum failure is reported, the query rolls back, and the connection is broken. It's only if the page checksum failure happens during a transaction rollback that a failover will be triggered. This is because a transaction rollback failure means the database is in a transactionally inconsistent state and has to be set to suspect mode - which triggers a mirroring failover.

So, don't assume that just because mirroring is set up that every failure will trigger a fast failover.
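
If you want to see which of these situations you're in, here's a rough sketch of the kind of check I'd run on the principal. It assumes the database is called TicketSalesDB (a stand-in name) and simply looks at the database state, the page verify option, and any page errors that SQL Server has recorded in msdb:

```sql
-- Is the database online or suspect, and is page checksumming enabled?
-- TicketSalesDB is an assumed database name for illustration.
SELECT name, state_desc, page_verify_option_desc
FROM sys.databases
WHERE name = 'TicketSalesDB';
GO

-- Pages that have hit IO errors on this instance.
-- An event_type of 2 means a bad page checksum.
SELECT database_id, file_id, page_id, event_type, error_count, last_update_date
FROM msdb.dbo.suspect_pages
ORDER BY last_update_date DESC;
GO
```

A checksum failure showing up in suspect_pages while the database is still ONLINE matches the 'not at all' case above: the error was reported but no failover was triggered.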

Monday, October 01, 2007 3:31:44 PM (Pacific Standard Time, UTC-08:00)  #    Comments [0]  | 

Part of the improvements to database mirroring in SQL Server 2008 is the addition of 10 new performance counters under the Database Mirroring Performance Object. These add some cool troubleshooting capabilities.

SQL Server 2005 provided the following 11 counters (from Books Online):

  • Bytes Received/sec: Number of bytes received per second.
  • Bytes Sent/sec: Number of bytes sent per second.
  • Log Bytes Received/sec: Number of bytes of log received per second.
  • Log Bytes Sent/sec: Number of bytes of log sent per second.
  • Log Send Queue: Total number of bytes of log that have not yet been sent to the mirror server.
  • Pages Sent/sec: Number of pages sent per second.
  • Receives/sec: Number of mirroring messages received per second.
  • Redo Bytes/sec: Number of bytes of log rolled forward on the mirror database per second.
  • Redo Queue: Total number of bytes of hardened log that currently remain to be applied to the mirror database to roll it forward. This is sent to the Principal from the Mirror.
  • Sends/sec: Number of mirroring messages sent per second.
  • Transaction Delay: Delay in waiting for unterminated commit acknowledgement.

SQL Server 2008 now provides 21 counters; the 10 new ones are marked (new) below. This info is taken from the SQL Server 2008 July CTP Books Online that is downloadable here.

  • Bytes Received/sec: Number of bytes received per second.
  • Bytes Sent/sec: Number of bytes sent per second.
  • Log Bytes Received/sec: Number of bytes of log received per second.
  • Log Bytes Redone from Cache/sec (new): Number of redone log bytes that were obtained from the mirroring log cache, in the last second. This counter is used on only the mirror server. On the principal server the value is always 0.
  • Log Bytes Sent from Cache/sec (new): Number of sent log bytes that were obtained from the mirroring log cache, in the last second. This counter is used on only the principal server. On the mirror server the value is always 0.
  • Log Bytes Sent/sec: Number of bytes of log sent per second.
  • Log Compressed Bytes Rcvd/sec (new): Number of compressed bytes of log received, in the last second.
  • Log Compressed Bytes Sent/sec (new): Number of compressed bytes of log sent, in the last second.
  • Log Harden Time (ms) (new): Milliseconds that log blocks waited to be hardened to disk, in the last second.
  • Log Remaining for Undo KB (new): Total kilobytes of log that remain to be scanned by the new mirror server after failover. This counter is used on only the mirror server during the undo phase. After the undo phase completes, the counter is reset to 0. On the principal server the value is always 0.
  • Log Scanned for Undo KB (new): Total kilobytes of log that have been scanned by the new mirror server since failover. This counter is used on only the mirror server during the undo phase. After the undo phase completes, the counter is reset to 0. On the principal server the value is always 0.
  • Log Send Flow Control Time (ms) (new): Milliseconds that log stream messages waited for send flow control, in the last second. Sending log data and metadata to the mirroring partner is the most data-intensive operation in database mirroring and might monopolize the database mirroring and Service Broker send buffers. Use this counter to monitor the use of this buffer by the database mirroring session.
  • Log Send Queue KB: Total number of kilobytes of log that have not yet been sent to the mirror server.
  • Mirrored Write Transactions/sec (new): Number of transactions that wrote to the mirrored database and waited for the log to be sent to the mirror in order to commit, in the last second. This counter is incremented only when the principal server is actively sending log records to the mirror server.
  • Pages Sent/sec: Number of pages sent per second.
  • Receives/sec: Number of mirroring messages received per second.
  • Redo Bytes/sec: Number of bytes of log rolled forward on the mirror database per second.
  • Redo Queue KB: Total number of kilobytes of hardened log that currently remain to be applied to the mirror database to roll it forward. This is sent to the Principal from the Mirror.
  • Send/Receive Ack Time (new): Milliseconds that messages waited for acknowledgement from the partner, in the last second. This counter is helpful in troubleshooting a problem that might be caused by a network bottleneck, such as unexplained failovers, a large send queue, or high transaction latency. In such cases, you can analyze the value of this counter to determine whether the network is causing the problem.
  • Sends/sec: Number of mirroring messages sent per second.
  • Transaction Delay: Delay in waiting for unterminated commit acknowledgement.
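
If you'd rather look at these counters from T-SQL than from Performance Monitor, here's a quick sketch using the sys.dm_os_performance_counters DMV. The LIKE pattern is my assumption for matching the object name on both default and named instances, and as far as I know the '/sec' counters are exposed here as raw cumulative values rather than per-second rates, so you'd need to sample twice and take the difference to get a rate:

```sql
-- List every Database Mirroring counter for every mirrored database
-- on this instance. The LIKE pattern is an assumption intended to match
-- both default and named instances.
SELECT
    RTRIM (instance_name) AS database_name,
    RTRIM (counter_name) AS counter_name,
    cntr_value
FROM sys.dm_os_performance_counters
WHERE object_name LIKE '%Database Mirroring%'
ORDER BY instance_name, counter_name;
GO
```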

Here's a little more explanation and what you can use these new performance counters to troubleshoot:

  • Log Bytes Redone from Cache/sec
    • This measures how much of the transaction log in the redo queue is being read by the log redo task from the mirror's in-memory transaction log cache. Reading from the cache is a lot faster than having to read from the mirror's actual transaction log. Even though the log gets hardened on the mirror database's log disk, it does not need to be removed from the cache until the cache fills up with new transaction log from the principal.
    • You could think of this as a cache hit ratio measure for the redo queue.
    • If this number is lower than usual, it means that transaction log is arriving from the principal faster than the log redo task can roll forward the transaction log in the redo queue.
  • Log Bytes Sent from Cache/sec
    • This is similar to the counter above. It measures how much of the transaction log being sent from the principal to the mirror is being read from the principal's in-memory transaction log cache. Sending from the cache is a lot faster than having to go to the transaction log itself and read from disk.
    • You could think of this as a cache hit ratio for the send queue.
    • If this number is lower than usual it means that the transaction log is being generated on the principal faster than it can be sent to the mirror.
  • Log Compressed Bytes Rcvd/sec
  • Log Compressed Bytes Sent/sec
    • These are simple counters and can be used with the Log Bytes Sent/sec and Rcvd/sec counters to determine the compression ratio. Log stream compression is another enhancement in SQL Server 2008 that I'll cover in a future post.
  • Log Harden Time (ms)
    • This measures the delay between the mirror server receiving a chunk of transaction log and it being hardened on the mirror database's log disk (i.e. the delay before the chunk of transaction log becomes part of the redo queue on the mirror server).
    • If this number is higher than normal it means the mirror database's log disk is more heavily loaded and may be becoming saturated.
  • Log Remaining for Undo KB
  • Log Scanned for Undo KB
    • The Books Online entries for these counters are self-explanatory.
    • These counters give a way to monitor the undo phase after a failover occurs.
  • Log Send Flow Control Time (ms)
    • This measures how long a mirroring connection had to wait before it could use the mirroring flow control buffer.
    • If this number is higher than normal it means there is contention for the buffer, most likely because there are too many Database Mirroring partnerships running from a single instance.
  • Mirrored Write Transactions/sec
    • As Books Online mentions, this counts the number of transactions in the principal database that had to wait for a commit record to harden in the mirror database's transaction log.
    • If this value is lower than normal (for the same application workload) it means there is a bottleneck somewhere in the system.
  • Send/Receive Ack Time
    • As Books Online mentions, this can be used to measure network latency between the principal and mirror servers.
    • If this value is larger than normal it means that there is a network bottleneck between the principal and mirror servers.
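
As an example of putting two of these counters together, here's a rough sketch of working out the log stream compression ratio on the principal. It assumes (as described above) that Log Bytes Sent/sec counts the uncompressed log and Log Compressed Bytes Sent/sec counts the compressed bytes, and that the values in the DMV are cumulative, so the result is an average since the counters started rather than a point-in-time figure:

```sql
-- Approximate average compression ratio of the log stream, per database.
-- Assumes the two counters are cumulative and measure the same log stream
-- before and after compression.
SELECT
    RTRIM (s.instance_name) AS database_name,
    s.cntr_value AS log_bytes_sent,
    c.cntr_value AS log_compressed_bytes_sent,
    CAST (s.cntr_value AS FLOAT) / NULLIF (c.cntr_value, 0) AS compression_ratio
FROM sys.dm_os_performance_counters AS s
JOIN sys.dm_os_performance_counters AS c
    ON c.object_name = s.object_name
    AND c.instance_name = s.instance_name
WHERE s.object_name LIKE '%Database Mirroring%'
    AND s.counter_name = 'Log Bytes Sent/sec'
    AND c.counter_name = 'Log Compressed Bytes Sent/sec';
GO
```

The NULLIF is there so that a database that hasn't sent any compressed log yet returns NULL instead of a divide-by-zero error.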

Hopefully Microsoft will publish a whitepaper or some troubleshooting scenarios showing these counters being used.

Monday, October 01, 2007 12:31:38 PM (Pacific Standard Time, UTC-08:00)  #    Comments [1]  | 
Friday, September 28, 2007

This was a question from the MSDN Disaster Recovery forum I started while I was at Microsoft.

I have a 600 gig database that has a mirror.  I need to move the databases from local drives to a SAN.  Can anyone recommend a document that lists the steps to go through to move both the principle and mirror to the SAN with no down time? or minimal down time?

As far as I know, there isn't any such document so I had a crack at coming up with a list of operations. Here's what I had:

  1. Take a full backup of the principal on node A
  2. Restore it on the SAN on node B using WITH NORECOVERY, remembering to use WITH MOVE to place the files correctly, and with a different database name than the current mirror
  3. Take the required log backup on the principal and restore on the database copy on the SAN on node B
  4. Break the mirroring partnership
  5. Drop the current mirror database on node B
  6. Rename the database on the SAN on node B to be the mirror database -- THIS DOESN'T WORK!
  7. Set up the mirroring partnership to point to the newly restored database on the SAN on node B
  8. Start mirroring and the new mirror will catch-up
  9. Failover to the mirror on node B, which becomes the new principal
  10. Follow the same procedure to move the new mirror on node A onto its SAN
  11. Failback if you want to

And I promised to try it out to make sure I had it right so in this blog post I'm going to walk through the steps of doing this. It turns out that the steps above are slightly incorrect. Step 6 above doesn't work because the database is in recovery (so is inaccessible) and there's a short-cut when moving the database on the first node to avoid having to take and copy more backups. Let's see how it works and I'll post the corrected sequence at the end.

As I did in yesterday's mirroring post, I'm going to use the TicketSalesDB database from our Always-On DVDs. It's only a few hundred MB instead of 600GB but the principal is the same (no pun intended :-)). I've got mirroring running between two nodes, SQLDEV01 (the principal) and SQLDEV02 (the mirror), both of which are running 2005 SP2, and I've got a simulated workload inserting rows into the database. I don't actually have a SAN lying around so I'm cheating and I have directories called C:\SQLDEV01SAN and C:\SQLDEV02SAN instead. It's the location change that's the interesting part, not where the actual location is.

Step 1

On SQLDEV01, take a full backup and a log backup:

BACKUP DATABASE TicketSalesDB TO DISK = 'C:\SQLskills\TicketSalesDB.BAK' WITH INIT;

GO

BACKUP LOG TicketSalesDB TO DISK = 'C:\SQLskills\TicketSalesDB_Log.bak' WITH INIT;

GO

Step 2

On SQLDEV01, break the mirroring partnership:

ALTER DATABASE TicketSalesDB SET PARTNER OFF;

GO

And just check that it's gone:

SELECT mirroring_state_desc FROM sys.database_mirroring WHERE database_id = DB_ID ('TicketSalesDB');

GO

which returns:

NULL

Step 3

On SQLDEV02, drop the mirror database - this wouldn't work unless mirroring was no longer running:

DROP DATABASE TicketSalesDB;

GO

Step 4

Copy the backups to SQLDEV02 and restore them on the SAN, remembering to use WITH NORECOVERY:

RESTORE DATABASE TicketSalesDB FROM DISK = 'C:\SQLskills\TicketSalesDB.bak'
WITH
    MOVE 'TicketSalesDBData' TO 'C:\SQLDEV02SAN\TicketSalesDBData.MDF',
    MOVE 'TicketSalesFG2005Q1' TO 'C:\SQLDEV02SAN\TicketSalesFG2005Q1.NDF',
    MOVE 'TicketSalesFG2005Q2' TO 'C:\SQLDEV02SAN\TicketSalesFG2005Q2.NDF',
    MOVE 'TicketSalesFG2005Q3' TO 'C:\SQLDEV02SAN\TicketSalesFG2005Q3.NDF',
    MOVE 'TicketSalesFG2005Q4' TO 'C:\SQLDEV02SAN\TicketSalesFG2005Q4.NDF',
    MOVE 'TicketSalesDBLog' TO 'C:\SQLDEV02SAN\TicketSalesDBLog.LDF',
    NORECOVERY;

GO

RESTORE LOG TicketSalesDB FROM DISK = 'C:\SQLskills\TicketSalesDB_Log.bak' WITH NORECOVERY;

GO
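
If you're doing this with your own database and don't know the logical file names to use in the MOVE clauses, here's a small sketch of how to find them, and how to confirm the restored copy is in the right state before setting up mirroring. The backup path is the one used above; substitute your own:

```sql
-- List the logical and physical file names stored in the full backup.
-- The LogicalName column supplies the names needed for WITH MOVE.
RESTORE FILELISTONLY FROM DISK = 'C:\SQLskills\TicketSalesDB.bak';
GO

-- After the restores, the database should be in the RESTORING state,
-- which is what WITH NORECOVERY leaves behind.
SELECT name, state_desc
FROM sys.databases
WHERE name = 'TicketSalesDB';
GO
```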

Step 5

On SQLDEV02, set the mirroring partner to be SQLDEV01:

ALTER DATABASE TicketSalesDB SET PARTNER = 'TCP://SQLDEV01:5091';

GO
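
The port numbers in the partner addresses (5091 and 5092 here) are whatever the mirroring endpoints on each instance were created with. If you're not sure what yours are, a sketch like this run on each node will show the endpoint and the port it's listening on:

```sql
-- Show the database mirroring endpoint on this instance, its state,
-- its role, and the TCP port it listens on.
SELECT e.name, e.state_desc, e.role_desc, t.port
FROM sys.database_mirroring_endpoints AS e
JOIN sys.tcp_endpoints AS t
    ON t.endpoint_id = e.endpoint_id;
GO
```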

Step 6

On SQLDEV01, start mirroring:

ALTER DATABASE TicketSalesDB SET PARTNER = 'TCP://SQLDEV02:5092';

GO

And check that it's running:

SELECT mirroring_state_desc FROM sys.database_mirroring WHERE database_id = DB_ID ('TicketSalesDB');

GO

This time it returns:

SYNCHRONIZED

Step 7

Now we need to failover so that we can move the database on SQLDEV01 onto its SAN. Before we do that, let's make sure that SQLDEV01 is the principal:

SELECT mirroring_role_desc FROM sys.database_mirroring WHERE database_id = DB_ID ('TicketSalesDB');

GO

which returns:

PRINCIPAL

Now force the failover:

ALTER DATABASE TicketSalesDB SET PARTNER FAILOVER;

GO

And query the DMV again to make sure. This time the mirroring_role_desc returned is:

MIRROR

Excellent!
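
If you wanted to script the check and the failover together, here's a minimal sketch of one way to do it. The error message text is my own invention, and I'm running from the master database context, which is what Books Online recommends for a manual failover:

```sql
USE master;
GO

-- Only attempt the manual failover if this node is the principal and the
-- mirror is fully synchronized; otherwise raise an error and do nothing.
IF EXISTS (
    SELECT 1
    FROM sys.database_mirroring
    WHERE database_id = DB_ID ('TicketSalesDB')
        AND mirroring_role_desc = 'PRINCIPAL'
        AND mirroring_state_desc = 'SYNCHRONIZED')
BEGIN
    ALTER DATABASE TicketSalesDB SET PARTNER FAILOVER;
END
ELSE
BEGIN
    -- Hypothetical message text, for illustration only.
    RAISERROR ('TicketSalesDB is not a synchronized principal - not failing over.', 16, 1);
END
GO
```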

Now, I did all of this while my workload was running and it automatically failed over to SQLDEV02, with the database now hosted on the SAN. To do the same move on SQLDEV01, we don't need to go through the backup and copy process again - we can just use the original backups we took in step 1.

Step 8

We need to break the mirroring partnership again, this time executing on SQLDEV02, the new principal:

ALTER DATABASE TicketSalesDB SET PARTNER OFF;

GO

On SQLDEV01, we can now drop the database and restore the original backups onto the SAN:

DROP DATABASE TicketSalesDB;

GO

RESTORE DATABASE TicketSalesDB FROM DISK = 'C:\SQLskills\TicketSalesDB.bak'
WITH
    MOVE 'TicketSalesDBData' TO 'C:\SQLDEV01SAN\TicketSalesDBData.MDF',
    MOVE 'TicketSalesFG2005Q1' TO 'C:\SQLDEV01SAN\TicketSalesFG2005Q1.NDF',
    MOVE 'TicketSalesFG2005Q2' TO 'C:\SQLDEV01SAN\TicketSalesFG2005Q2.NDF',
    MOVE 'TicketSalesFG2005Q3' TO 'C:\SQLDEV01SAN\TicketSalesFG2005Q3.NDF',
    MOVE 'TicketSalesFG2005Q4' TO 'C:\SQLDEV01SAN\TicketSalesFG2005Q4.NDF',
    MOVE 'TicketSalesDBLog' TO 'C:\SQLDEV01SAN\TicketSalesDBLog.LDF',
    NORECOVERY;

GO

RESTORE LOG TicketSalesDB FROM DISK = 'C:\SQLskills\TicketSalesDB_Log.bak' WITH NORECOVERY;

GO

And set up mirroring again. On SQLDEV01:

ALTER DATABASE TicketSalesDB SET PARTNER = 'TCP://SQLDEV02:5092';

GO

And on SQLDEV02:

ALTER DATABASE TicketSalesDB SET PARTNER = 'TCP://SQLDEV01:5091';

GO

And we're running again.
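
To double-check that the files really did end up in the new location, here's a quick sketch you can run on either node. It reads from sys.master_files, so it works even on the mirror where the database itself is inaccessible:

```sql
-- Show where each file of the database physically lives on this node.
SELECT name, physical_name, state_desc
FROM sys.master_files
WHERE database_id = DB_ID ('TicketSalesDB');
GO
```

On SQLDEV01 the physical_name values should now all start with C:\SQLDEV01SAN, and on SQLDEV02 with C:\SQLDEV02SAN.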

Step 9

Now all we need to do is fail the workload back to SQLDEV01 by executing this on SQLDEV02:

ALTER DATABASE TicketSalesDB SET PARTNER FAILOVER;

GO

Summary

So - the corrected sequence for moving a database while mirroring is running is the following:

  1. Take a full backup of the principal database on node A, and the required log backup
  2. Break the mirroring partnership
  3. Drop the current mirror database on node B
  4. Copy the backups to node B and restore them on the SAN on node B using WITH NORECOVERY, remembering to use WITH MOVE to place the files correctly
  5. Set up the mirroring partnership to point to the newly restored database on the SAN on node B
  6. Start mirroring and the new mirror will catch-up
  7. Failover to the mirror on node B, which becomes the new principal
  8. Follow the same procedure to move the new mirror on node A onto its SAN, but using the original backups from step 1
  9. Failback

Hope this helps.

Friday, September 28, 2007 7:27:44 AM (Pacific Standard Time, UTC-08:00)  #