Ignorance is not stupidity

You’re ignorant. About lots of things. Yes, you are.

Feel offended? We’re all ignorant about lots and lots of things.

Last week I wrote a non-technical blog post The Golden Rule – maybe just optional now? about the growing lack of civility in the world at large. That elicited almost 40 responses (thanks!) so I’m going to intersperse more non-technical posts in with the technical ones – as I have a lot of views to express :-)

This one is about ignorance. In my opinion many people don’t understand ignorance and take offense if you say they are ignorant about such and such. In fact being ignorant just means that you don’t know something – it’s not derogatory or a statement of blame.

And it definitely does not mean that someone is stupid. However, time and again I see ignorance equated with stupidity. This is very common to see on internet forums where question posters can be heavily railed on by more experienced people for not knowing X or Y about SQL Server. And do you think that’s going to make them come back to ask more questions to learn about SQL Server? No.

[Edit: I’m not saying that people use the word ‘ignorant’ all the time, but the implication is there.]

An example: yesterday on a distribution list I’m on, someone asked what to tell a DBA who insists on rebooting SQL Server regularly. The first knee-jerk reply to the alias was unfortunately “You’re fired!”

Wrong answer.

So often I see knee-jerk responses to problems and questions – and they’re usually wrong – as in this case. Jumping to conclusions quickly can be damaging – something I used to do a long time ago when I first started my career and I was slowly trained out of it (thankfully).

The correct response to the original question would be to ask the DBA *why* the server is being rebooted and the rationale behind rebooting being the preferred fix for the problem. And then educate.

Here’s something you may not have thought about: every single person in the world starts out with absolutely zero knowledge about SQL Server.

When I joined the SQL Server team on February 1st 1999, I knew zip about SQL Server. Now I know lots about many aspects of it, and still zip about many other aspects of it. The same goes for Kimberly. And Kalen. And Itzik. And many others that the community considers an expert in SQL Server. Nobody knows everything about everything, and everyone starts from scratch at some point in their careers.

Today there is a growing proliferation of involuntary DBAs who have to deal with the big, complex beast we know of as SQL Server. You can’t expect people to know everything straight away. You can’t expect people to necessarily know what they *should* know, that comes with experience. And even experienced DBAs who’ve learned that SQL Server does X sometimes don’t know that the behavior changed and now it does Y instead. Sometimes people aren’t given time to learn by their employers, and non-work commitments stop them spending hours of their own time learning.

So next time you see someone asking a question that you think is so simple that they should know the answer, or that *everyone* knows you shouldn’t do X or Y, cut them some slack and educate them nicely. Empathize. Don’t belittle them. Don’t rail on them.

And don’t equate ignorance with stupidity.

Disaster recovery 101: dealing with negative SPIDS (-2 and -3)

Every so often I get asked how to deal with SPID -2 and once in a blue moon I get asked about SPID -3. Neither of these SPIDs will show up in sysprocesses or sys.dm_exec_requests or sys.dm_exec_sessions, but they’re both valid SPIDs.

SPID -2 is an orphaned DTC transaction.

SPID -3 is a deferred transaction.

Both SPIDs will hold locks that can cause blocking, and these are often how they’re noticed in the first place.

SPID -4 is a transaction blocked waiting for a latch but the latch owner can’t be determined. There’s no good way to deal with these.

For a good walkthrough of how to kill an orphaned DTC transaction if you cannot resurrect it any other way, see this excellent post by Ajmer Dhariwal (ex-Product Support, and a student in our London Immersion Event this week).

A deferred transaction is one where the transaction could not be recovered fully because of an I/O error or a gross file system problem like an unavailable file. Books Online has a good description here – so I won’t regurgitate it.

What I’d like to do is show you how to create a deferred transaction so you can actually see SPID -3 and how to tell if you have deferred transactions.

First off I’ll create a database, with an extra file sitting on a USB drive (G:) in my laptop.

CREATE DATABASE [DeferredTran];
GO

ALTER DATABASE [DeferredTran] ADD FILEGROUP [FG_USB];
GO

ALTER DATABASE [DeferredTran] ADD FILE (
    NAME = [FG_USB_1], FILENAME = N'G:\FG_USB_1.ndf')
TO FILEGROUP [FG_USB];
GO

BACKUP DATABASE [DeferredTran] TO DISK = N'C:\SQLskills\DeferredTran.bck' WITH INIT;
GO

Next I’ll create a table on the USB filegroup, and create an explicit transaction and force the log and data to disk.

USE [DeferredTran];
GO

CREATE TABLE [t1] ([c1] INT) ON [FG_USB];
GO

BEGIN TRAN
INSERT INTO [t1] VALUES (1);
GO

CHECKPOINT;
GO

In another window, crash SQL Server using SHUTDOWN WITH NOWAIT. Then unplug the USB drive and restart SQL Server.

Trying to access the database will fail, because the boot page says the database wasn’t shut down cleanly, but the data file necessary to roll back our transaction isn’t available.

USE [DeferredTran];
GO
Msg 945, Level 14, State 2, Line 1
Database 'DeferredTran' cannot be opened due to inaccessible files or insufficient memory or disk space.  See the SQL Server errorlog for details.
ALTER DATABASE [DeferredTran] SET ONLINE;
GO
Msg 5120, Level 16, State 5, Line 1
Unable to open the physical file "G:\FG_USB_1.ndf". Operating system error 2: "2(The system cannot find the file specified.)".
Msg 945, Level 14, State 2, Line 1
Database 'DeferredTran' cannot be opened due to inaccessible files or insufficient memory or disk space.  See the SQL Server errorlog for details.
Msg 5069, Level 16, State 1, Line 1
ALTER DATABASE statement failed.

By explicitly setting that file offline, we can bring the database online, albeit with a deferred transaction. BE CAREFUL! The only way to bring that file online again is to restore it from a backup!

ALTER DATABASE [DeferredTran] MODIFY FILE (NAME = N'FG_USB_1', OFFLINE);
GO

ALTER DATABASE [DeferredTran] SET ONLINE;
GO

From the error log…

2011-06-24 09:54:27.650 spid51       Setting database option ONLINE to ON for database DeferredTran.
2011-06-24 09:54:27.650 spid51       Starting up database 'DeferredTran'.
2011-06-24 09:54:27.680 spid51       Filegroup FG_USB in database DeferredTran is unavailable because it is Offline. Restore or alter the filegroup to be available.
2011-06-24 09:54:27.690 spid22s      Error: 3410, Severity: 16, State: 1.
2011-06-24 09:54:27.690 spid22s      Data in filegroup FG_USB is offline, and deferred transactions exist. Use RESTORE to recover the filegroup, or drop the filegroup if you never intend to recover it. Log truncation cannot occur until this condition is resolved.
2011-06-24 09:54:27.690 spid22s      Error: 3314, Severity: 21, State: 1.
2011-06-24 09:54:27.690 spid22s      During undoing of a logged operation in database 'DeferredTran', an error occurred at log record ID (24:101:23). Typically, the specific failure is logged previously as an error in the Windows Event Log service. Restore the database or file from a backup,
2011-06-24 09:54:27.690 spid22s      Error: 3414, Severity: 21, State: 2.
2011-06-24 09:54:27.690 spid22s      An error occurred during recovery, preventing the database 'DeferredTran' (database ID 22) from restarting. Diagnose the recovery errors and fix them, or restore from a known good backup. If errors are not corrected or expected, contact Technical Support.

Look at the part about log truncation – until we resolve the deferred transaction, it will hold the transaction log active and the log will grow and grow and grow… in this case, the database remains SUSPECT and unusable though, but if the transaction was deferred because of a restore issue, the database may be online and usable.

We can tell if there are deferred transactions using the sys.dm_tran_database_transactions DMV. A deferred transaction will have a transaction status with the 0x80000 bit set (undocumented).

SELECT * FROM sys.dm_tran_database_transactions'
WHERE [database_transaction_status] & 0x80000 = 0x80000;
GO

And if we look in sys.dm_tran_locks we’ll see that it’s holding locks that could cause our applications to block:

SELECT
    [request_session_id] AS [SPID],
    [resource_type] AS [LockType],
    DB_NAME ([resource_database_id]) AS [DB],
    [resource_description] AS [Resource],
    [resource_associated_entity_id] AS [ResourceID],
    [request_mode] AS [Mode],
    [request_status] AS [Status]
FROM sys.dm_tran_locks
WHERE [request_session_id] < 0;
GO
SPID LockType DB           Resource ResourceID        Mode Status
---- -------- ------------ -------- ----------------- ---- ------
-3   RID      DeferredTran 3:8:0    72057594038779904 X    GRANT
-3   PAGE     DeferredTran 3:8      72057594038779904 IX   GRANT
-3   OBJECT   DeferredTran          2105058535        IX   GRANT

Notice that there’s no database S lock as all regular connections have, as this SPID isn’t a regular connection. You won’t be able to see the SPID in any other way.

To recover from this we need to restore the offline file from a backup. We’ll also need to perform a tail-log backup so we can restore it to bring the restored file up-to-date.

BACKUP LOG [DeferredTran] TO DISK = N'C:\SQLskills\DeferredTran_log.bck';
GO
Msg 945, Level 14, State 2, Line 1
Database 'DeferredTran' cannot be opened due to inaccessible files or insufficient memory or disk space.  See the SQL Server errorlog for details.
Msg 3013, Level 16, State 1, Line 1
BACKUP LOG is terminating abnormally.

We have to use WITH NO_TRUNCATE…

BACKUP LOG [DeferredTran] TO DISK = N'C:\SQLskills\DeferredTran_log.bck'
WITH NO_TRUNCATE, INIT;
GO

RESTORE DATABASE [DeferredTran] FILE = N'FG_USB_1'
FROM DISK = N'C:\SQLskills\DeferredTran.bck'
WITH MOVE N'FG_USB_1' TO N'C:\SQLskills\FG_USB_1.ndf', NORECOVERY;
GO

RESTORE LOG [DeferredTran]
FROM DISK = N'C:\SQLskills\DeferredTran_log.bck'
WITH NORECOVERY;
GO

RESTORE DATABASE [DeferredTran] WITH RECOVERY;
GO;

And the output from the final RESTORE statement is:

1 transactions rolled back in database 'DeferredTran' (22). This is an informational message only. No user action is required.

RESTORE DATABASE successfully processed 0 pages in 0.006 seconds (0.000 MB/sec).

So there you are – an easy way to create deferred transactions and negative SPIDs to play with.

Enjoy!

MAXDOP configuration survey results

A month ago I kicked off a survey about MAXDOP setting – see here for the survey. I received results for 700 servers around the world! Here they are:

 

The X-axis format is X-Y-Z, where X = number of cores, Y = number of NUMA nodes, Z = MAXDOP setting. I didn't differentiate between hyper-threading or not, or soft vs. hard NUMA.

The results are striking – 75% of all systems out there use a server MAXDOP of zero. Now, this doesn't show whether individual queries are using MAXDOP to override that, or resource governor in 2008+ to override that either and I don't have info on the workload for all these servers – so it's not a result we can draw any concrete conclusions from. However, I do find it interesting that such a high proportion are running fine with MAXDOP 0 – my expectation was that there would be a higher proportion of servers with a non-zero MAXDOP setting.

There are quite a few 'black and white' configuration rules out there – for instance:

  1. Set MAXDOP to 1 if you're seeing CXPACKET waits as the prevalent wait type.

  2. Set MAXDOP to 1 for OLTP systems, and don't do anything else.

  3. Old Microsoft guidance to set MAXDOP to half the number of physical processors.

  4. Set MAXDOP to the number of cores in the NUMA node.

These are all incorrect as *rules*. There is no one-size-fits-all rule for MAXDOP – there are only general guidelines. For instance:

  • For OLTP systems, it can often be beneficial to set MAXDOP to 1 and then use the MAXDOP query hint to override the server-wide setting for queries that can benefit from parallelism.
  • For mixed-workload systems, you need to be careful how you set MAXDOP so you don't inadvertently penalize one of the workloads. Judicious use of the MAXDOP query hint can help here. A more powerful solution for mixed workloads is to use resource governor and have a workload group for each portion of the workload, with a different MAXDOP for each workload group.
  • For systems with high CXPACKET waits, investigate why this is the case before reducing MAXDOP. It's easy to come up with a demo where there are lots of CXPACKET waits, and while reducing MAXDOP (for the server or the query) reduces the CXPACKET waits, it also makes the query take a lot longer. CXPACKET waits can be because the statistics are incorrect and the query execution system divides up the work by the out-of-date statistics
  • Consider using the cost threshold for parallelism setting – see Jonathan's recent post here.

Using the resource governor as I described above can be a very easy way to mess around with the MAXDOP setting – especially for applications with legacy code that you cannot change, and you don't want to set a server-wide MAXDOP setting. This even works if the legacy code uses MAXDOP query hints, because the resource governor workload group MAXDOP setting *cannot* be overridden.

The key point when making any change for performance tuning is to test the change before putting it into production and work out which setting works best for your workload on your system – rather than blindly following 'this is the best way' rules for settings that people publish.

In other words, what should your MAXDOP be set to? It depends! :-)

One thing to be aware of: if you change the MAXDOP setting for the server, it will flush the plan cache when you do. It shouldn't, but it does. Be careful when doing this on a production server…

Thanks to all those who replied to the original survey!