Survey results on rebooting – is it good or bad?

About a month ago I kicked off a survey around regularly rebooting SQL Server (see here). There is much debate about rebooting SQL Server (or Windows Server hosting SQL Server), regularly or even at all so I've been looking forward to the results.

My view is that there are circumstances where rebooting, even regularly, is acceptable – but you have to have a good reason to do it because of the downsides of restarting the SQL Server process. These downsides include:

  • The buffer pool goes completely cold. Warming up again requires reading (potentially a lot of) data back into memory, after allocating that memory from Windows.
  • All query plans are lost and will need to be compiled again.
  • All information about the state of SQL Server, used to give DMV output, is lost.

In other words, the steady-state for the production workload is lost and has to be attained again.

As you'll see from the results, the most common reason for a regular reboot is part of the Windows/SQL patching process, but there are plenty of reasons cited that are NOT valid in my opinion, and some others that are.

 

The "Other" responses are:

  • 101 x (paraphrasing) "After windows/SQL/firmware updates that require a reboot."
  • 30 X (paraphrasing) "Quarterly."
  • 10 x "Only if any problems occur or any maintenance activity scheduled in off hours."
  • 5 x "Only if I've observed some resource contention after some time. Usually a result of not being able to implement the solution and having to work around the problem."
  • 4 x "Every two months."
  • 1 x "I normally go 360+ days between reboots. And then it's just for giggles. No indication it is needed or required."
  • 1 x "I would like to do it monthly but my boss doesn't support it."
  • 1 x "Yes, every two weeks."
  • 1 x "Yes, twice a week."

 

The "Other" responses are (I've commented on those in bold):

  • 74 x (paraphrasing) "MS security patches – manually installed after testing them on duplicate system."
  • 67 x (paraphrasing) "As part of the monthly MS patching."

  • 17 x (paraphrasing) "No regular reboots."
  • 9 x "We reboot our SQL servers about twice a year average for updates or other issues."
  • 6 x "As required for patching activities and only after ensuring that the potential for patching has been reduced by eliminating non-required components and programs."
  • 4 x "After patch installation or maintenance such as password updates to test cluster failover."
  • 3 x "Process unkilled and some memory problems."
  • 2 x (paraphrasing) "Because the business demands it, for no valid reason."
    • You need to work out why that process was put in place. My guess is that it's based on historical requirements to do so on older operating systems, or because of bugs, and it's just stuck. Very likely the process can be removed if the reboot schedule is very frequent (e.g. daily or weekly).
  • 1 x "I have issues with a CLR process."
    • Then try to fix that process/code rather than resorting to a reboot.
  • 1 x "I've been told that certain errors won't be reported by Windows without a reboot."
  • 1 x "Last rebooted about 50 days ago – only reboot when necessary (patches)."
  • 1 x "Microsoft Insists."
    • I don't buy this one, unless it's to work around a specific bug that will be patched at some point.
  • 1 x "Nightly reboots of a specific SQL Server are required by the vendor whose app it supports."
    • This is badly messed up and you need to push back on the vendor. My guess here is that they have a leaky application (i.e. memory leaks) and this is their solution.
  • 1 x "Non Windows app memory leaks."
    • Again, try to fix the memory leaks rather than resorting to rebooting.
  • 1 x "Our application becomes unstable otherwise."
    • Fix the application…
  • 1 x "Regular test of clusters AND SQL Server performance degrades without restart."
  • 1 x "Save power, turn it off at night."
    • This is interesting and one I haven't considered before. Are you sure the gain in power saving is worth the downsides I mention at the start of the post?
  • 1 x "SQL Server connection time out error."
  • 1 x "VMware Snapshot."
    • This I don't know about (that's why I employ Jonathan :-), but I'd be surprised if this process requires the server to be rebooted.
  • 1 x "We do full server backups weekly. we briefly shut SQL, copy c drive where OS and SQL software live – paging file etc. then bring SQL up and finish server full backup. This takes approx 5 min per week."
    • This is messed up. SQL Server and Windows can both be backed up without having to be shut down.
  • 1 x "We have a weekly Maintenance window and it keeps the event logs manageable."
    • This is messed up. Windows and SQL logs can be cycled without rebooting.

Quite interesting results.

There have been many problems of the years which have required regular reboots to clear, but these are *mostly* gone with the advent of Windows Server 2008/2008 R2. I think there are three main reasons which still are valid for rebooting regularly:

  1. As part of Windows/SQL/firmware patch installs.
  2. To test failover procedures.
  3. To work around a Windows or SQL bug.

As an example of #3, there's a problem still in Windows Server 2008 with the file cache taking up all the memory, depending on which APIs are used to call it (see KB 976618). And what about the various SQL Server memory leak bugs over the years (e.g. KB 959007).

There are plenty of other symptoms that can be "fixed" with a server reboot. A great example of this is plan cache bloat from lots of plans for ad-hoc SQL queries. This can eat away at the memory available for the buffer pool to store data and cause the server performance to grind down (see Kimberly's post Plan cache, adhoc workloads and clearing the single-use plan cache bloat). And what about buffer pool memory being wasted by massive index fragmentation, resulting in poor performance (see my post Performance issues from wasted buffer pool memory). Or any of the reasons in bold above that mention and application instability, memory leak, or issue?

For these problems, rebooting the server is addressing the symptom of the problem, not the root-cause. We've had several clients in the last month where massive performance problems were being "solved" by failing over to the other cluster node, and they turned out to be caused from not having Lock Pages In Memory set (see KB 918483, and take care to note all the Windows bugs it explains that can cause working set trimming too). They were addressing the symptom without researching the cause.

Bottom line: if you're regularly rebooting Windows/SQL Server, make sure it's for a good reason and not just because someone thinks it's a good thing to do or it's the chosen way to fix a problem that should be fixed in some other way.

And don't immediately assume that someone who does reboot Windows/SQL Server regularly is an idiot – they may have a very good reason, or may just be ignorant… (see Ignorance is not stupidity).

Thoughts?

8 thoughts on “Survey results on rebooting – is it good or bad?

  1. For the VMWare snapshot, a reboot shouldn’t be required for this but we have seen some issues around Veeam backups which use snapshots that we haven’t been able to completely resolve yet.

  2. VMware snapshots. I’ll bet they are using it as a backup/point in time snapshot of the whole system. I alluded to this in our brief email exchange last week. In order to have a "consistent" image of SQL Server, you must shutdown the OS before taking a VMware snapshot (assuming vendor extensions aren’t being used). Otherwise you’ll have a crash-state "backup" and there is also the very real potential of a corrupt image. Normally the snapshot tool will "freeze" the system in order to take the snapshot. However the image of both the OS and SQL Server reflect the current CPU running state – and for SQL Server there are MDF/LDF issues to deal with which are dependent upon configuration and snapshot options. Therefore it is recommended that you shut the system down and take the snapshot.

    Of course, people can purchase vendor software that plugs into SQL Server and assists with the snapshot – this allows for a "live" snapshot. This is provided either by the SAN vendor, or various VMware VARs. Plus there is CLI support available within VMware and samples for the DYI crowd. Several SAN vendors have snapshot software that plugs into both VMware and SQL Server to help make backups (and "replicated" copies) painless. I’m most familiar with NetApp, but know that both IBM and (of course) EMC have something too. These vendor products tend to have incremental costs – but they take the hassle out of managing snapshots/backups.

    There are more options and capabilities in the VMware & SAN world, use of them with SQL Server offers a new host of problems to think about. Done wrong, your "backup" may not contain the data you want. Done right, and backups are faster, and recovery easier.

    -Mike

  3. I know a large multi-national that has a standard, regular reboot policy for their Application and SQL Servers. The argument for it goes a little like this.

    "Over the years, there have been different bugs in Windows Server or SQL Server or other Server Applications, which can all be solved by a server reboot. It is easier to achieve our SLAs with the business by doing a bi-monthly reboot, rather than hitting an issue, tracking down the root cause and implementing a specific solution.

    With the regular reboot, we may have (potential) problems that never get a chance to impact the business.

    If we have an issue with a server / application requires a more frequent reboot, or the SLA means we can’t do the regular re-boot, then we can investigate that particular issue."

    With the number of SQL Servers and different applications the business uses, I can appreciate this approach. My inner-geek isn’t happy, but it’s the goal is to provide the best service for the business.

  4. A colleague of mine managed a Windows 2003 box whose IPSEC stack had become corrupted some time in the 6 months following it’s last reboot. Bounced the box after some maintenance and it never came back up. He had to drive to the site and pop a recovery CD in.

    So now our servers have regular reboots at night so we catch these issues outside of business hours. As a consequence, SQL Server gets regularly rebooted too.

  5. I once worked at a client site where they rebooted the sql server once a week. I queried the server admin as to why this was done, the reason was because at least once a week the whole server would stop responding (stop accepting RDP, SQL Server would not accept connections, etc). A quick look through the event logs led me to the McAfee website for their VSE 8.0 product. Applying the patch for a memory leak issue, this resulted in no further problems. Simple detection of a problem and the application of a patch was all that was required.

    With regards to the snapshotting, i don’t agree that you have to shut down the sql server to take a snapshot (same as a reboot anyway isn’t it?). VMWare may recommend it but i’ve never seen it as an absolute given. The VMWare tools quiescse driver has had its problems over the time but at another client where I worked on a project and virtualised a whole farm of sql servers we used VMWare VCB and Legato (now EMC) Networker for VM full and delta backups. VCB works buy snapshotting the running VM and quiescing the file system and server calls to an undo file. When VCB removes its snapshot the data is rolled into the VM’s file system and all changes become permanent. With VDR now replacing VCB i expect the VM backup process to become even more flexible than it was previously.

    Regards

    Perry

  6. I restart on every software release: Two reasons- 1) to run on alternate cluster nodes ( Proves to Upper Management we can run on either node ) and to clear the dmvs which makes finding issues easier if the new release introduces something nasty.
    If patching requires a restart then I have to restart; other than that I will not restart a server unless I really have to.

  7. We reboot every two weeks during our Miantenance Window, if we can, otherwiase on the next Maintenance. We do Cold backup snaps of our Oracle Databases and MS patches require reboots most of the time. I do not trust MS Windows to run more that a month, as we have so much crap running on these system by the Air Force, that some things just quit working or do not work as well. Examples: McAfee software causes issues; Symantec causes issues; Remote desktop disconnects for no reason.

  8. This issue always confuses me. I understand the implications of restarting (buffer pool, cache plan, etc) however, one of the most basic mantras of IT security is keeping machines patched.

    MS updates come out every month – it seems to follow servers should be rebooted every month. I can see maybe always being one month behind if you want to test the patches first on a dev server. Though, frankly, when we say "test" on a dev server, I’m curious what percentage of folks do anything more than apply the updates and see if anything crashes. Are we really reading the security bulletins, selectively determining what to apply, then comparing performance to our baseline? That alone would be a full time job.

    Ten minutes downtime a month is still 0.99977 uptime for the year. (Understanding that for a period of time that next day you’ll have reduced performance as the buffer pool and cache plan repopulate.)

    -Henry

Leave a Reply

Your email address will not be published. Required fields are marked *

Other articles

Some thoughts on courage

(This is also the Ponderings – editorial – in today’s SQLskills newsletter.) I want to start out this post by sincerely thanking everyone who emailed

Explore

Imagine feeling confident enough to handle whatever your database throws at you.

With training and consulting from SQLskills, you’ll be able to solve big problems, elevate your team’s capacity, and take control of your data career.