Problems from having lots of server memory

A month ago I kicked off a survey asking how much memory is installed on your largest server that’s running SQL Server. Thank you to everyone that responded.

Here are the results:

[Chart: survey results showing the distribution of installed memory on respondents' largest SQL Server]

The “other” values are:

  • 3 more for the ‘128 GB or more, but less than 256 GB’ count
  • 1 more for the ‘Less than 16 GB’ count
  • One poor soul who only has 512 MB in their server!

This is very interesting:

  • I expected the majority of servers to fall into the middle of the range (around 128 GB), but it’s actually only 37% that fall into the 64 GB to 256 GB range.
  • I’m surprised at the percentage of servers (41%) with 256 GB or more.
  • I didn’t know what percentage would have more than 1 TB, so almost 10% is really cool to see.

So what do these results mean? Well, more than half of all respondents have lots of memory (more than 128 GB) in their largest server. The more memory you have, the more important it is to make sure that memory is being used efficiently: that you’re not wasting space in the buffer pool (see here), and that you’re not churning the buffer pool with poor query plans causing lots of reads (see here).
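
To give an idea of the first check, here’s a minimal sketch using sys.dm_os_buffer_descriptors (you’ll need VIEW SERVER STATE permission); it shows how much of the buffer pool each database is using, and how much of that in-memory space is sitting empty:

-- Buffer pool usage and empty space per database
SELECT
    DB_NAME (database_id) AS [Database],
    COUNT (*) * 8 / 1024 AS [Cached Size (MB)],                   -- data file pages are 8 KB
    SUM (CAST (free_space_in_bytes AS BIGINT)) / 1048576 AS [Free Space (MB)]
FROM sys.dm_os_buffer_descriptors
GROUP BY database_id
ORDER BY [Cached Size (MB)] DESC;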

What other things could be problems with large amounts of memory?

  • Shutting down the instance. This will checkpoint all the databases, which could take quite a long time (minutes to hours) if the databases suddenly have lots of dirty pages that all need to be flushed out to disk. This can eat into your maintenance window if you’re shutting down to install an SP or a CU.
  • Starting up the instance. If the server’s POST checks memory, the more memory you have, the longer that will take. This can eat into your allowable downtime if a crash occurs.
  • Allocating the buffer pool. We’ve worked with clients with terabyte+ buffer pools where they hit a bug on 2008 R2 (also in 2008 and 2012) around NUMA memory allocations that would cause SQL Server to take many minutes to start up. That bug has been fixed in all affected versions and you can read about it in KB 2819662.
  • Warming up the buffer pool. Assuming you don’t hit the memory allocation problem above, how do you warm up such a large buffer pool so that you’re not waiting a long time for your ‘working set’ of data file pages to be memory resident? One solution is to analyze your buffer pool when it’s warm, to figure out which tables and indexes are in memory, and then write some scripts that will read much of that data into memory quickly as part of starting up the instance (see the sketch after this list). For one of the same customers that hit the allocation bug above, doing this got them to their steady-state workload performance much faster than waiting for the buffer pool to warm up naturally.
  • Complacency. With a large amount of memory available, there might be a tendency to slacken off proactively looking for unused and missing index tuning opportunities, plan cache bloat, or wasted buffer pool space (as I mentioned above), thinking that having all that memory will be more forgiving. Don’t fall into this trap. If one of these things becomes such a problem that it’s noticeable on your server with lots of memory, it’s a *big* problem that may be harder to get under control quickly.
  • Disaster recovery. If you’ve got lots of memory, it probably means your databases are getting larger. You need to start considering the need for multiple filegroups to allow small, targeted restores for fast disaster recovery. This may also mean you need to think about breaking up large tables, using partitioning for instance, or archiving old, unused data so that tables don’t become unwieldy.
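
For the buffer pool warm-up analysis, here’s a minimal sketch of the kind of query I mean, run in the relevant database while the buffer pool is warm; it lists which tables and indexes are taking up buffer pool space:

-- Which tables and indexes from this database are in the buffer pool, and how much space do they use?
SELECT
    OBJECT_NAME (p.[object_id]) AS [Table],
    i.[name] AS [Index],
    COUNT (*) * 8 / 1024 AS [Buffer Pool (MB)]
FROM sys.dm_os_buffer_descriptors AS bd
JOIN sys.allocation_units AS au
    ON au.allocation_unit_id = bd.allocation_unit_id
JOIN sys.partitions AS p
    ON (au.[type] IN (1, 3) AND au.container_id = p.hobt_id)     -- in-row and row-overflow data
    OR (au.[type] = 2 AND au.container_id = p.partition_id)      -- LOB data
JOIN sys.indexes AS i
    ON i.[object_id] = p.[object_id] AND i.index_id = p.index_id
WHERE bd.database_id = DB_ID ()
GROUP BY p.[object_id], i.[name]
ORDER BY [Buffer Pool (MB)] DESC;

From that output you could write simple warm-up scripts (e.g. scans of the top tables and indexes) to run as soon as the instance starts.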

Adding more memory is one of the easiest ways to alleviate some performance issues (as a band-aid, or a seemingly risk-free temporary fix), but don’t think it’s a simple thing to just max out the server memory and then forget about it. As you can see, more memory leads to more potential problems, and these are just a few things that spring to mind as I’m sitting in the back of class here in Sydney.

Be careful out there!

 

Database mirroring: avoiding ‘cannot obtain a LOCK resource’ problems

This is an interesting case that came up recently, one that I’ve seen a few times before but haven’t seen explained anywhere.

When using database mirroring, you may see 1204 and 1454 errors on the mirror server even when there is no other activity on it apart from database mirroring:

2013-01-15 12:00:00.410 spid29s Error: 1204, Severity: 19, State: 4.
2013-01-15 12:00:00.410 spid29s The instance of the SQL Server Database Engine cannot obtain a LOCK resource at this time. Rerun your statement when there are fewer active users. Ask the database administrator to check the lock and memory configuration for this instance, or to check for long-running transactions.
2013-01-15 12:00:00.430 spid29s Error: 1454, Severity: 16, State: 1.
2013-01-15 12:00:00.430 spid29s Database mirroring will be suspended. Server instance 'instancename' encountered error 1204, state 4, severity 19 when it was acting as a mirroring partner for database 'dbname'. The database mirroring partners might try to recover automatically from the error and resume the mirroring session. For more information, view the error log for additional error messages.

With no activity on the mirror server apart from mirroring, why are locks being taken?

The answer is that locks must be acquired when a transaction is rolled back on the mirror; that’s just how logging and recovery work. I discussed this in my post Lock logging and fast recovery back in 2009 – locks are acquired during rollback to ensure that other transactions don’t prevent rollback from completing. The behavior is the same when a mirrored transaction rolls back – the locks need to be acquired on the mirror before the rollback can commence.

You can easily see this for yourself. Create a mirroring session and then a constant insert workload of small transactions. If you look at the locks on the mirror server using sys.dm_tran_locks, you’ll see no locks being acquired in the mirror database. Now create a transaction on the principal server that inserts 10,000 records. Roll it back and quickly examine the locks on the mirror server – you’ll see thousands of locks being acquired in the mirror database.
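
If you want to watch this happen, a minimal sketch like the following, run on the mirror server during the rollback, will show the locks being acquired in the mirror database:

-- Current lock requests, grouped by database and lock mode
SELECT
    DB_NAME (resource_database_id) AS [Database],
    request_mode,
    COUNT (*) AS [Lock Count]
FROM sys.dm_tran_locks
GROUP BY resource_database_id, request_mode
ORDER BY [Lock Count] DESC;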

Why can we see 1204 errors with database mirroring?

Each lock structure takes up a small amount of memory, and the more locks that are acquired and held at any one time, the more memory is required for the lock structures. On a memory-constrained system it’s possible to run out of lock memory, resulting in error 1204.

With database mirroring, if the principal server has a lot more memory than the mirror server, and there are a lot of large rollbacks occurring, or there are multiple mirroring sessions where rollbacks are occurring, it’s possible that the mirror server will run out of lock memory, resulting in errors 1204 and 1454 as above.
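
If you want to keep an eye on lock memory on the mirror server, one simple option (a sketch using the Memory Manager performance counters exposed through sys.dm_os_performance_counters) is:

-- Memory currently used by lock structures, in KB
SELECT
    [object_name],
    counter_name,
    cntr_value AS [Lock Memory (KB)]
FROM sys.dm_os_performance_counters
WHERE [object_name] LIKE '%Memory Manager%'
    AND counter_name LIKE 'Lock Memory (KB)%';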

How can we stop 1204 errors occurring?

The easiest fix for this (and to hopefully prevent it happening) is to ensure that the mirror server has the same amount of memory as the principal server – so the principal should not be able to create more locks than the mirror server can accommodate. Even then, you may still see these errors if there is a lot of rollback activity across mirrored databases and it all happens to be replayed on the mirror server at the same time – in which case you need more memory on the mirror server than on the principal.

You also need to ensure that nothing else is using memory on the mirror server – e.g. database snapshots, or any other activity apart from just mirroring.

I hope this explanation helps prevent some of you out there wasting time trying to troubleshoot this issue.

Max server memory configuration survey results

A long time ago, in a galaxy far, far away I kicked off a survey about memory configuration. Actually it was back at the start of January and I've been terribly remiss about posting the survey results!

I was interested in how the setting of Max Server Memory (which controls the maximum size of the buffer pool) related to the physical memory available on the server.

Thanks to the people who sent me data from 525 servers worldwide.

Here is the data, presented in two charts.

Firstly, when Max Server Memory is actually set:

[Chart: Max Server Memory setting vs. installed physical memory, for servers where it is set]

I had a few data points at the 512 GB and 768 GB sizes, and their Max Server Memory settings were all valid.

What's interesting in this graph is the wide variety of Max Server Memory settings for any specific amount of physical server memory.

Rather than me explaining how you figure out how much physical memory to reserve for the operating system and other SQL Server memory uses, Jonathan just posted an explanation and loose formula over on his blog – so I'll point you there.

There were a disturbingly large number of SQL Servers that did *not* have Max Server Memory set at all:

[Chart: servers with no Max Server Memory set, by installed physical memory]

These systems may suffer performance problems when the operating system has to pressure the SQL Server buffer pool to give back some memory – it's always better to set a Max Server Memory value – again, see Jonathan's post.

The large spike at 8 GB in the graph above is because one person sent me a few hundred sets of results for 8 GB servers without Max Server Memory set.

Here's the same set of results without the 8 GB spike:

[Chart: servers with no Max Server Memory set, by installed physical memory, excluding the 8 GB spike]

Quick summary: make sure you have an appropriate Max Server Memory setting for your servers to avoid performance problems.
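
Checking and changing the setting is straightforward. Here's a minimal sketch – the 26 GB (26624 MB) value is just an illustration for a 32 GB server; use Jonathan's formula to work out the right number for your environment:

-- 'max server memory (MB)' is an advanced option
EXEC sp_configure 'show advanced options', 1;
RECONFIGURE;

-- Check the current setting (the default of 2147483647 means it hasn't been set)
EXEC sp_configure 'max server memory (MB)';

-- Example only: cap at 26 GB on a 32 GB server
EXEC sp_configure 'max server memory (MB)', 26624;
RECONFIGURE;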