SQLskills SQL101: NUMA Support in SQL

I’ve blogged about NUMA and SQL Server in the past and how SQLOS automatically handles NUMA configurations for SQL Server by default, meaning typically as a DBA you don’t need to customize any configuration options for SQL Server to optimize it’s internal structures for NUMA systems.  I’ve also talked about vNUMA issues in virtual machines with Hot-Add CPU enabled and how that affects performance.

While speaking at the SQLintersection conference this week an attendee came up to me and asked if NUMA support in SQLOS was Edition specific in SQL Server.  It’s not an Edition-specific feature, SQLOS configures the internal structures of SQL Server when it boots based on the hardware NUMA layout being presented by the Windows OS, regardless of the Edition of SQL Server that is running.  The attendee then pulled up a blog post recently published that stated that “NUMA awareness is an Enterprise Edition feature” in SQL Server; unfortunately that blog post (extract below) is incorrect.

image

With anything on the internet, including this blog, you need to verify the information before you trust it if actual tests and verification aren’t provided.  Even if the information provides a demonstration or reproduction that shows a particular behavior, you need to look at the version and time-frame for when the information was published because things change constantly in technology and what was accurate a year ago may not be accurate today.  So to demonstrate that SQL Server Standard Edition will recognize NUMA configurations, here are two screenshots from one of our lab servers at SQLskills.

image

image

As you can see, this instance sees NUMA and has two NUMA nodes configured inside of SQLOS running Standard Edition.  NUMA is not an Edition-specific feature in SQL Server and never has been.  SQLOS optimizes the way the internal structures are created under NUMA for memory nodes (and CPU nodes) based on how the OS is presenting the hardware layout.

Faking Hardware NUMA on a SMP Machine for Learning and Demos

I’ve blogged about NUMA a couple of times in the past and we cover it extensively in our Immersion Events, specifically when discussing SQLOS and memory and scheduling management.  One of the challenges with this is that laptops are SMP machines, so unless you have a server with hardware NUMA it is impossible to show the differences in how SQLOS boots up with or without NUMA, at least that’s what I used to think.  It turns out that in Windows 7 and Windows Server 2008 R2, the support for > 64 processor support and processor groups allows you to be able to fake a hardware NUMA configuration at the OS level for testing purposes.

To show how this works I am going to use one of my test VMs running in VMware Workstation 8 on my Dell M6700 laptop that I use regularly for teaching and presentation demos.  For the purposes of this post I have reconfigured the VM to have four CPUs with two cores each and 16GB RAM.

SMP Configuration (Default)

Under the default configuration of the VM has a single memory node presented by Windows to SQLOS so it starts up with a single NUMA node and all of the logical processors are assigned to that node, as shown in the Server Properties dialog below.

image

We can also see this information from the DMVs with the following query:

SELECT
    mn.processor_group
    , mn.memory_node_id
    , n.node_id AS cpu_node_id
    , n.online_scheduler_count
    , n.cpu_affinity_mask
    , n.online_scheduler_mask
FROM sys.dm_os_memory_nodes AS mn
INNER JOIN sys.dm_os_nodes AS n
    ON mn.memory_node_id = n.memory_node_id;

image

NUMA Configuration (bcdedit)

A fake NUMA configuration can be created by setting the groupsize value in the Boot Configuration Data (BCD) using BCDEdit to the number of CPUs that should be grouped into each processor group. The appropriate values are 1 to 64 in powers of 2. This isn’t documented in the standard BCDEdit documentation in the Books Online, but it is documented in the Windows Drivers development section.

BCDEdit /set (Windows Drivers)

To use BCDEdit you have to be running from a elevated command prompt as a local administrator.  For the purposes of this post I am going to use a groupsize of 2.

image

After changing the BCD, it is necessary to restart the machine for the configuration changes to take effect to the configuration.

Disclaimer:  This is not something you should do on a production SQL Server and this is only something that should be used for testing and/or learning about SQLOS on a non-NUMA hardware machine.

After rebooting the server, if we look at the Server Properties in, we will have four NUMA nodes, each with two logical processors assigned to it.

image

Looking at the DMVs in SQL Server will also show the changes to how SQLOS started up and configured the instance.

image

Additionally, the Performance Counters in PerfMon for the Buffer Nodes we can see memory usage information per node:

image

Keep in mind this is all false information, but you can really learn a lot about SQL Server and perform tests of behaviors under NUMA using this method.

SQL Server and Soft NUMA

Friday of last week, Steve Jones (Blog|Twitter) asked a question on the Twitter #sqlhelp tag about SQL Server and Soft NUMA that prompted me to write the long waiting follow up to my first post Understanding Non-Uniform Memory Access/Architectures (NUMA). The question that Steve asked was:

@way0utwest: “Do you always have one soft NUMA node? Say a one CPU PC or SMP architecture? #sqlhelp”

The whole start of my real work looking into the different NUMA configurations was actually around how to configure SQL Server for Soft NUMA so that I would see the benefits that are mentioned in two of the topics in the Books Online. The Understanding Non-uniform Memory Access topic states (or at least at the time I wrote this it stated):

"The benefits of soft-NUMA include reducing I/O and lazy writer bottlenecks on computers with many CPUs and no hardware NUMA. There is a single I/O thread and a single lazy writer thread for each NUMA node. Depending on the usage of the database, these single threads may be a significant performance bottleneck. Configuring four soft-NUMA nodes provides four I/O threads and four lazy writer threads, which could increase performance."

I remembered reading something along these lines after recently looking at a number of client servers that had 24 cores and up to 128GB RAM installed in them but were not hardware NUMA configurations. Having the extra lazy writer threads sounded like a good thing so I went about trying to figure out how to setup Soft NUMA which lead me to the BOL Topic How to: Configure SQL Server to Use Soft-NUMA which provided the following scenario:

“Instance A, which experiences significant I/O, now has two I/O threads and two lazy writer threads, while instance B, which performs processor-intensive operations, has only one I/O thread and one lazy writer thread. "

Perfect this is exactly what I was thinking, so now off to do some testing on one of my VM’s, and this is where I hit my first bump. I needed to figure out the appropriate CPUMask values for a 24 core system to subdivide it into four Soft NUMA nodes of 6 cores each, but the Books Online, and the available blog posts for the topic don’t make that very easy to accomplish and they don’t explain the basis of the CPUMask very well either. I was able to figure out with some testing, that the CPUMask is just a bitmask of the CPUMask values of the CPU’s that contribute to Soft NUMA node. Clear as mud right? 

To make this a bit easier, I am going to paste some tables that I created in Excel below that demonstrate what I am talking about, but for a 8 core example and not the 24 core solution I was originally working on. The first thing you have to keep in mind is that CPU’s are numbered with a CPUID starting with a base of zero, and each CPU has its own mask value that is two to the power of the CPUID. For the example in the How to: Configure SQL Server to Use Soft-NUMA Books Online Topic, this works out logically to the table below:

CPUIDMaskSoft NodeCPU BitMaskNode Registry CPUMask
01130x03
12
242120x0C
38
41632400xF0
532
664
7128

The CPU Mask Sum column is the sum of the Masks for each of the CPU’s in the Soft Node and then the Node Registry CPUMask is the hexadecimal representation of the decimal value. This can be obtained using the DEC2HEX() function in Excel, or by placing the Windows Calculator into Programmer Mode from the File Menu.

image  image

So now that we have an easier to understand example covered, here is what the CPUMask for a 24 core server with four 6 core Soft NUMA Nodes would be:

CPUIDMaskSoft NodeCPU BitMaskNode Registry CPUMask
011630x3F
12
24
38
416
532
664240320xFC0
7128
8256
9512
101024
112048
12409632580480x3F000
138192
1416384
1532768
1665536
17131072
182621444165150720xFC0000
19524288
201048576
212097152
224194304
238388608

To test this configuration out, I went to one of my SQL Server 2008R2 VM’s on my laptop and made use of the –Pn startup trace flag which tells SQL Server to create n schedulers during the SQLOS boot up even if the actual hardware does not exist for those schedulers.  This is an undocumented startup parameter and it should NEVER BE USED in a production server.  Just because the SQLOS creates additional schedulers, doesn’t mean that you physically have additional processors in the server, and if you want to see how bad this startup parameter can cause performance to drag, create 24 schedulers on your laptop and then run a couple of concurrent queries in SQL Server.  After adding ;-P24 to startup parameters and restarting my SQL instance I now have 24 schedulers in sys.dm_os_schedulers as shown below:

image

Now all I have to do is add the registry keys to create my Soft NUMA nodes. At this point I realized that I had really complicated figuring out the hexadecimal CPUMask values. The DWORD registry value accepts the Decimal or Hexadecimal value in regedit, so you can skip converting the CPU BitMasks to hexadecimal, if you ever actually configure Soft NUMA for one of your servers.

image

After setting all of the registry keys for my instance, and then restarting my instance I went back in to sys.dm_os_nodes to make sure that my NUMA configuration changes were correctly made.

image

Everything looks perfect, I now have four Soft NUMA nodes with 6 schedulers in each node, so lets go look at our new Lazy Writer Threads in sys.dm_exec_requests:

image

This is only a subset of the system sessions that actually exist on the instance but trust me when I tell you, all of the Lazy Writer threads are shown above; there is only one.  So what happened here? I have four Soft NUMA nodes, we confirmed that so why don’t I have any additional Lazy Writer threads?  I was certain I had done something wrong here, so I sent tweet to Amit Banerjee (Blog|Twitter) a good friend of mine from the MSDN Forums who happens to be on the Product Support Team in India. It turns out that the Books Online is wrong about the benefits of using Soft NUMA, and this was pointed out a few years ago by Bob Dorr, another escalation engineer in Product Support on his blog post How It Works: Soft NUMA, I/O Completion Thread, Lazy Writer Workers and Memory Nodes. Two key points are made by Bob in this blog post; first that Lazy Writers are only assigned to Hard NUMA nodes, which map to SQLOS memory nodes, and second the IO Completion threads that are created per Soft NUMA node have nothing to do with Disk I/O, but are instead for connection requests and TDS traffic over the network.

Based on this, I filed a connect feedback for the incorrect entries in the Books Online for Soft NUMA that will eventually see this incorrect information corrected. At the point that I realized this, I had just over four hours of time dedicated to figuring out how to make something work that will never actually do what I expected based on the documentation, but I had a much better understanding of how NUMA actually worked inside of SQL Server, which is something that had up to this point been a big mystery to me.

This leaves open the question, when would you actually make use of Soft NUMA in SQL Server, if it doesn’t actually provide CPU to Memory masking like Hard NUMA does, or provide any benefits of having additional Lazy Writer Threads?  One use case would be to provide port affinity to drive specific connections to a specific node as discussed in the How It Works: SQL Server 2005 NUMA Basics blog post, also by Bob Dorr, but this only applies to serial execution plans and not to parallel execution plans as detailed by Bob in the NUMA Connection Affinity and Parallel Queries blog post. An example of where Soft NUMA usage is the ETL World Record set by SQL Server back in February 2008 which used Soft NUMA to balance the connection load as a part of achieving the world record of 1.18TB of flat file data loaded in 1794 seconds.