Friday of last week, Steve Jones (Blog|Twitter) asked a question on the Twitter #sqlhelp tag about SQL Server and Soft NUMA that prompted me to write the long waiting follow up to my first post Understanding Non-Uniform Memory Access/Architectures (NUMA). The question that Steve asked was:

@way0utwest: “Do you always have one soft NUMA node? Say a one CPU PC or SMP architecture? #sqlhelp”

The whole start of my real work looking into the different NUMA configurations was actually around how to configure SQL Server for Soft NUMA so that I would see the benefits that are mentioned in two of the topics in the Books Online. The Understanding Non-uniform Memory Access topic states (or at least at the time I wrote this it stated):

"The benefits of soft-NUMA include reducing I/O and lazy writer bottlenecks on computers with many CPUs and no hardware NUMA. There is a single I/O thread and a single lazy writer thread for each NUMA node. Depending on the usage of the database, these single threads may be a significant performance bottleneck. Configuring four soft-NUMA nodes provides four I/O threads and four lazy writer threads, which could increase performance."

I remembered reading something along these lines after recently looking at a number of client servers that had 24 cores and up to 128GB RAM installed in them but were not hardware NUMA configurations. Having the extra lazy writer threads sounded like a good thing so I went about trying to figure out how to setup Soft NUMA which lead me to the BOL Topic How to: Configure SQL Server to Use Soft-NUMA which provided the following scenario:

“Instance A, which experiences significant I/O, now has two I/O threads and two lazy writer threads, while instance B, which performs processor-intensive operations, has only one I/O thread and one lazy writer thread. "

Perfect this is exactly what I was thinking, so now off to do some testing on one of my VM’s, and this is where I hit my first bump. I needed to figure out the appropriate CPUMask values for a 24 core system to subdivide it into four Soft NUMA nodes of 6 cores each, but the Books Online, and the available blog posts for the topic don’t make that very easy to accomplish and they don’t explain the basis of the CPUMask very well either. I was able to figure out with some testing, that the CPUMask is just a bitmask of the CPUMask values of the CPU’s that contribute to Soft NUMA node. Clear as mud right? 

To make this a bit easier, I am going to paste some tables that I created in Excel below that demonstrate what I am talking about, but for a 8 core example and not the 24 core solution I was originally working on. The first thing you have to keep in mind is that CPU’s are numbered with a CPUID starting with a base of zero, and each CPU has its own mask value that is two to the power of the CPUID. For the example in the How to: Configure SQL Server to Use Soft-NUMA Books Online Topic, this works out logically to the table below:

CPUID Mask Soft Node CPU BitMask Node Registry CPUMask
0 1 1 3 0x03
1 2
2 4 2 12 0x0C
3 8
4 16 3 240 0xF0
5 32
6 64
7 128

The CPU Mask Sum column is the sum of the Masks for each of the CPU’s in the Soft Node and then the Node Registry CPUMask is the hexadecimal representation of the decimal value. This can be obtained using the DEC2HEX() function in Excel, or by placing the Windows Calculator into Programmer Mode from the File Menu.

image  image

So now that we have an easier to understand example covered, here is what the CPUMask for a 24 core server with four 6 core Soft NUMA Nodes would be:

CPUID Mask Soft Node CPU BitMask Node Registry CPUMask
0 1 1 63 0x3F
1 2
2 4
3 8
4 16
5 32
6 64 2 4032 0xFC0
7 128
8 256
9 512
10 1024
11 2048
12 4096 3 258048 0x3F000
13 8192
14 16384
15 32768
16 65536
17 131072
18 262144 4 16515072 0xFC0000
19 524288
20 1048576
21 2097152
22 4194304
23 8388608

To test this configuration out, I went to one of my SQL Server 2008R2 VM’s on my laptop and made use of the –Pn startup trace flag which tells SQL Server to create n schedulers during the SQLOS boot up even if the actual hardware does not exist for those schedulers.  This is an undocumented startup parameter and it should NEVER BE USED in a production server.  Just because the SQLOS creates additional schedulers, doesn’t mean that you physically have additional processors in the server, and if you want to see how bad this startup parameter can cause performance to drag, create 24 schedulers on your laptop and then run a couple of concurrent queries in SQL Server.  After adding ;-P24 to startup parameters and restarting my SQL instance I now have 24 schedulers in sys.dm_os_schedulers as shown below:

image

Now all I have to do is add the registry keys to create my Soft NUMA nodes. At this point I realized that I had really complicated figuring out the hexadecimal CPUMask values. The DWORD registry value accepts the Decimal or Hexadecimal value in regedit, so you can skip converting the CPU BitMasks to hexadecimal, if you ever actually configure Soft NUMA for one of your servers.

image

After setting all of the registry keys for my instance, and then restarting my instance I went back in to sys.dm_os_nodes to make sure that my NUMA configuration changes were correctly made.

image

Everything looks perfect, I now have four Soft NUMA nodes with 6 schedulers in each node, so lets go look at our new Lazy Writer Threads in sys.dm_exec_requests:

image

This is only a subset of the system sessions that actually exist on the instance but trust me when I tell you, all of the Lazy Writer threads are shown above; there is only one.  So what happened here? I have four Soft NUMA nodes, we confirmed that so why don’t I have any additional Lazy Writer threads?  I was certain I had done something wrong here, so I sent tweet to Amit Banerjee (Blog|Twitter) a good friend of mine from the MSDN Forums who happens to be on the Product Support Team in India. It turns out that the Books Online is wrong about the benefits of using Soft NUMA, and this was pointed out a few years ago by Bob Dorr, another escalation engineer in Product Support on his blog post How It Works: Soft NUMA, I/O Completion Thread, Lazy Writer Workers and Memory Nodes. Two key points are made by Bob in this blog post; first that Lazy Writers are only assigned to Hard NUMA nodes, which map to SQLOS memory nodes, and second the IO Completion threads that are created per Soft NUMA node have nothing to do with Disk I/O, but are instead for connection requests and TDS traffic over the network.

Based on this, I filed a connect feedback for the incorrect entries in the Books Online for Soft NUMA that will eventually see this incorrect information corrected. At the point that I realized this, I had just over four hours of time dedicated to figuring out how to make something work that will never actually do what I expected based on the documentation, but I had a much better understanding of how NUMA actually worked inside of SQL Server, which is something that had up to this point been a big mystery to me.

This leaves open the question, when would you actually make use of Soft NUMA in SQL Server, if it doesn’t actually provide CPU to Memory masking like Hard NUMA does, or provide any benefits of having additional Lazy Writer Threads?  One use case would be to provide port affinity to drive specific connections to a specific node as discussed in the How It Works: SQL Server 2005 NUMA Basics blog post, also by Bob Dorr, but this only applies to serial execution plans and not to parallel execution plans as detailed by Bob in the NUMA Connection Affinity and Parallel Queries blog post. An example of where Soft NUMA usage is the ETL World Record set by SQL Server back in February 2008 which used Soft NUMA to balance the connection load as a part of achieving the world record of 1.18TB of flat file data loaded in 1794 seconds.

Categories:
Internals | NUMA

Lately I have been doing a lot of work on SQL Server’s that have had 24 or more processor cores installed in them. Many of these systems utilize hardware non-uniform memory architectures, or NUMA, while a few of them were not. This led me into a good bit of additional research into the differences between the two configurations. In this blog post, we’ll look at what NUMA is, and why it matters, and then in a separate blog post we’ll look at what this means to SQL Server and the differences between hard and soft NUMA configurations specifically for SQL.

Only a few years ago NUMA was synonymous with specialized hardware configurations that were designed to scale beyond the processor and RAM limitations of traditional Symmetric Multi-Processing (SMP) systems. The first NUMA system I learned anything about was the IBM x3950 series of servers that were used in a PASS Keynote Demo in 2009. The server being used in the demo was actually four x3950 servers that were interconnected as hardware NUMA nodes using the ScaleExpander Option Kit sold by IBM. Each of the hardware NUMA nodes was a very basic 4-way SMP system with its own processors, memory and I/O devices that could have also operated as a stand alone server. In a SMP configuration, all of the processors have equal access to the memory and I/O in the system as shown in the following diagram.

image

While SMP systems made it possible to scale up the number of processors available inside a single server, as more processors were added to the system the processor bus became a limitation to the overall system performance. In the past systems like the IBM x3950 series of servers were able to scale up the number of processors and amount of RAM available in a single solution by interconnecting multiple SMP systems together with a high speed interconnect that allowed two or more systems to function as if they were a single system. However, one of the problems associated with connecting multiple nodes with an interconnect was the memory access between the processors in one node to the memory in another node was not uniform; hence the name Non-Uniform Memory Architecture. The interconnect between the two systems introduced latency for the memory access across nodes. The traditional concept of NUMA using multiple interconnected SMP systems as NUMA nodes would conceptually resemble the following:

image

Technically this is more complicated, but the general concept is that each of the nodes has access to the memory available in the total system through the interconnect, but at a slight penalty performance wise. For this reason special considerations had to be taken when writing software to run on NUMA configured systems and the Windows Server OS and SQL Server both have NUMA optimizations built into their code base to optimize memory locality. While in the past most people that were using hardware NUMA configurations would have known about it, recent changes in processor architectures have made hardware NUMA configurations much more common place and most people I’ve talked to don’t even realize that their new server utilizes hardware NUMA. 

AMD Hyper-Transport (HT)

Where the older SMP architecture had a separate memory controller in the Northbridge of the motherboard, newer systems have an integrated memory controller built into the processor itself, and each processor has its own memory bank. The first processors to introduce an integrated memory controller were the AMD Opteron series of processors in early 2003. Since the introduction of the Opteron series, other AMD processor lines have included an integrated memory controller as well. Just as with traditional NUMA configurations, the hardware NUMA created by each processor having an integrated memory controller requires an interconnect to exist between the processors to allow for memory access by the other processors in the system. AMD processors share memory access through Hyper-Transport (HT) links between the processors.

image

The HT interconnects between processors allows for the remote memory access in one CPU’s memory bank by the processor cores in a different CPU, but there is a cost for accessing this remote data. To determine the most recent copy of the data probes must be sent out to all of the CPU’s to determine if the current CPU has the most recent version of the data, and this increases latency. To combat this, AMD introduced HT Assist in the Istanbul line of processors which reduces the traffic by tracking data in the processor caches to help direct the processors to the correct location of the most recent copy of the data. This reduces the bus traffic and increases the efficiency of the CPU. 

Intel Quick-Path Interconnect (QPI)

As the number of cores on a single processor increased, Intel followed AMD and introduced an integrated memory controller on its Nehalem-EX series of processors that share memory access through Intel’s Quick-Path Interconnect (QPI). The Intel QPI interconnects the processors with each other in a similar manner to the AMD HT interconnect. However, the QPI snoop based cache forwarding implementation can return remote data in as little as 2 hops. (It should be noted that this in no way is an endorsement of Intel over AMD as there may be similar optimizations in AMD processors that I have not yet read about.) 

image

Intel offers a really detailed demo of their QPI architecture on their site that really helps understanding the differences between SMP and NUMA configurations (http://www.intel.com/technology/quickpath/demo/demo.htm).

Node Interleaving

On newer server systems based on processors that have integrated memory controllers, the option still exists to run the server in an SMP like configuration known as Node Interleaving. This option is generally configurable in the system Bios and causes the system to be treated as if it were a traditional SMP system by the Operating System and any applications that are NUMA aware and optimized. When node interleaving is enabled, the system becomes a Sufficiently Uniform Memory Architecture (SUMA) configuration where the memory is broken into 4KB addressable regions and mapped from each of the nodes in a round robin fashion so that the memory address space is distributed across the nodes. When the system is configured for NUMA, the memory from each node is mapped into a single sequential block of memory address space for all of the memory in each node. For certain workloads, node interleaving can be beneficial to performance since the latency between the hardware nodes of a system using processors with integrated memory controllers is small. However, for applications that are NUMA aware and that have NUMA optimizations, node interleaving should remain disabled to allow the application to make optimized use of the hardware resources.

Why does this matter?

I’ve been working with hardware NUMA servers, using the newer processors with integrated memory controllers, for the last two years, and not once did I really have to worry about whether the server was using NUMA or not. Microsoft has done a lot of work in how they built Windows Server and SQL Server to make working with NUMA configurations relatively pain free. However, recently I ran into some issues while working with client systems that made me have to look deeper into NUMA to gain an understanding of how it affected SQL Server, and what options were available for configuring SQL Server for NUMA. I’ll discuss the specifics of these in my next blog post in a few days.

Theme design by Nukeation based on Jelle Druyts