Friday of last week, Steve Jones (Blog|Twitter) asked a question on the Twitter #sqlhelp tag about SQL Server and Soft NUMA that prompted me to write the long waiting follow up to my first post Understanding Non-Uniform Memory Access/Architectures (NUMA). The question that Steve asked was:
@way0utwest: “Do you always have one soft NUMA node? Say a one CPU PC or SMP architecture? #sqlhelp”
The whole start of my real work looking into the different NUMA configurations was actually around how to configure SQL Server for Soft NUMA so that I would see the benefits that are mentioned in two of the topics in the Books Online. The Understanding Non-uniform Memory Access topic states (or at least at the time I wrote this it stated):
"The benefits of soft-NUMA include reducing I/O and lazy writer bottlenecks on computers with many CPUs and no hardware NUMA. There is a single I/O thread and a single lazy writer thread for each NUMA node. Depending on the usage of the database, these single threads may be a significant performance bottleneck. Configuring four soft-NUMA nodes provides four I/O threads and four lazy writer threads, which could increase performance."
I remembered reading something along these lines after recently looking at a number of client servers that had 24 cores and up to 128GB RAM installed in them but were not hardware NUMA configurations. Having the extra lazy writer threads sounded like a good thing so I went about trying to figure out how to setup Soft NUMA which lead me to the BOL Topic How to: Configure SQL Server to Use Soft-NUMA which provided the following scenario:
“Instance A, which experiences significant I/O, now has two I/O threads and two lazy writer threads, while instance B, which performs processor-intensive operations, has only one I/O thread and one lazy writer thread. "
Perfect this is exactly what I was thinking, so now off to do some testing on one of my VM’s, and this is where I hit my first bump. I needed to figure out the appropriate CPUMask values for a 24 core system to subdivide it into four Soft NUMA nodes of 6 cores each, but the Books Online, and the available blog posts for the topic don’t make that very easy to accomplish and they don’t explain the basis of the CPUMask very well either. I was able to figure out with some testing, that the CPUMask is just a bitmask of the CPUMask values of the CPU’s that contribute to Soft NUMA node. Clear as mud right?
To make this a bit easier, I am going to paste some tables that I created in Excel below that demonstrate what I am talking about, but for a 8 core example and not the 24 core solution I was originally working on. The first thing you have to keep in mind is that CPU’s are numbered with a CPUID starting with a base of zero, and each CPU has its own mask value that is two to the power of the CPUID. For the example in the How to: Configure SQL Server to Use Soft-NUMA Books Online Topic, this works out logically to the table below:
|CPUID||Mask||Soft Node||CPU BitMask||Node Registry CPUMask|
The CPU Mask Sum column is the sum of the Masks for each of the CPU’s in the Soft Node and then the Node Registry CPUMask is the hexadecimal representation of the decimal value. This can be obtained using the DEC2HEX() function in Excel, or by placing the Windows Calculator into Programmer Mode from the File Menu.
So now that we have an easier to understand example covered, here is what the CPUMask for a 24 core server with four 6 core Soft NUMA Nodes would be:
|CPUID||Mask||Soft Node||CPU BitMask||Node Registry CPUMask|
To test this configuration out, I went to one of my SQL Server 2008R2 VM’s on my laptop and made use of the –Pn startup trace flag which tells SQL Server to create n schedulers during the SQLOS boot up even if the actual hardware does not exist for those schedulers. This is an undocumented startup parameter and it should NEVER BE USED in a production server. Just because the SQLOS creates additional schedulers, doesn’t mean that you physically have additional processors in the server, and if you want to see how bad this startup parameter can cause performance to drag, create 24 schedulers on your laptop and then run a couple of concurrent queries in SQL Server. After adding ;-P24 to startup parameters and restarting my SQL instance I now have 24 schedulers in sys.dm_os_schedulers as shown below:
Now all I have to do is add the registry keys to create my Soft NUMA nodes. At this point I realized that I had really complicated figuring out the hexadecimal CPUMask values. The DWORD registry value accepts the Decimal or Hexadecimal value in regedit, so you can skip converting the CPU BitMasks to hexadecimal, if you ever actually configure Soft NUMA for one of your servers.
After setting all of the registry keys for my instance, and then restarting my instance I went back in to sys.dm_os_nodes to make sure that my NUMA configuration changes were correctly made.
Everything looks perfect, I now have four Soft NUMA nodes with 6 schedulers in each node, so lets go look at our new Lazy Writer Threads in sys.dm_exec_requests:
This is only a subset of the system sessions that actually exist on the instance but trust me when I tell you, all of the Lazy Writer threads are shown above; there is only one. So what happened here? I have four Soft NUMA nodes, we confirmed that so why don’t I have any additional Lazy Writer threads? I was certain I had done something wrong here, so I sent tweet to Amit Banerjee (Blog|Twitter) a good friend of mine from the MSDN Forums who happens to be on the Product Support Team in India. It turns out that the Books Online is wrong about the benefits of using Soft NUMA, and this was pointed out a few years ago by Bob Dorr, another escalation engineer in Product Support on his blog post How It Works: Soft NUMA, I/O Completion Thread, Lazy Writer Workers and Memory Nodes. Two key points are made by Bob in this blog post; first that Lazy Writers are only assigned to Hard NUMA nodes, which map to SQLOS memory nodes, and second the IO Completion threads that are created per Soft NUMA node have nothing to do with Disk I/O, but are instead for connection requests and TDS traffic over the network.
Based on this, I filed a connect feedback for the incorrect entries in the Books Online for Soft NUMA that will eventually see this incorrect information corrected. At the point that I realized this, I had just over four hours of time dedicated to figuring out how to make something work that will never actually do what I expected based on the documentation, but I had a much better understanding of how NUMA actually worked inside of SQL Server, which is something that had up to this point been a big mystery to me.
This leaves open the question, when would you actually make use of Soft NUMA in SQL Server, if it doesn’t actually provide CPU to Memory masking like Hard NUMA does, or provide any benefits of having additional Lazy Writer Threads? One use case would be to provide port affinity to drive specific connections to a specific node as discussed in the How It Works: SQL Server 2005 NUMA Basics blog post, also by Bob Dorr, but this only applies to serial execution plans and not to parallel execution plans as detailed by Bob in the NUMA Connection Affinity and Parallel Queries blog post. An example of where Soft NUMA usage is the ETL World Record set by SQL Server back in February 2008 which used Soft NUMA to balance the connection load as a part of achieving the world record of 1.18TB of flat file data loaded in 1794 seconds.