T-SQL Tuesday #119: Changing Your Mind

This post is a response to this month’s T-SQL Tuesday #119 prompt by Alex Yates. T-SQL Tuesday is a way for the SQL Server Community to share ideas about different database and professional topics every month. This month’s topic asks us to write about something in our IT careers that we have changed our minds about. What was your original opinion? Why did you believe that? What do you believe now? Why did you change your mind?



Introduction

I’ve been writing and presenting about SQL Server hardware for many years now. I actually wrote a book called SQL Server Hardware back in 2011, which is probably the only book about server hardware written from a SQL Server perspective. During most of my coverage of SQL Server hardware-related topics, I have been a big proponent of using Intel processors for SQL Server workloads. I even used to joke during some of my presentations that AMD probably had a contract out on me, because I gave such negative opinions about their processors for SQL Server usage. There were valid reasons why I had a negative opinion of AMD processors, though.

Why I Used to Only Recommend Intel Processors

After the introduction of the 45nm Intel Nehalem microarchitecture in 2008, Intel was completely dominant compared to AMD when it came to single-threaded CPU performance for server processors. Intel was also successfully executing their Tick-Tock release cadence, where they released a new CPU microarchitecture roughly every two years (a Tock) and then introduced a manufacturing process shrink on the same basic microarchitecture (with minor improvements) the following year (a Tick).

This Tick-Tock release cycle made it easier to plan when to push for a hardware upgrade or a complete data platform refresh. Figure 1 shows the Intel Server Processor Family Tree from 2008 to the present. The Tick-Tock model was in place from 2007 until early 2016. During this period, Intel captured over 99% of the overall x86 server CPU market, with many server vendors discontinuing their AMD-based server models. The situation was so bad that Microsoft offered a 25% license cost discount if you used qualifying AMD processors with BizTalk Server and SQL Server 2012 or 2014, in a document called the Core Factor Table.


Figure 1: Intel Server Processor Family Tree


When SQL Server 2012 was released on April 1, 2012, it brought with it a move to core-based licensing for SQL Server, which replaced the old processor-based licensing that was used in previous versions of SQL Server. This licensing change made it very important to do some careful analysis before selecting the exact processor(s) for a database server or virtualization host that was going to be running SQL Server VMs.

This licensing change by Microsoft happened shortly after the introduction of the ill-fated 32nm AMD Bulldozer microarchitecture, which featured up to 16 physical cores in the Opteron 6200 series. Unfortunately, these AMD processors had very poor single-threaded performance and high power usage to go with their high SQL Server licensing costs. This was not a good combination, hence the Microsoft Core Factor Table.

For licensing purposes, Microsoft didn’t care whether you had a fast processor core or a slow processor core; the license price per core was exactly the same. Knowing this, a smart database professional could purposely select the fastest processor SKU at a given core count, and could also favor a lower core count (but faster) processor SKU, in order to get the most performance per core and to minimize SQL Server core license costs.

If you did this properly, it was pretty easy to save many tens of thousands of dollars on SQL Server core license costs on a two-socket server, which would more than offset the hardware cost of a typical two-socket database server. Using a frequency-optimized processor SKU would also deliver better single-threaded CPU performance and more memory bandwidth than an entry-level processor SKU at the same core count.
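To make that arithmetic concrete (using an assumed list price of roughly $7,000 per core for SQL Server Enterprise Edition, purely for illustration): licensing a two-socket server with a pair of entry-level 12-core processors means paying for 24 cores, or about $168,000, while a pair of frequency-optimized 8-core processors means paying for 16 cores, or about $112,000. That is roughly $56,000 in license savings, typically more than the entire hardware cost of the server, and the faster 8-core SKUs usually have higher base and turbo clock speeds as well.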

Why I Now Recommend You Seriously Consider AMD Processors

For Opteron server processors, AMD was stuck with the Bulldozer microarchitecture (and minor improvements with the newer Piledriver and Excavator microarchitectures) until the release of the completely new Zen architecture in 2017. During this period, Intel continued to completely dominate the server CPU market, with no meaningful competition from AMD, except on the low-end. As often happens with a lack of competition, the market leader became complacent over time, to the point where AMD was able to catch up and actually surpass Intel in many areas. As a result, AMD is starting to regain some market share in the server CPU market.

Intel has badly stumbled over the past 4-5 years as they have tried to move from a 14nm manufacturing process to a 10nm manufacturing process. This has forced them to abandon the old Tick-Tock release cycle, and it has also led to a longer product cycle overall.

This has been accompanied by a very noticeable reduction in generational performance increases since Broadwell-EP, as shown in Figure 2. These numbers are estimated TPC-E scores for a two-socket server with two eight-core processors, using the fastest eight-core processor from each generation. Based on this, we have seen an extremely small increase in performance over the last four years.


Figure 2: Generational Intel Xeon Processor Performance Increases


Intel Security Vulnerabilities

Intel has also had to deal with multiple processor security vulnerabilities. These include Spectre, Meltdown, Foreshadow, and their variants, along with newer exploits such as ZombieLoad. The latest CPU security vulnerability I have heard of is NetCAT, which only affects Intel processors.

Generally speaking, modern Intel processors are more vulnerable to these types of attacks than modern AMD processors are. The required fixes for these vulnerabilities also have a larger negative effect on performance for Intel processors than for AMD processors, especially since AMD processors are not affected by many of these exploits. Microsoft’s current SQL Server-specific guidance on this subject is here.

On August 7, 2019, AMD finally unveiled their new 7nm EPYC 7002 Series of server processors, formerly code-named “Rome,” at the AMD EPYC Horizon Event in San Francisco. This is the second-generation EPYC server processor, and it uses the same Zen 2 architecture as the AMD Ryzen 3000 Series desktop processors. These new processors are socket-compatible with the previous generation 14nm AMD EPYC 7001 Series processors, so they will work in existing server models (with a BIOS update). However, you will need a new server model to get PCIe 4.0 support from the newer processors. The major server vendors, such as Dell EMC and HPE, have already released new server models that let you fully leverage the new AMD EPYC 7002 Series processors.

AMD is not slowing the pace of innovation, since they are on track to release the Zen 3-based “Milan” server processors in Q3/Q4 of 2020. These will have IPC improvements, better L3 cache performance, and probably higher clock speeds.


Figure 3: AMD EPYC 7000 Series Roadmap


Conclusion

I have previously written about why the AMD EPYC 7002 Series processors are going to be significant for SQL Server. The 7nm EPYC 7002 series has higher memory density and capacity, higher memory speed and bandwidth, more PCIe bandwidth, PCIe 4.0 support, and much lower pricing than 14nm Intel Cascade Lake-SP processors.

A one-socket AMD server will be a superior replacement for many legacy two-socket Intel servers, while a two-socket AMD server will also be a superior replacement for many legacy four-socket Intel servers. I think we will see AMD’s server CPU market share go up to 10-15% over the next 12-18 months. We may also see more aggressive pricing from Intel as a result (which has already happened with the new Intel Cascade Lake-X HEDT and Xeon W workstation processors).

I think an AMD platform is a viable choice for many SQL Server workloads, so I have changed my mind compared to what I thought in the past.


T-SQL Tuesday #118: My Fantasy SQL Server Feature

This post is a response to this month’s T-SQL Tuesday #118 prompt by Kevin Chant.  T-SQL Tuesday is a way for the SQL Server Community to share ideas about different database and professional topics every month. This month’s topic asks about our fantasy SQL Server feature.


Introduction

It may come as no surprise, but my fantasy SQL Server feature is related to hardware and storage. This is a general idea that I have had for many years, and one that I have brought up informally with some fairly senior people at Microsoft in the past.

Essentially, I think it would be very useful if SQL Server had some sort of internal benchmarking/profiling utility that could be run so that SQL Server could measure the relative and actual performance of the hardware and storage that it was running on. Then, the results of these tests could be used to help the SQL Server Query Optimizer make better decisions about what sort of query plan to use based on that information.

For example, depending on the actual measured performance of a processor core (and the entire physical processor) from different perspectives, such as integer performance, floating point performance, AVX performance, etc., it might make more sense to favor one query operator over another for certain types of operations. Similarly, knowing the relative and actual performance of the L1, L2, and L3 caches in a processor might be useful in the same way.

Going deeper into the system, knowing the relative and actual performance of the DRAM main memory (and PMEM) in terms of latency and throughput seems like it would be useful information for the Query Optimizer to have. Finally, understanding the relative and actual performance of the storage subsystem in terms of latency, IOPS, and sequential throughput would probably be pretty useful in some situations.
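To make the idea a little more concrete, here is a minimal sketch of how such measured results might be surfaced. To be clear, this DMV does not exist today; the name and columns below are purely hypothetical, invented just for illustration.

-- Hypothetical DMV (invented for illustration, not an actual SQL Server object)
-- exposing measured hardware performance that the Query Optimizer could consult
SELECT component_name,              -- e.g. 'CPU integer', 'L3 cache', 'DRAM', 'Log drive'
       measured_latency_ns,         -- measured latency for that component
       measured_throughput_mb_sec,  -- measured sequential throughput
       relative_score               -- normalized score, similar in spirit to a WEI score
FROM sys.dm_server_performance_profile;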

Windows Experience Index

A historical, consumer-facing example was the old Windows Experience Index in Windows 7, which would run a quick series of benchmarks to measure the processor, memory, graphics, gaming graphics, and primary hard disk performance of a system. The resulting scores (in Windows 7) could range from 1.0 to 7.9. The scores for my old, 2012-vintage Intel Core i7-3770K desktop system are shown in Figure 1. The purpose of these scores (beyond bragging rights) was to help consumers make better decisions about possible upgrade choices for different system components, or simply to understand the biggest bottlenecks in their existing system. It was also used as a quick way to compare the relative performance of different systems in a store.


Figure 1: Windows Experience Index Scores on Windows 7


Azure Experience Index

My idea is to have something similar from a SQL Server perspective, that could optionally be used by the SQL Server Query Optimizer (and any other part of SQL Server) to make better decisions based on the actual, measured performance of the key components of the system it is running on. This would be useful no matter how SQL Server was deployed, whether it was running bare-metal on-premises, in an on-premises VM, in a Container, in an Azure VM, in Azure SQL Database, or in SQL Managed Instance. It would also work in any other cloud environment, and on any supported operating system for SQL Server.

Despite what you might hear, the details of the actual hardware and storage that your SQL Server deployment is running on make a significant difference to the performance and scalability you are going to see. Low-level, geeky things matter: the exact processor (and all of its performance characteristics), the exact DRAM modules (and their number and placement), the NUMA layout, the exact type and configuration of your storage, your BIOS/UEFI settings, your hypervisor settings, and so on. There are so many possible layers and configuration options in a modern system that it can be quite overwhelming.

Based on source code-level knowledge of what SQL Server does in general and how the Query Optimizer works, along with all of the performance telemetry that is collected by Azure SQL Database, Microsoft should be able to determine some relationships and tendencies that tie the actual measured performance of the main components of a system (as it is currently configured as a whole) to common SQL Server query operations.

For example, as an imaginary possibility, perhaps the performance of a hash match is closely related to integer CPU performance, along with L1, L2, and L3 cache latency, and DRAM latency and throughput. Different systems, based on differences in the exact CPU, BIOS settings, memory type and speed, etc., might have significantly different performance for a hash match, to the point that the Query Optimizer would want to take that into account when choosing the operators for a query plan. Perhaps it could be called Hyper-Adaptive Cognitive Query Processing…

Even if that level of tuning wasn’t immediately possible, having a deeper understanding of what specific performance characteristics were the most critical to common query operations for different query workloads would help Microsoft make better decisions on what processor, memory, and storage specifications and configuration settings work the best together for different workloads. This could potentially save huge amounts of money in Azure Data Centers. Microsoft can custom design/build hardware as part of the Open Compute Project, and they can get custom server processor SKUs with their desired performance characteristics from AMD and Intel to take advantage of this type of knowledge. They can also configure everything else in the system just as they desire for a particular workload.

Obviously, this is a complicated idea that would take significant resources to develop. I’m sure that Microsoft has other development priorities, but this is my fantasy feature, and I’m sticking with my story.


T-SQL Tuesday #104: Code I Have Written That I Would Hate to Live Without

Bert Wagner (b/t) is hosting T-SQL Tuesday #104. The invitation is to write about code you’ve written that you would hate to live without. For me, this is almost a no-brainer!

My DMV Diagnostic Queries represent a lot of code that I would hate to live without. I use them on a daily basis to gather information about SQL Server instances and databases and to help more quickly understand what configuration and performance issues they have. I’ve been publicly posting these queries since 2009, but I actually started developing them for my own personal use back in 2006. The story about how they came about is kind of interesting…
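To give a flavor of what these queries look like, here is a simplified sketch in the same spirit (not one of the published queries verbatim): it lists the top cached statements by total CPU time, using sys.dm_exec_query_stats and sys.dm_exec_sql_text.

-- Top cached statements by total CPU time (total_worker_time is in microseconds)
SELECT TOP (10)
       qs.total_worker_time / 1000 AS total_cpu_ms,
       qs.execution_count,
       SUBSTRING(st.text, (qs.statement_start_offset / 2) + 1,
                 ((CASE qs.statement_end_offset
                        WHEN -1 THEN DATALENGTH(st.text)
                        ELSE qs.statement_end_offset
                   END - qs.statement_start_offset) / 2) + 1) AS statement_text
FROM sys.dm_exec_query_stats AS qs
CROSS APPLY sys.dm_exec_sql_text(qs.sql_handle) AS st
ORDER BY qs.total_worker_time DESC;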

Back in about August of 2006, I was the sole DBA for NewsGator, which was (at that time) an RSS aggregation company. Our main product/service let people “subscribe” to the RSS feeds of web sites and blogs, and then have us download those feeds for them. We would also manage the “read state” of the RSS feeds that you subscribed to, so that as you read through your subscribed content and marked posts as read, we would synchronize your progress across different devices.

I had only been at the company about three months, and we had recently migrated from 32-bit SQL Server 2000 to 64-bit SQL Server 2005 SP1 on a two-node FCI running on new hardware. Performance had been pretty good since the migration. It was about 4:30 PM on a Friday afternoon when I started making some final, quick checks of the health of my instance before getting ready to leave for the weekend.

I noticed that my CPU utilization was running at about 90-95%, which was much higher than normal. I tried a few of my standard DBA tricks (at that time) to correct the issue, such as running sp_updatestats and DBCC FREEPROCCACHE, with no real improvement. I even took the emergency step of “shutting down” the content servers (the application servers that downloaded the RSS feeds, which typically generated about 90% of my database load). This had no appreciable effect on my CPU utilization.
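For reference, those “standard tricks” boil down to two documented commands, shown here as a quick sketch (clearing the plan cache is a heavy hammer on a busy production system):

EXEC sys.sp_updatestats;   -- refresh out-of-date statistics across the current database
DBCC FREEPROCCACHE;        -- flush the plan cache so queries get freshly compiled plans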

By now, I was getting worried, since we had a problem that I did not immediately know how to diagnose and correct. By this time, our support team and many of the senior executives in the company were aware that we had a problem because our applications were starting to time out and throw errors. I had a literal parade of different people coming to my desk asking some variation/combination of “What’s wrong with the database?” or “What can we do to help?”.

This got so bad that the CTO/Founder of NewsGator (Greg Reinaker) grabbed a large rolling whiteboard, and wrote something like “Glenn knows there is a problem. He is working on it. Please leave him alone”, which was actually pretty helpful.

So after some time, it ended up being just me and the best developer on the Platform Team (Jeff Tingley), staying late into the night and the next morning on a call with Microsoft Premier Support, trying to diagnose and troubleshoot the issue. Eventually, we figured out that our problem was mainly caused by parameter sniffing in one stored procedure, where one very inefficient plan was getting stuck in the plan cache.

The short-term fix was to copy the input parameter into a local variable inside that stored procedure (which disables parameter sniffing), and to periodically recompile a few other stored procedures that were also part of the problem. Jeff and I finally left around 3 AM, with the system being relatively stable. I was exhausted from the time and the stress of feeling like the fate of the company rested on my shoulders. I was convinced that I was in big trouble and was possibly going to be fired, since it had taken us so long to figure out the problem. Little did I know…
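As an illustration of that local-variable pattern (the procedure and table names here are made up, not NewsGator’s actual code), the workaround looks something like this:

CREATE PROCEDURE dbo.GetSubscribedPosts
    @UserID int
AS
BEGIN
    -- Copy the parameter into a local variable; the optimizer cannot sniff the value of
    -- a local variable, so the plan is compiled from average density statistics instead
    -- of being tailored to (and stuck with) the first caller's parameter value.
    DECLARE @LocalUserID int;
    SET @LocalUserID = @UserID;

    SELECT PostID, Title, PublishedDate
    FROM dbo.Posts
    WHERE UserID = @LocalUserID;
END;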

As it turned out, my boss’s boss decided to give both Jeff and me a $500.00 bonus, plus we got a big round of applause at a company meeting the next Monday (which I appreciated much more).

This incident was the genesis of my DMV Diagnostic queries. I never wanted to be in that situation again! Anytime there was any application slowdown, people always assumed that it was a database problem (which was not always the case). Having a set of queries that I could run to figure out what was going on with the database and database server was the key to being able to answer the “What’s wrong with the database?” question.

Many thousands of people around the world use my queries on a regular basis, and seem to find them very useful, at least based on the feedback I have gotten over the years. Now you know the story of how they came into being.

Figure 1: Link to Invitation Post