SQLskills SQL101: Why are Statistics so Important?

In my years working with SQL Server, I’ve found there are a few topics that are often ignored. Ignored because people fear them; they think they’re harder than they actually are OR they think they’re not important. Sometimes they even think – I don’t need to know that because SQL Server “does it for me.” I’ve heard this about indexes. I’ve heard this about statistics.

So, let’s talk about why statistics are so important and why a minimal amount of knowledge of their importance can make a HUGE difference to your query’s performance.

There are SOME aspects to statistics that ARE handled automatically:

  • SQL Server automatically creates statistics for indexes
  • SQL Server automatically creates statistics for columns – when SQL Server NEEDS more information to optimize your query
    • IMPORTANT: This only occurs when the database option auto_create_statistics is ON; this option is on by default in ALL versions/editions of SQL Server. However, there are some products / guidance out there that recommend that you turn this off. I completely DISAGREE with this. (A quick way to check the option, and what's been created, is sketched right after this list.)
  • SQL Server automatically updates the statistical information kept for indexes/columns for a statistic that’s been “invalidated” – when SQL Server NEEDS it to optimize your query
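
If you're curious what's been handled automatically for you, a quick sketch like this (using the WideWorldImporters sample database that the rest of this post uses) shows the database option and any statistics that already exist on the Customers table:

USE [WideWorldImporters];
GO

SELECT [name], [is_auto_create_stats_on], [is_auto_update_stats_on]
FROM sys.databases
WHERE [name] = N'WideWorldImporters';
GO

SELECT OBJECT_NAME ([s].[object_id]) AS [table_name],
    [s].[name] AS [statistics_name],
    [s].[auto_created],
    [s].[user_created]
FROM sys.stats AS [s]
WHERE [s].[object_id] = OBJECT_ID (N'[Sales].[Customers]');
GO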

However, it’s important to know that while some aspects of statistics ARE handled automatically, they’re not perfect. And, SQL Server often needs some help with certain data patterns and data sets – to really optimize effectively. To understand this, I want to talk a bit about accessing data and the process of optimization.

Accessing Data

When you submit a request to SQL Server to access data, you usually write a Transact-SQL statement – in the form of an actual SELECT or possibly execute a stored procedure (and, yes, there are other options). However, the main point is that you’re requesting a data set – not HOW that data set should be found/processed. So, how does SQL Server “get” to the data?

Processing Data

Statistics help SQL Server with the gathering / processing of the data. They help by giving SQL Server insight into how much data will be accessed. If a small amount of data is going to be accessed then the processing might be easier (and require a different process) than if the query were to process millions of rows.

Specifically, SQL Server uses a cost-based optimizer. There are other options for optimization but cost-based optimizers tend to be the most commonly used today. Why? Cost-based optimizers use specific information about the data – to come up with more effective, more optimal, and more focused plans – specific to the data. Generally, this process is ideal. However, with plans that are saved for subsequent executions (cached plans), it might not always be ideal; most other options for optimization can suffer from even worse problems.

IMPORTANT: I’m not talking about cached plans here… I’m talking about the initial process of optimization and how SQL Server determines how much data will be accessed. The subsequent execution of a cached plan brings additional potential problems with it (known as parameter sniffing problems or parameter-sensitive plan problems). I talk a lot about that in other posts.

To explain cost-based optimization, let me cover other possible forms of optimization; this will help to explain a lot of the benefits of cost-based optimization.

Syntax-based Optimization

SQL Server could use just your syntax to process your query. Instead of taking time to determine the best possible processing order for your tables, an optimizer could just join your tables in the order they’re listed in your FROM clause. While it might be fast to START this process, the actual gathering of the data might not be ideal. In general, joining larger tables to smaller tables is a lot less optimal than joining smaller tables to larger. To give you a quick view of this, check out these two statements:

USE [WideWorldImporters];
GO

SET STATISTICS IO ON;
GO

SELECT [so].*, [li].*
FROM [sales].[Orders] AS [so]
JOIN [sales].[OrderLines] AS [li]
ON [so].[OrderID] = [li].[OrderID]
WHERE [so].[CustomerID] = 832 AND [so].[SalespersonPersonID] = 2
OPTION (FORCE ORDER);
GO

SELECT [so].*, [li].*
FROM [sales].[OrderLines] AS [li]
JOIN [sales].[Orders] AS [so]
ON [so].[OrderID] = [li].[OrderID]
WHERE [so].[CustomerID] = 832 AND [so].[SalespersonPersonID] = 2
OPTION (FORCE ORDER);
GO

Review the costs of the two plans side-by-side; you can see that the second is more expensive.

Costs of the same query with FORCE ORDER and only the two tables in the FROM clause reversed between the 1st and 2nd execution

Yes, this is an oversimplification of join optimization techniques but the point is that it’s unlikely that the order you’ve listed your tables in the FROM clause is the most optimal. The good news though – if you’re having performance problems, you CAN force syntax-based optimization techniques such as joining in the order your tables are listed (as in the example). And, there are quite a few other possible “hints” that you can use:

  • QUERY hints (forcing level of parallelism [MAXDOP], forcing SQL Server to optimize for the FIRST row – not the overall set with FAST n, etc…)
  • TABLE hints (forcing an index to be used [INDEX], forcing seeks, forcing an index join [index intersection])
  • JOIN hints (LOOP / MERGE / HASH)
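
Just to make the syntax concrete, here's a rough sketch against the same WideWorldImporters tables from above, showing where each kind of hint goes:

SELECT [so].*, [li].*
FROM [Sales].[Orders] AS [so] WITH (INDEX (1))     -- TABLE hint: force the clustered index
INNER LOOP JOIN [Sales].[OrderLines] AS [li]       -- JOIN hint: force a nested loop join
ON [so].[OrderID] = [li].[OrderID]
WHERE [so].[CustomerID] = 832 AND [so].[SalespersonPersonID] = 2
OPTION (MAXDOP 1, FAST 10);                        -- QUERY hints: limit parallelism, optimize for the first 10 rows
GO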

Having said that, these should ALWAYS be a last resort IMO. I would try quite a few other things before I’d FORCE anything in production code. By forcing one (or, more) of these hints, you don’t allow SQL Server to further optimize your queries as you add/update indexes, as you add/update statistics, or as you update/upgrade SQL Server. Unless, of course, you’ve documented these hints and you test to make sure they’re still the best option for your query after maintenance routines, routinely after data modifications, and after updates/upgrades to SQL Server. To be honest, I don’t see this done as often as it should be. Most hints are added as if they’re perfect and left until later problems come up – even when they’re possibly floundering inefficiently for days, weeks, months, etc…

Understand statistics and stop letting your queries flounder!
(sorry, I had to… it was my photo post for Sunday – so perfect!)

It’s always important to see if there’s something ELSE that you can do to get better performance BEFORE FORCING a query to do ONE THING. Yes, forcing might be easy NOW but a small amount of time checking some of these other things might be well worth it in the long run.

For plan problems that are specifically from THIS query and THESE values (not from the execution of a cached plan), some things I might try are:

  • Are the statistics accurate / up-to-date? Does updating the statistics fix the issue?
  • Are the statistics based on a sampling? Does FULLSCAN fix the issue?
  • Can you re-write the query and get a better plan?
    • Are there any search arguments that aren’t well-formed? Columns should always be isolated to one side of an expression…
      • WRITE: MonthlySalary > expression / 12
      • NOT: MonthlySalary * 12 > expression
    • Are there any transitivity issues you can help SQL Server with? This is an interesting one and becomes more problematic across more and more tables. However, adding a seemingly redundant condition can help the optimizer to do something it hadn’t considered with the original query (see, the added condition below):

FROM table1 AS t1
JOIN table2 AS t2
    ON t1.colX = t2.colX
WHERE t1.colX = 12 AND t2.colX = 12

  • Sometimes just changing from a join to a subquery or a subquery to a join – fixes the issue (no, one isn’t specifically/always better than the other but sometimes the re-write can help the optimizer to see something it didn’t see with the original version of the query)
  • Sometimes using derived tables (a join of specific tables in the FROM clause AS J1) can help the optimizer to more optimally join tables
  • Are there any OR conditions? Can you re-write to UNION or UNION ALL to get the same result (this is the most important) AND get a better plan? (See the sketch after this list.) Be careful though, these are semantically NOT the same query. You’ll need to understand the differences between all of these before you move to a different query.
    • OR removes duplicate rows (based on the row ID)
    • UNION removes duplicates based on JUST the SELECT LIST
    • UNION ALL concatenates sets (which can be A LOT faster than having to remove duplicates) but, returning duplicates might / might not be an issue:
      • Sometimes there are NO duplicates (KNOW your data)
      • Or, sometimes it might be OK to return duplicates (KNOW your user/audience/application)
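
To make the OR-to-UNION rewrite concrete, here's a rough sketch (I'm assuming a second postal code column, PostalPostalCode, purely for illustration):

-- Original, with OR:
SELECT [CustomerID], [CustomerName]
FROM [Sales].[Customers]
WHERE [DeliveryPostalCode] = N'90248'
    OR [PostalPostalCode] = N'90248';
GO

-- Re-written with UNION (duplicates removed based on JUST the SELECT list):
SELECT [CustomerID], [CustomerName]
FROM [Sales].[Customers]
WHERE [DeliveryPostalCode] = N'90248'
UNION
SELECT [CustomerID], [CustomerName]
FROM [Sales].[Customers]
WHERE [PostalPostalCode] = N'90248';
GO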

This is a short list but might help you get better performance WITHOUT using hints/forcing. That means, as subsequent changes occur (to the data, to the indexes/statistics, to the SQL Server version) you’ll also be able to benefit from those changes!

But, the coolest part is that these hints are there – IF we do have to use them! And, there ARE cases where the optimizer might not be able to come up with the most efficient plan; for those cases, you can force the process. So, yes, I love that they’re there. I just don’t want you to use them when you really haven’t determined what the actual root cause of the problem is…

So, yes, you can achieve syntax-based optimization…IFF you NEED it.

Rules-based Optimization (Heuristics)

I mentioned that cost-based optimization requires statistics. But what if you don’t have statistics?

Or, maybe a different way to look at it – why can’t SQL Server just use a “bunch of rules” to more-quickly optimize your queries without having to look at / analyze specific information about your data? Wouldn’t that be faster? SQL Server can do this – but, it’s often FAR from ideal. The best way to show you this is to NOT allow SQL Server to use statistics to process a query. I can show you a case where it actually works well and a MUCH MORE likely case where it doesn’t.

Heuristics are rules. Simple, STATIC / FIXED calculations. The fact that they’re simple is their benefit. No data to look at. Just a quick estimate based on the predicate. For example, less than and greater than have an internal rule of 30%. Simply put, when you run a query with a less than or greater than predicate and NO INFORMATION is known about the data in that column / predicate, SQL Server is going to use a RULE that 30% of the data will match. It will use this in its estimate / calculations and come up with a plan tied to this rule.

To get this to “work,” I have to first disable auto_create_statistics and check to see if there’s anything there (already) that could possibly help my query:

USE [WideWorldImporters];
GO

ALTER DATABASE [WideWorldImporters]
SET AUTO_CREATE_STATISTICS OFF;
GO

EXEC sp_helpindex '[sales].[Customers]';
EXEC sp_helpstats '[sales].[Customers]', 'all';
GO

Check the output from sp_helpindex and sp_helpstats. By default, WideWorldImporters does not have any indexes or statistics that lead with DeliveryPostalCode (or even have it as a member). If you’ve added one (or, SQL Server has auto-created one), you must drop it before running these next code samples.
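
If you're not sure whether one exists, a sketch like this finds any statistic that has DeliveryPostalCode as a member; the auto-generated _WA_Sys_… name shown commented below is just a placeholder, yours will differ:

SELECT [s].[name] AS [statistics_name], [s].[auto_created]
FROM sys.stats AS [s]
JOIN sys.stats_columns AS [sc]
    ON [s].[object_id] = [sc].[object_id]
    AND [s].[stats_id] = [sc].[stats_id]
JOIN sys.columns AS [c]
    ON [sc].[object_id] = [c].[object_id]
    AND [sc].[column_id] = [c].[column_id]
WHERE [s].[object_id] = OBJECT_ID (N'[Sales].[Customers]')
    AND [c].[name] = N'DeliveryPostalCode';
GO

-- Then, drop it by name, for example:
-- DROP STATISTICS [Sales].[Customers].[_WA_Sys_0000000B_12345678];
-- GO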

For this first query, we’ll supply a zip code of 90248. The query is using a less than predicate. Without any statistics and without the ability to auto-create any, what does SQL Server estimate for the number of rows?
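
(The statement is the same as the second query shown further below; only the value differs.)

SELECT [c1].[CustomerID],
    [c1].[CustomerName],
    [c1].[PostalCityID],
    [c1].[DeliveryPostalCode]
FROM [sales].[Customers] AS [c1]
WHERE [c1].[DeliveryPostalCode] < '90248';
GO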

Columns without statistics use heuristics…
Without engineering a “PERFECT” value, they’re going to be WRONG… most of the time!

For this first query, the estimate (30% of 663 = 198.9) works well, as the actual number of rows for the query is 197. One additional item to notice: there’s a warning symbol next to the table (Customers) as well as the left-most SELECT operator. These also tell you that something’s wrong. But, here – the estimate is “correct.”

For this second query, we’ll supply a zip code of 90003. The query is the exact same except for the value of the zipcode. Without any statistics and without the ability to auto-create any, what does SQL Server estimate for the number of rows?

SELECT [c1].[CustomerID],
    [c1].[CustomerName],
    [c1].[PostalCityID],
    [c1].[DeliveryPostalCode]
FROM [sales].[Customers] AS [c1]
WHERE [c1].[DeliveryPostalCode] < '90003';
GO

Columns without statistics use heuristics (simple rules). Often, they are very incorrect!

For the second query, the estimate is again 198.9 but the actual is only 1. Why? Because without statistics, the heuristic for less than (and, greater than) is 30%. Thirty percent of 663 is 198.9. This will change as the data changes but the percent for this predicate is fixed at 30%.

And, if this query were more complicated – with joins and/or additional predicates – having incorrect information is ALREADY a problem for the subsequent steps of optimization. Yes, you might occasionally get lucky and use a value that loosely matches the heuristic but MORE LIKELY, you won’t. And, while the heuristic for less than and greater than is 30%, it’s different for BETWEEN and different again for equality. In fact, some even change based on the cardinality estimation model you’re using (for example, equality (=) has changed). Do I even care? Actually, not really! I don’t really ever WANT to use them.

So, yes, SQL Server can use rules-based optimization…but, only when it doesn’t have better information.

I don’t want heuristics; I want statistics!

Statistics are one of the FEW places in SQL Server where having MORE of them can be beneficial. No, I’m not saying to create one for every column of your table but there are some cases where I might pre-create statistics. OK, that’s a post for another day!

So, WHY are statistics so important for cost-based optimization?

Cost-based Optimization

What does cost-based optimization actually do? Simply put, SQL Server QUICKLY gets a rough estimate of how much data will be processed. Then, using this information, it estimates the costs of different possible algorithms that could be used to access the data. Then, based on the “costs” of these algorithms, SQL Server chooses the one that it’s calculated to be the least expensive. Then, it compiles it and executes it.

This sounds great; but there are a lot of factors as to why this may or may not work well.

Most importantly, the base information used to perform the estimations (statistics) can be flawed.

  • They could be out of date
  • They could be less accurate due to sampling
  • They could be less accurate because of the table size and the limitations of the statistics “blob” that SQL Server creates
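
To check the first two of these, here's a quick sketch against the same WideWorldImporters Orders table; and, if sampling looks like the problem, UPDATE STATISTICS with FULLSCAN is the heavy hammer:

USE [WideWorldImporters];
GO

SELECT [s].[name] AS [statistics_name],
    [sp].[last_updated],
    [sp].[rows],
    [sp].[rows_sampled],
    [sp].[modification_counter]
FROM sys.stats AS [s]
CROSS APPLY sys.dm_db_stats_properties ([s].[object_id], [s].[stats_id]) AS [sp]
WHERE [s].[object_id] = OBJECT_ID (N'[Sales].[Orders]');
GO

UPDATE STATISTICS [Sales].[Orders] WITH FULLSCAN;
GO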

Some people ask me – couldn’t SQL Server have more detailed statistics (larger histograms, etc.)? Yes, it could. But, then the process of reading/accessing these larger and larger statistics would start to get more and more expensive (and take more time, more cache, etc.). Which would – in turn – make the process of optimization more and more expensive. It’s a difficult problem; there are pros and cons / trade-offs everywhere. Really, it’s not quite as simple as just “larger histograms.”

Finally, the process of optimization itself can’t possibly evaluate EVERY combination for execution / processing, can it? No. That would make the sheer process of optimization so expensive that it would be prohibitive!

In Summary:

The best way to think about the process of optimization is that SQL Server MUST find a “good plan fast.” Otherwise, the process of optimization would become so complicated and take so long, it would defeat its own purpose!

So, why are statistics soooooo important:

  • They feed into the whole process of cost-based optimization (and, you WANT cost-based optimization)
  • They need to exist for optimization otherwise you’ll be forced to use heuristics (yuck)
    • IMO – generally, I HIGHLY RECOMMEND turning auto_create_statistics back ON, if you’ve turned it off
  • They need to be reasonably accurate to give better estimates (keeping them up to date is important but some statistics might need more “help” than others staying up to date)
  • They might need even further help (with HOW they’re updated, WHEN they’re updated, or even with ADDITIONAL statistics)

STATISTICS ARE THE KEY TO BETTER OPTIMIZATION AND THEREFORE BETTER PERFORMANCE!
(but, they’re far from perfect!)

I hope that motivates you to consider learning a bit more about statistics. They’re actually easier than you might think AND (obviously) they’re really, really important!

Where can you get more information?

Whitepaper: There’s a VERY dated whitepaper (but one that still has some great concepts) on statistics here: Statistics Used by the Query Optimizer in Microsoft SQL Server 2008

Online Training with Pluralsight: I don’t [yet] have a course just on statistics but – out of necessity – I talk about them in some of my other courses. This course: SQL Server: Optimizing Ad Hoc Statement Performance has the best discussion. Specifically, be sure to review Module 3: Estimates and Selectivity.

TIP: If you’re not a Pluralsight subscriber, shoot an email to Paul at SQLskills dot com and tell him I sent you there for a PS code. That will give you 30 days to check out our SQLskills content on Pluralsight!

In-Person Events: And, we have some in-person events coming up where you can certainly learn a lot more about SQL Server, and specifically statistics:

SQLBits in London – March 31-April 4, 2020

My FULL-DAY workshop: Statistics for Performance: Internals, Analysis, Problem Solving

SQLintersection in Orlando, FL – April 5-10, 2020

My 60 minute conference session: Statistics: Internals, Analysis, and Solutions

SQLskills Immersion Events

IEPTO1 has a few modules (2+ DAYS) related to indexing and statistics!

I HIGHLY recommend that you #NeverStopLearning and that you access as many different mediums / learning paths as possible. This stuff isn’t simple and it takes time to really digest it and LEARN it.

Thanks for reading!
-k

SQLintersection Fall 2017 – 4 weeks to go!

As we head towards our 10th SQLintersection in four weeks, we’re excited to say that it’s once again our most diverse, complete, and information-packed show yet!

One of the pieces of feedback we hear over and over is that attendees love SQLintersection because it’s a smaller, laid-back show, where you get to actually spend time talking with the presenters 1-1. That’s one of the reasons why we love the show so much; *we* get to spend time talking to attendees, rather than being mobbed by hundreds of people after a session ends. And we only pick presenters who we know personally, and who we know to be humble, approachable, and eager to help someone out.

We have 2 pre-con days at the show and with our post-con day, there are 9 full-day workshops from which to choose. We have 40 technology-focused (NOT marketing) sessions from which to choose, plus two SQL Server keynotes, multiple industry-wide keynotes by Microsoft executives, and the ever-lively closing Q&A that we record as a RunAs Radio podcast.

You’ll learn proven problem-solving techniques and technologies you can implement immediately. Our focus is around performance monitoring, troubleshooting, designing for scale and performance, cloud, as well as new features in SQL Server 2014, 2016, and 2017. It’s time to determine your 2008 / 2008 R2 migration strategy – should you upgrade to 2016/2017 directly? This is the place to figure that out!

If you’re interested in how we got here – check out some of my past posts:

  1. SQLintersection: a new year, a new conference
  2. SQLintersection’s Fall Conference – It’s all about ROI!
  3. Fall SQLintersection is coming up soon and we can’t wait!
  4. SQLintersection Conference and SQLafterDark Evening Event – what a fantastic week in Vegas

And I recorded a Microsoft Channel 9 video where I discuss the Spring show – see here.

SQLafterDark

With minimal to no marketing filler, we’ve largely kept our conference focus on ROI and technical content (performance / troubleshooting / tales-from-the-trenches with best practices on how to fix them) but we’ve also added even more social events so that you really get time to intersect with the conference attendees and speakers. The addition of the SQL-specific, pub-quiz-style evening event SQLafterDark was wildly popular at some of our past shows and that’s returning for Fall!


SQLintersection: Great Speakers!

Once again, we think a great show starts with great speakers and current / useful content. All of these speakers are industry experts who have worked in data / SQL for years (some can even boast decades) but all are still focused on consulting and working in the trenches. And, they’re good presenters! Not only will you hear useful content but you’ll do so in a way that’s digestible and applicable. Every speaker is either an MCM (Master), a SQL Server MVP, or a past/present Microsoft employee (or a combination of all three!). But, regardless of their official credentials – ALL are focused on providing the most ROI that’s possible in their session(s) and/or their workshops, and ALL have spoken for SQLintersection multiple times.

Check out this phenomenal list of speakers:

  • Aaron Bertrand – MVP, SentryOne
  • David Pless – MCM, Microsoft
  • Jes Borland – past-MVP, Microsoft
  • Jonathan Kehayias – MCM, MCM Instructor, MVP
  • Justin Randall – MVP, SentryOne
  • Kimberly L. Tripp – MCM Instructor, MVP, past Microsoft, SQLskills
  • Paul S. Randal – MCM Instructor, MVP, past Microsoft, SQLskills
  • Shep Sheppard – past Microsoft, Consultant
  • Stacia Varga – MVP, Consultant
  • Tim Chapman – MCM, Microsoft
  • Tim Radney – MVP, SQLskills

You can read everyone’s full bio on our speaker page here.

SQLintersection: When is it all happening?

The conference officially runs from Tuesday, October 31 through Thursday, November 2 with pre-conference and post-conference workshops that extend the show over a total of up to 6 full days. For the full conference, you’ll want to be there from Sunday, October 29 through Friday, November 3.

  • Sunday, October 29 – pre-con day. There are three workshops running:
    • Data Due Diligence – Developing a Strategy for BI, Analytics, and Beyond with Stacia Varga
    • Performance Troubleshooting Using Waits and Latches with Paul S. Randal
    • SQL Server 2014 and 2016 New Features and Capabilities with David Pless and Tim Chapman
  • Monday, October 30 – pre-con day. There are three workshops running:
    • Building a Modern Database Architecture with Azure with Jes Borland
    • Data Science: Introduction to Statistical Learning and Graphics with R and SQL Server with Shep Sheppard
    • Extended Events: WTF OR FTW! with Jonathan Kehayias
  • Tuesday, October 31 through Thursday, November 2 is the main conference. Conference sessions will run all day in multiple tracks:
    • Check out our sessions online here
    • Be sure to check out our cross-conference events and sessions
    • Get your pop-culture trivia and techie-SQL-trivia hat on and join us for SQLafterDark on Wednesday evening, November 1
  • Friday, November 3 is our final day with three post-conference workshops running:
    • Common SQL Server Mistakes and How to Correct Them with Tim Radney
    • SQL Server 2016 / 2017 and Power BI Reporting Solutions with David Pless
    • Very Large Tables: Optimizing Performance and Availability through Partitioning with Kimberly L. Tripp

SQLintersection: Why is it for you?

If you want practical information delivered by speakers who not only know the technologies but are also competent, consistently highly-rated presenters – this is the show for you. You will understand the RIGHT features to troubleshoot and solve your performance and availability problems now!

Check us out: www.SQLintersection.com.

We hope to see you there!

PS – Use the discount code ‘SQLskills’ when you register and receive $50 off registration!

SQLskills SQL101: Partitioning

Continuing on our path to understanding the basics and core concepts, there’s a big topic that’s often greatly misunderstood and that’s partitioning. I know I’m going to struggle keeping this one an introductory post but I’ve decided there are a few critical questions to ask and a few very important things to consider before choosing a partitioning strategy / design / architecture. I’ll start with those key points!

Partitioning is Not About Performance

I’m often asked – how can I use partitioning to improve this or that… and they’re often queries. I remember one where someone said they had a 3 billion row table and one of their queries was taking 15 minutes to run. Their question was – how can they best partition to improve query performance. I asked if I could look at the query and review their schema (secretly, that meant I wanted to see their indexes ;-)). And, sure enough, I asked if they were open to trying something. I gave them an index to create and asked them to run their query again. Without ANY changes to their table’s structure and only one added index, their query ran in 9 seconds.

Partitioning, like indexing, isn’t all unicorns and rainbows… (check out more fantastic work from tomperwomper by clicking on the image)

Now, don’t forget – indexes aren’t all unicorns and rainbows… indexes have a downside:

  • An index requires disk space / memory / space in your backups
  • An index adds overhead to your inserts / updates / deletes
  • An index needs to be maintained

So, you’d still need to determine if this is the right approach. But, the main point – partitioning really isn’t designed to give incredible gains to your queries. It’s meant to be better for data management and maintenance. However, some partitioning designs can lead to query performance benefits too.

Partitioning for Manageability / Maintenance

The primary reason to consider some form of partitioning is to give you options in dealing with large tables; I tend to start thinking about partitioning as a table approaches 100GB AND it’s not going away (but, even if some of it will regularly “go away” then partitioning is definitely an option to help manage that regular archiving / deleting). I wrote a detailed post about this in response to an MSDN Webcast question I was asked about range-based deletes on a large table. It’s a very long post but it revolves around “how to make range-based deletes faster.” You can review that here. (NOTE: That post was written before we converted our blog to the new format and I’ve not yet gone through and manually converted all of my old posts… so, the formatting isn’t great. But, the content / concepts still apply!)
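
To give you a feel for why partitioning helps there: the range-based "delete" becomes a metadata-only SWITCH instead of row-by-row work. A minimal sketch, with hypothetical table names (the staging table must match the source table's structure and sit on the same filegroup):

ALTER TABLE [dbo].[Sales]
SWITCH PARTITION 2 TO [dbo].[SalesToArchive];
GO

-- The rows now live in [SalesToArchive]; archive them, or just remove them:
TRUNCATE TABLE [dbo].[SalesToArchive];
GO

-- (SQL Server 2016+ can also truncate individual partitions directly:)
-- TRUNCATE TABLE [dbo].[Sales] WITH (PARTITIONS (2));
-- GO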

So the real reasons to consider partitioning are:

  • Your table is getting too large for your current system to manage but not all of that data is accessed regularly (access patterns are really the key). There isn’t a magic size for when this happens; it’s really relative to your hardware (I’ve seen tables of only 40GB or 50GB cause problems on systems that only have 32GB or 64GB of memory). No, you don’t always need to put your entire table in memory but there are many operations that cause the table to end up there. In general, as your tables get larger and larger – many problems start to become more prominent as well. Often, a VLDB (Very Large Database) is considered 1TB or more. But, what I think is the more common problem is the size of your critical tables; when do they start causing you grief? What is a VLT (Very Large Table)? For me, as your table starts to really head toward 100GB, that’s when partitioning should be considered.
  • You have varying access patterns to the data:
    • Some data is recent, critical, and very active
    • A lot of data is not quite as recent, mostly only reads but with some modifications (often referred to as “read-mostly”)
    • The bulk of your data is historical and not accessed often but must be kept for a variety of reasons (including some large / infrequent analysis queries)

And that second point is the most common reason to consider partitioning… but, it’s the first point that’s probably more noticeable. :-)

What Feature(s) Should Be Used for Partitioning

As I’ve been discussing partitioning, I’ve been trying to generalize it as more of a concept rather than a feature (NOT tied directly to either SQL Server feature: partitioned tables [2005+] or partitioned views [7.0+]). Instead, I want you to think of it as a way of breaking down a table into smaller (more manageable) chunks. This is almost always a good thing. But, while it sounds great at first – the specific technologies have many considerations to weigh before choosing one (especially depending on which version of SQL Server you’re working with).

Partitioned Views (PVs)

These have been available in SQL Server since version 7.0. They were very limited in SQL Server 7.0 in that they were query-only. As of SQL Server 2000, they allow modifications (through the PV and to the base tables) but with a few restrictions. There have been some bug fixes and improvements since their introduction but their creation and uses are largely the same. And, even with SQL Server 2016 (and all of the improvements for Partitioned Tables), there are still scenarios where PVs make sense – sometimes as the partitioning mechanism by itself and sometimes combined with partitioned tables.

Partitioned Tables (PTs)

Partitioned tables were introduced in SQL Server 2005 and have had numerous improvements from release to release. This list is by no means exhaustive but some of the biggest improvements have been:

  • SQL Server 2008 introduced partition-aligned indexed views so that you could do fast switching in / out of partitioned tables even when the PT had an indexed view. And, SQL Server 2008 introduced partition-level lock escalation (however, some architectures [like what I recommend below] can naturally reduce the need for partition-level lock escalation).
  • SQL Server 2008 R2 (SP1 and 2008 SP2) offered support for up to 15,000 partitions (personally, I’m not a fan of this one).
  • SQL Server 2012 allowed index rebuilds to be performed as online operations even if the table has LOB data. So, if you want to switch from a non-partitioned table to a partitioned table (while keeping the table online / available), you can do this even when the table has LOB columns.
  • SQL Server 2014 offered better partition-level management with online partition-level rebuilds and incremental stats. These features reduce what’s read for statistics updates and places the threshold to update at the partition-level rather than at the table-level. And, they offer the option to just rebuild the partition while keeping the table and the partition online. However, the query optimizer still uses table-level statistics for optimization (Erin wrote an article about this titled: Incremental Statistics are NOT used by the Query Optimizer and Joe Sack wrote an article about partition-level rebuilds titled: Exploring Partition-Level Online Index Operations in SQL Server 2014 CTP1.)
  • SQL Server 2016 further reduced the threshold for triggering statistics updates to a more dynamic threshold (the same as trace flag 2371) but only for databases with a compatibility mode of 130 (or higher). See KB 2754171 for more information.
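
To make a couple of those partition-level options concrete, here's a rough sketch with hypothetical table / index / statistics names:

-- Partition-level, online rebuild (SQL Server 2014+, Enterprise Edition):
ALTER INDEX [PK_Sales] ON [dbo].[Sales]
REBUILD PARTITION = 3
WITH (ONLINE = ON);
GO

-- Incremental statistics (SQL Server 2014+): track / update per partition:
CREATE STATISTICS [SalesDate_Stats]
ON [dbo].[Sales] ([SaleDate])
WITH INCREMENTAL = ON;
GO

UPDATE STATISTICS [dbo].[Sales] ([SalesDate_Stats])
WITH RESAMPLE ON PARTITIONS (3);
GO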

However, even with these improvements, PTs still have limitations that smaller tables don’t have. And, often, the best way to deal with a very large table is to not have one. Don’t misunderstand, what I’m suggesting is to use PVs with multiple smaller tables (possibly even PTs) unioned together. The end result is one with few, if any, real restrictions (by physically breaking your table into smaller ones, you remove a lot of the limitations that exist with PTs). However, you don’t want all that many smaller tables either as the process for optimization is more complicated with larger numbers of tables. The key is to find a balance. Often, my general recommendation is to have a separate table per year of data (imagine “sales” data), and then for the older years – just leave those as a single, standalone table. For the current and future years, use PTs to separate the “hot” data from the more stable data. By using standalone tables for the critical data you can do full table-level rebuilds online (in EE) and you can update statistics more frequently (and they’re more accurate on smaller tables). Then, as these standalone months stabilize you can switch them into the year-based PTs to further simplify their management moving forward.

Partitioning a VLT into smaller tables (some of which are PTs) can be a HIGHLY effective strategy for optimizing all aspects of table management, availability, and performance. But, it’s definitely a more complex and difficult architecture to get right…

This architecture is complex but highly effective when designed properly. Doing some early prototyping with all of the features you plan to leverage is key to your success.
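
As a rough sketch of that kind of layered design (hypothetical names; per-year standalone tables unioned by a partitioned view, where the CHECK constraints are what let the optimizer eliminate the tables that can't qualify):

CREATE TABLE [dbo].[Sales2016]
(
    [SaleID] int NOT NULL,
    [SaleDate] date NOT NULL
        CONSTRAINT [SalesCK2016]
            CHECK ([SaleDate] >= '20160101' AND [SaleDate] < '20170101'),
    [Amount] decimal (19, 4) NOT NULL,
    CONSTRAINT [SalesPK2016] PRIMARY KEY CLUSTERED ([SaleDate], [SaleID])
);
GO

CREATE TABLE [dbo].[Sales2017]   -- the "current" year; this one could itself be a PT
(
    [SaleID] int NOT NULL,
    [SaleDate] date NOT NULL
        CONSTRAINT [SalesCK2017]
            CHECK ([SaleDate] >= '20170101' AND [SaleDate] < '20180101'),
    [Amount] decimal (19, 4) NOT NULL,
    CONSTRAINT [SalesPK2017] PRIMARY KEY CLUSTERED ([SaleDate], [SaleID])
);
GO

CREATE VIEW [dbo].[Sales]
AS
    SELECT [SaleID], [SaleDate], [Amount] FROM [dbo].[Sales2016]
    UNION ALL
    SELECT [SaleID], [SaleDate], [Amount] FROM [dbo].[Sales2017];
GO

-- A query with a predicate on [SaleDate] only touches the relevant table:
SELECT [SaleID], [SaleDate], [Amount]
FROM [dbo].[Sales]
WHERE [SaleDate] >= '20170301' AND [SaleDate] < '20170401';
GO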

Key Points Before Implementing Partitioning

Partitioning is not really directly tied to performance but indirectly it can be extremely beneficial. So, for a SQL101 post, the most important point is that partitioned views still have benefits; they should not be discounted only because they’re an older feature. Both partitioning strategies provide different benefits; the RIGHT solution takes understanding ALL of their pros/cons. You need to evaluate both PVs and PTs against your availability, manageability, and performance requirements – and, in the most likely case, use them together for the most gains.

Since it’s concepts only, I still feel like it’s a SQL101 post. But, getting this right is a 300-400 level prototyping task after quite a bit of design.

Thanks for reading!
Kimberly