Time flies when you're having fun, eh? This week Paul and I are in Dublin, Ireland delivering a deep technical "SQL Immersion Event" on Internals, Performance Tuning and Maintenance with our partner ProData (and having a blast – we have a really fun and very interactive group). And, today's an even more interesting day to be in Dublin as it's the 250th anniversary of the Guinness Brewery…everyone around Dublin will be toasting the most popular drink in Ireland today (Sept 24) at 5:59pm GMT. Unfortunately, our class runs until 6pm but Paul tells me that we're still going to celebrate. ;-)
For now, we'll just have to celebrate some SQL (I can hear minor grumblings everywhere :)…
Filtered indexes and filtered stats are an incredibly powerful feature in SQL Server 2008 – one that I've posted on before and will again. Today's post is about some of the gotchas with filtered index stats and filtered stats. Cutting to the punchline – you might have some trouble with their accuracy as your data changes. The SIMPLE solution is to update these specific stats manually and more frequently (meaning that you don't want to rely on the database option AUTO_UPDATE_STATISTICS). In fact, depending on your data volatility, you might want to update these statistics on a VERY frequent basis. Part of the question you'll have to answer is how much of the filtered set changes over the course of a day/week/month. And, in whatever frequency you reach 5-10% (let's say weekly), I'd suggest updating the statistics weekly. If you update 5-10% within the course of a day – then daily. Let me give you the full story…
Statistics are traditionally updated when roughly 20% (+ a minimum of 500 rows) of the data has changed. If you have a table with 100,000 rows then SQL Server will hit the threshold at 20,500 rows (just a tad over 20%) and a table with 10,000 rows will hit the threshold at 2,500 rows (which is 25%). Additionally, in SQL Server 200x (2000, 2005 and 2008) statistics are NOT immediately updated when the threshold is reached, instead they are invalidated. It's not until someone needs the statistic that SQL Server updates it. This reduces thrashing that occurred in SQL Server 7.0 when stats were updated immediately instead of just being invalidated. Another interesting point is what is meant by "20% of the data has changed?"… How is that defined? Is it based on updates to columns or inserts of rows? Of course the answer is… it depends – here, it depends on the version of SQL Server that you're using:
- SQL Server 2000 defines 20% as 20% of the ROWS have changed. You can see this in sysindexes.rcmodctr.
- SQL Server 2005/8 defines 20% as 20% of the COLUMN data has changed. You cannot see this unless you are accessing SQL Server through the DAC as it's in a base system table (2005: sysrowsetcolumns.rcmodified and for 2008: sysrscols.rcmodified).
Now, for regular indexes (those that do not include a filter) both strategies have pros and cons. Let's start with the con for SQL 2000… If you have a particularly volatile column then the rowmodctr will be increased quickly and then ALL statistics (even those that have not changed) are invalidated. Without a way to understand a column's volatility, you will invalidate some statistics too soon. So, 2005/2008 created a column modification counter to better define when a statistic should be invalidated. But, the con for SQL 2005/2008 (which is the pro for row-based algorithm of SQL 2000) is that a lot more data must change in a single column before the statistics get updated. If your modifications are fairly well balanced (across the table) then the row-based algorithm could suit you better. If your modifications are isolated (to a specific column) then the column-based algorithm will suit you better. But, neither is perfect… and, the new column-based algorithm poses another HUGE problem for filtered indexes.
For filtered indexes the threshold invalidation algorithm is tied SOLELY to the column and not just to the filtered set. So, if your table has 10,000 rows it takes 2,500 modifications in that column to update statistics. If your filtered index only has 1,000 rows… then you could theoretically modify this specific filtered set 2.5 times before IT would be updated. Yes, I realized that this sounds unlikely but it's really not. What if your filter is for ACTIVE = 1. It's only 1% of your table and it's the most active data set. You will effectively have to modify this set 20 times over before the statistics for your filtered index get updated.
But, having said that – I also have a simple solution. Don't wait… For filtered indexes (especially on volatile data) add a special UPDATE STATISTICS job that specifically updates their stats with a FULLSCAN. The good news is that they are likely to be smaller indexes (and therefore easier/less-costly statistics to update) and you'll clearly get a lot better uses out of them. Additional good news is that many filtered indexes are designed to be covering indexes, covering indexes aren't going to rely as heavily on statistics so there are many covered queries that won't care. The following query will get the list of all of your tables/indexes that have filters. Once you have this list you can create a loop to update statistics.
SELECT object_name(object_id) AS [Table Name]
, name AS [Index Name]
, stats_date(object_id, stats_id) AS [Last Updated]
WHERE has_filter = 1
So, there are lots of options and definitely a few things of which to be aware… filtered indexes and filtered stats are a powerful option to leverage, just make sure you keep those stats up to date!
Cheers… it's time for a pint of Guinness!
PS – For more information on statistics, check out the SQL Server whitepapers:
- Statistics Used by the Query Optimizer in Microsoft SQL Server 2008 (added this after comment #2, please review ALL of the comments as there's a good discussion on the 2008 filtered stats "multiplier" and whether or not it's having the desired effect?! I say no and want to reiterate – filtered index stats need more frequent updating based on your data volatility.)
- Statistics Used by the Query Optimizer in Microsoft SQL Server 2005
- Statistics Used by the Query Optimizer in Microsoft SQL Server 2000