Filtered indexes and filtered stats might become seriously out-of-date

Time flies when you're having fun, eh? This week Paul and I are in Dublin, Ireland delivering a deep technical "SQL Immersion Event" on Internals, Performance Tuning and Maintenance with our partner ProData (and having a blast – we have a really fun and very interactive group). And, today's an even more interesting day to be in Dublin as it's the 250th anniversary of the Guinness Brewery…everyone around Dublin will be toasting the most popular drink in Ireland today (Sept 24) at 5:59pm GMT. Unfortunately, our class runs until 6pm but Paul tells me that we're still going to celebrate. ;-)

For now, we'll just have to celebrate some SQL (I can hear minor grumblings everywhere :)…

Filtered indexes and filtered stats are an incredibly powerful feature in SQL Server 2008 – one that I've posted on before and will again. Today's post is about some of the gotchas with filtered index stats and filtered stats. Cutting to the punchline – you might have some trouble with their accuracy as your data changes. The SIMPLE solution is to update these specific stats manually and more frequently (meaning that you don't want to rely on the database option AUTO_UPDATE_STATISTICS). In fact, depending on your data volatility, you might want to update these statistics on a VERY frequent basis. Part of the question you'll have to answer is how much of the filtered set changes over the course of a day/week/month. Whatever interval it takes for 5-10% of the filtered set to change is the interval at which I'd suggest updating the statistics – if 5-10% changes over a week, update weekly; if 5-10% changes within a day, update daily. Let me give you the full story…

Statistics are traditionally updated when roughly 20% (+ a minimum of 500 rows) of the data has changed. If you have a table with 100,000 rows then SQL Server will hit the threshold at 20,500 rows (just a tad over 20%) and a table with 10,000 rows will hit the threshold at 2,500 rows (which is 25%). Additionally, in SQL Server 200x (2000, 2005 and 2008) statistics are NOT immediately updated when the threshold is reached, instead they are invalidated. It's not until someone needs the statistic that SQL Server updates it. This reduces thrashing that occurred in SQL Server 7.0 when stats were updated immediately instead of just being invalidated. Another interesting point is what is meant by "20% of the data has changed?"… How is that defined? Is it based on updates to columns or inserts of rows? Of course the answer is… it depends – here, it depends on the version of SQL Server that you're using:

  • SQL Server 2000 defines 20% as 20% of the ROWS having changed. You can see this in sysindexes.rowmodctr.
  • SQL Server 2005/2008 defines 20% as 20% of the COLUMN data having changed. You cannot see this unless you access SQL Server through the DAC (Dedicated Admin Connection), as it lives in a base system table (2005: sysrowsetcolumns.rcmodified; 2008: sysrscols.rcmodified).
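If you want to eyeball the row-based counter from the first bullet, the old sysindexes compatibility view still exposes it in 2005/2008 – though in those versions the value is derived from the hidden column counters rather than stored directly, so treat it as approximate. A quick sketch (the table name is hypothetical):

```sql
-- Row modification counter (SQL 2000 semantics; in 2005/2008 the
-- sys.sysindexes compatibility view derives this value)
SELECT OBJECT_NAME(id) AS [Table Name]
     , name AS [Index/Stat Name]
     , rowmodctr AS [Rows Modified Since Last Stats Update]
FROM sys.sysindexes
WHERE OBJECT_NAME(id) = 'YourTable';  -- hypothetical table name
```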

Now, for regular indexes (those that do not include a filter) both strategies have pros and cons. Let's start with the con for SQL 2000… If you have a particularly volatile column then the rowmodctr will climb quickly and ALL statistics on the table (even those on columns that have not changed) are invalidated. Without a way to understand a column's volatility, you will invalidate some statistics too soon. So, 2005/2008 introduced a column modification counter to better define when a statistic should be invalidated. But, the con for SQL 2005/2008 (which is the pro of the row-based algorithm in SQL 2000) is that a lot more data must change in a single column before the statistics get updated. If your modifications are fairly well balanced across the table's columns then the row-based algorithm could suit you better. If your modifications are isolated to a specific column then the column-based algorithm will suit you better. But, neither is perfect… and the new column-based algorithm poses another HUGE problem for filtered indexes.

For filtered indexes the threshold invalidation algorithm is tied SOLELY to the column as a whole – not to the filtered set. So, if your table has 10,000 rows it takes 2,500 modifications to that column before the statistics are invalidated. If your filtered index only covers 1,000 rows… then you could theoretically modify the entire filtered set 2.5 times over before ITS statistics would be updated. Yes, I realize that this sounds unlikely but it's really not. What if your filter is ACTIVE = 1? If that's only 1% of your table and it's also the most active data set, you will effectively have to modify this set 20 times over before the statistics for your filtered index get updated.
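To make the ACTIVE = 1 scenario concrete, here's a sketch of such a filtered index (the table, columns, and index name are hypothetical):

```sql
-- Hypothetical table: 100,000 Orders, only ~1,000 with Active = 1.
-- The filtered index covers just the active rows...
CREATE NONCLUSTERED INDEX [FI_Orders_Active]
ON dbo.Orders (CustomerID, OrderDate)
WHERE Active = 1;
-- ...but auto-update still waits for ~20% of the Active COLUMN across
-- the whole table (500 + 0.20 * 100,000 = 20,500 changes). The
-- 1,000-row filtered set could churn ~20 times over before that fires.
```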

But, having said that – I also have a simple solution. Don't wait… For filtered indexes (especially on volatile data) add a special UPDATE STATISTICS job that specifically updates their stats WITH FULLSCAN. The good news is that filtered indexes are likely to be smaller (and therefore easier/less costly to update) and you'll get much better use out of them. More good news: many filtered indexes are designed to be covering indexes, and covered queries don't rely as heavily on statistics – so many of those queries won't care. The following query will get the list of all of your statistics that have filters. Once you have this list you can create a loop to update the statistics.

SELECT OBJECT_NAME(object_id) AS [Table Name]
       , name AS [Statistics Name]
       , STATS_DATE(object_id, stats_id) AS [Last Updated]
FROM sys.stats
WHERE has_filter = 1;
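One simple way to turn that list into the maintenance job described above is a cursor that builds and runs UPDATE STATISTICS … WITH FULLSCAN for each filtered statistic. This is just a sketch – no error handling, and production code would want to filter out offline databases, log results, etc.:

```sql
-- Update every filtered statistic in the current database WITH FULLSCAN
DECLARE @SchemaName sysname, @TableName sysname, @StatsName sysname;
DECLARE @SQL nvarchar(max);

DECLARE FilteredStats CURSOR FAST_FORWARD FOR
    SELECT OBJECT_SCHEMA_NAME(object_id), OBJECT_NAME(object_id), name
    FROM sys.stats
    WHERE has_filter = 1;

OPEN FilteredStats;
FETCH NEXT FROM FilteredStats INTO @SchemaName, @TableName, @StatsName;
WHILE @@FETCH_STATUS = 0
BEGIN
    SET @SQL = N'UPDATE STATISTICS '
             + QUOTENAME(@SchemaName) + N'.' + QUOTENAME(@TableName)
             + N' ' + QUOTENAME(@StatsName) + N' WITH FULLSCAN;';
    EXEC sp_executesql @SQL;
    FETCH NEXT FROM FilteredStats INTO @SchemaName, @TableName, @StatsName;
END
CLOSE FilteredStats;
DEALLOCATE FilteredStats;
```

Schedule this as a SQL Agent job at whatever frequency matches your filtered set's churn (daily/weekly, per the guidance above).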

So, there are lots of options and definitely a few things of which to be aware… filtered indexes and filtered stats are a powerful option to leverage, just make sure you keep those stats up to date!

Cheers… it's time for a pint of Guinness!


PS – For more information on statistics, check out the SQL Server whitepapers on statistics used by the query optimizer.

17 thoughts on “Filtered indexes and filtered stats might become seriously out-of-date”

  1. Yes, sp_updatestats will work… and, it’s a great point as sp_updatestats is solely based on the index’s rowmodctr. However, sp_updatestats is a bit of a sledgehammer for stats because it will update ALL statistics and indexes if there’s even one row that’s changed (@ind_rowmodctr <> 0). Also, it allows you to supply a sampling rate and even if the index’s stats were recently updated with a FULLSCAN (because of an index rebuild), this will overwrite your statistics with a sampled set (which is OK if the data is perfectly evenly distributed but possibly very bad depending on how skewed your data really is).

    And, @Nicolas… LOL. Thanks!! I didn’t even know that got a rewrite. Better go read! Now I want to see how much they mention about filtered indexes/stats.


  2. The SQL 2008 version of the whitepaper states for filtered statistics:

    "To determine when statistics need to be automatically updated, the query optimizer uses column modification counters. For filtered statistics, the counter is multiplied by the selectivity of the filter, which ranges from 0 to 1, to compensate for the fact that the column counter tracks changes to all rows in the table, not the changes to the rows in the filtered statistics object."

    Time to haul out some code and run some tests….

  3. Hey there Gail – Yep, I read that… and, I’m not seeing that… which is the reason for the post. Feel free to do more testing. But, even if it does get close to even 20%; the problem with filtered indexes is that they’re so specific that more frequent stats will be warranted even with that. But, again, I wasn’t seeing it.

    Let me know if you see differently!


  4. Just did a quick test. A million row table with a filtered index defined over 10000 rows. I had to update the column in that filtered index 257000 times (some rows likely getting updated multiple times) before the stats update fired for the filtered index. An auto-created stats set on the same column (not filtered) fired off an auto-update a few thousand updates earlier. I was starting to think it was never going to fire.

    I want to do some more comprehensive tests, as I’m doing a session on statistics at PASS but, for now, looks like either whitepaper is wrong, I’m doing something wrong or there’s a bug somewhere.

  5. Gail – Yeah… this is where I was at and why I posted this post. Having said that, even with the multiplier (if it does exist) it’s still expecting [potentially] a much larger number of rows (on average) to be modified. As a result, and especially when the modifications are localized to the filtered set (the best example is WHERE Active = 1) then the stats are just never going to be able to keep up. So… I’d still manage them manually as a matter of a best practice!

    Cheers (and, I’ll see you at PASS :),

  6. I am also seeing that a histogram is not created for very selective filtered stats unless a full scan is used. This goes for creation and updates. The bad thing is that even if you update with full scan, auto update can come behind you and blow the histogram away if you have it enabled – although it might take a while, based on your findings.

    I am going to post a repro script later.

  7. Jason – Good point! Something that we can do for filtered stats (since I’m generally recommending micro managing them anyway) is to add STATISTICS_NORECOMPUTE to the index creation or NORECOMPUTE to the stats creation. At that point, at least you won’t lose the FULLSCAN stats from an auto update (if/when it does actually happen).

    So, the best part – is that at least we can control them!
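For example, the two variants mentioned above might look like this (object and column names are hypothetical; NORECOMPUTE keeps a sampled auto update from overwriting your FULLSCAN stats):

```sql
-- Option 1: on the filtered index itself
CREATE NONCLUSTERED INDEX [FI_Orders_Active]
ON dbo.Orders (CustomerID)
WHERE Active = 1
WITH (STATISTICS_NORECOMPUTE = ON);

-- Option 2: on a standalone filtered statistic
CREATE STATISTICS [FS_Orders_Active]
ON dbo.Orders (CustomerID)
WHERE Active = 1
WITH FULLSCAN, NORECOMPUTE;
```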

