When Were Statistics Last Updated for a Heap?

Potty Training Statistic- No Accident in 4 Days

Potty Training Statistic- No Accident in 4 DaysI got a question last week from a very smart fellow:

How can I tell when statistics were last updated for a heap?

Before I could email him back, I soon got another email in which he answered his own question.

I’m not saying that he’s smart because he answered his own question. I’m saying he’s smart because I’ve met him before, and he’s a very insightful person and a great problem solver!

In his later email, he said he was embarrassed about the question. I guess it felt a bit basic. But it’s not a dumb question at all.

I think of this as a “framework” type question

When I learn things, I usually first accumulate a bunch of facts that I associate together loosely. When I figure out a question like this, the answer often tends to snap all those facts into a stronger framework than I had before, and I get to a deeper understanding.

That last little fact that puts your framework together often feels glaringly obvious afterward. But it certainly didn’t before!

We think of statistics as being for a table in SQL Server

Statistics are little descriptive pieces of information that help the optimizer estimate how data is distributed.

Let’s say I’m SELECTing StateCode from dbo.Address WHERE AddressId = 420101.

If dbo.Address has a unique clustered index on AddressId, it will have a statistic associated with the clustered index that helps it understand what AddressIds are present. It also understands that AddressIds are unique, and therefore there’s at most only one row for AddressId = 420101.

But what if dbo.Address was a heap?

A heap is a table without a clustered index.

If dbo.Address if dbo.Address was a heap, there wouldn’t be a nice unique clustered index with a related statistic to give it this information.

Instead, a few things could happen when this query ran:

  1. If there is a nonclustered index that leads on the AddressId column, it will have a statistic associated with it. If the nonclustered index is unique, then that uniqueness will be known as well.
  2. If there is an existing column statistic, information about the cardinality and data ranges can be used from that
  3. If no index or column statistic exists and the database allows column statistics to be automatically created (which is a default setting), then a little column statistic can be created.

But notice, no “heap” statistic will be created.

There are two types of statistics: index statistics and column statistics

Although we tend of thinking as statistics as being for a table (and I tend to have this thought framework as well), this isn’t really the case.

For indexes, statistics describe the data in the key columns of that index. The leading column of the index is particularly well described, as it gets a lot of detail captured about it in a part of the statistic called the histogram.

Column statistics describe individual columns.

What would a statistic be if it described a heap? A heap doesn’t have key columns. There is no “leading” column defined, it’s simply a table structure without an index.

So: when were statistics last updated for a heap?

For a heap, you can look at individual column statistics as well as index statistics and examine which each of them was last updated, with a query like this:

SELECT 
    stat.auto_created,
    stat.name as stats_name,
    sp.last_updated,
    STUFF((SELECT ', ' + cols.name
        FROM sys.stats_columns AS statcols
        JOIN sys.columns AS cols ON
            statcols.column_id=cols.column_id
            AND statcols.object_id=cols.object_id
        WHERE statcols.stats_id = stat.stats_id and
            statcols.object_id=stat.object_id
        ORDER BY statcols.stats_column_id
        FOR XML PATH(''), TYPE
    ).value('.', 'NVARCHAR(MAX)'), 1, 2, '')  as stat_cols,
    stat.filter_definition,
    stat.is_temporary,
    stat.no_recompute,
    sp.modification_counter,
    sp.rows,
    sp.rows_sampled
FROM sys.stats as stat
CROSS APPLY sys.dm_db_stats_properties (stat.object_id, stat.stats_id) AS sp
JOIN sys.objects as so on 
    stat.object_id=so.object_id
JOIN sys.schemas as sc on
    so.schema_id=sc.schema_id
WHERE 
    sc.name= 'dbo'
    and so.name='Test'
ORDER BY 1, 2;
GO

Here’s what the first few columns of the results look like for my heap named dbo.Test:

If you’d like to build the dbo.Test heap and play around with it, get code to do so in this Gist.

Got a question?

Send it to me here!

Index Rebuild, Reorganize, and Statistics in Maintenance Plans in SQL Server 2016

Working with maintenance plans is supposed to be easy, but I find it to be quite difficult.

In part, this is because they can often be buggy. To write this post today, I had to go back and install SSMS 16.5, because I wasn’t able to configure logging or change some aspects about my test maintenance plan in SSMS 17. (I use case sensitive instances, and this bug also impacts things like the maintenance plan log dialog box.)

And in part this is because the documentation for maintenance plans doesn’t tend to be as clear as the documentation for TSQL commands. So in the interest of saving other folks time, I wanted to share what I learned about the Rebuild Index Task, Reorganize Index Task, and Update Statistics Task in SQL Server 2016.

You don’t have to use maintenance plans, but lots of people still do

If you already are happily doing your index and statistics maintenance with Ola Hallengren’s free scripts, Minion Reindex, or a different solution that works well for you – I’m not writing this post to change your mind! Not at all. Go forth and be happy.

If you’re using maintenance plans because you’re not comfortable with other scripts, or your management has a mandate that you only use the built-in tools, then this post is for you.

And just to be clear, I am writing this post to give you reasons to not use maintenance plans, and to go try out one of those scripts. I’m going to step through the improvements added into maintenance plans in SQL Server 2016, and explain why they still have notable flaws.

To be honest, I don’t blame Microsoft for those flaws: there are such great scripts for this generated by the community that I don’t know why they’d put much effort into making index and statistics maintenance perfect in the product. I wouldn’t! If customers want something more flexible, there are multiple free options out there which are highly configurable. Personally, I’m happier if Microsoft’s developer time goes into more cool features inside the products where community scripts can’t help out.

So please don’t read this post as complaints about the product! Really, I just want to encourage you to use cool community scripts.

The index maintenance tasks are historically very inefficient in SQL Server

For older versions of SQL Server (the ones most people use still, since most folks are slow to upgrade), maintenance plans offer a “defragment everything, every time” approach to index maintenance:

  • The Reorganize Index task reorganizes all indexes, regardless of fragmentation level
  • The Rebuild Index task similarly bulldozes through everything and rebuilds every index, without checking fragmentation

There’s not a lot of guidance in the maintenance plan designer, so it’s not uncommon for people to rebuild every index, and then reorganize it afterwards: defragment one way, then defragment another.

Although this is a lot of wasted effort, if you have small databases and a large maintenance window, it may not be a big problem.

But the bigger your data gets, the longer it takes. And more and more often, people like to be able to work nights and weekends, and they expect performance from the database whenever they happen to be active. That maintenance window is shrinking.

SQL Server 2016 added some features to the Rebuild Index Task

We’ve had a very handy dynamic management view in SQL Server to check for fragmentation since SQL Server 2005, but maintenance plans didn’t give you a built-in way to use it until SQL Server 2016. Better late than never, maybe?

Here’s the updated Index Rebuild task. New stuff is highlighted in yellow:

Things that are better for Index Rebuild for Enterprise Edition: Maxdop, Low Priority

You can now configure MAXDOP for index rebuilds in a maintenance plan. This feature has existed since SQL Server 2005, but you had to use TSQL to specify it before. Using more than one core for index rebuild operations is an Enterprise feature. This will work in Enterprise Edition and Developer Edition, but you’re only going to get single threaded rebuilds in Standard Edition.

You can also configure online index rebuilds to wait at low priority. This feature was added in SQL Server 2014, and it’s specific to the Enterprise Online Rebuild option. It can reduce blocking chains for the locks needed for the index maintenance, and it now gives you options about what you’d like to happen after waiting (just like the TSQL).

Things that are better for all editions: Fragmentation Level and Page Count

The ability to skip indexes that are tiny, or which aren’t very fragmented is a big improvement, and brings this task into the modern world. Partly. (More on that below.)

Things that confused me: Index Stats Options – Scan type

I had a serious case of Wishful Thinking (TM) when it came to the bottom part of this dialog box.

The problem was the phrase ‘index stats options’. I was SURE that the word ‘stats’ here was related to the statistics in SQL Server that describe the data distribution in columns or indexes, that are used by the optimizer when generating execution plans. I thought that a feature had been added to this task where it might update index or column statistics if the index wasn’t fragmented enough to qualify for a rebuild (something that the more sophisticated free scripts above do).  I was thinking that ‘Sampled’ was a dynamic sample, and ‘Detailed’ was perhaps FULLSCAN.

Nope. Nope. Nope.  After much confusion, I realized that in fact this has nothing to do with those stats.

Instead, this part of the dialog lets you control how the task checks the level of fragmentation in the index from the sys.dm_db_index_physical_stats DMV: Fast = ‘LIMITED’, Sampled = ‘SAMPLED’, ‘Detailed’=’DETAILED’. (I did some tracing to confirm the mapping.)

It is kinda nice that 2016 now lets you configure this, although generally you just want to use ‘LIMITED’, aka ‘Fast’. This is the default, and it’s what most free scripts out there do.

The Reorganize Index Task got a little makeover in 2016, too

Here is the task, with the new options highlighted:

Reorganize is always single threaded and online, in every edition, so we don’t have options for that. But we can now skip small indexes, skip indexes that aren’t very fragmented, and control the level of thoroughness of the scan in sys.dm_db_index_physical_stats.

What problems do maintenance plans still have for index and statistics maintenance in 2016?

Maintenance plans still have a few problems when it comes to index and stats maintenance, and the more data you have, the more they’ll hurt.

1) You’ll probably end up sampling the same objects for fragmentation more than once

Let’s say that I configure my maintenance plan like this:

Disclaimer: This is a totally simplified maintenance plan and it doesn’t clean up any of its mess at all. I’m showing this to talk about the problems it has and to try to help you justify using a different solution, not to show you best practices with it 🙂

I configure it conservatively, to save effort:

  • The rebuild task only rebuilds indexes over 50% fragmented which can be rebuilt online
  • The reorganize task only reorganizes indexes over 25% fragmented

What if the rebuild index task finds a 20GB index that’s 60% fragmented and rebuilds it?

Well, when reorganize comes along, it will see that the 20GB index isn’t fragmented — but it still has to take the time to sample it all over again to see that, because these are separate tasks. (I did some tracing to confirm that each task scans the fragmentation individually, and they do.)

In contrast, a clever script will step through the indexes, skip low page counts based on metadata, sample fragmentation once, and then decide whether to rebuild or reorganize.

2) These tasks don’t do anything new for index and column statistics (the other kind of stats)

When it comes to performance, maintaining index and column statistics often makes more of a difference than defragmenting your indexes.

Sure, you need to address index fragmentation sometimes. If you never defragment your indexes, you’ll end up with tons of trapped empty space and bloated indexes – and that wastes space not only on disk, but also in memory.

But for many databases, updating index and column statistics — the little objects that help SQL Server estimate how much data there is — can make a major difference to performance, and it’s helpful to update them on a regular basis.

Generally, you want to integrate statistics maintenance with index maintenance, if you’re running both, because rebuilding an index automatically gives it nice fresh statistics (equivalent to updating them with fullscan).

But maintenance plans don’t have a feature to be smart about these stats.

3) The Update Statistics task is waaaaay nastier than it looks at first

One of my least favorite things about maintenance plans is this task, because SQL Server is so much smarter than this! This task is really outdated and very problematic, but you’d never know until it burns you. Here it is with the default values selected:

Problems with this task:

More clever maintenance scripts:

  1. Skip statistics maintenance on an index that has been rebuilt (because it got a stats update with that operation)
  2. Checks other index-related statistics and column statistics to see if they’ve been modified, and skips them if they haven’t had any action
  3. Allows you to use the default dynamic sampling for statistics update. This dynamic sampling is built into the UPDATE STATISTICS command (the maintenance plan just doesn’t give an option for it). For small objects, it’ll decide to scan the whole thing. For larger objects it will take a sampling of rows.

3b) This means that the imperfect workaround of sp_updatestats lives on

Many folks get wise to the fact that the Update Statistics task isn’t their friend after they realize that it’s making their maintenance take forever. Often what they turn to instead is an ‘Execute TSQL Statement’ task. They use it to execute some code calling the built-in procedure sp_updatestats for each database they care about.

While sp_updatestats is definitely a bit smarter (it skips stats with not changes, and it uses the dynamic sampling), it has some flaws itself. Erin Stellato summarizes the problems with sp_updatestats in this post.

4) You don’t get configuration options for partitioned indexes

Feature recap:

  • In SQL Server 2016 SP1, we got the ability to use table partitioning in Standard, Web, and Express Edition
  • In SQL Server 2014, we got the ability to rebuild individual partitions online in Enterprise Edition. In Standard Edition, rebuild is offline, but you have the choice between doing individual partitions and the whole index.

But I don’t see any options about partitioning in those Index Rebuild and Reorganize dialogue boxes, do you?

So I did some basic testing. I pointed my maintenance plan at a partitioned table, and asked it to script out the TSQL it would use (this is an estimate, not a guarantee). Here’s an excerpt of what it gave me:

USE [BabbyNames]
GO
ALTER INDEX [cx_pt_FirstNameByBirthDate_1966_2015] ON [pt].[FirstNameByBirthDate_1966_2015] REBUILD PARTITION = 5 WITH (SORT_IN_TEMPDB = OFF, ONLINE = ON (WAIT_AT_LOW_PRIORITY (MAX_DURATION = 0 MINUTES, ABORT_AFTER_WAIT = NONE)), DATA_COMPRESSION = ROW)
GO
USE [BabbyNames]
GO
ALTER INDEX [cx_pt_FirstNameByBirthDate_1966_2015] ON [pt].[FirstNameByBirthDate_1966_2015] REBUILD PARTITION = 6 WITH (SORT_IN_TEMPDB = OFF, ONLINE = ON (WAIT_AT_LOW_PRIORITY (MAX_DURATION = 0 MINUTES, ABORT_AFTER_WAIT = NONE)), DATA_COMPRESSION = ROW)
GO
USE [BabbyNames]
GO
ALTER INDEX [cx_pt_FirstNameByBirthDate_1966_2015] ON [pt].[FirstNameByBirthDate_1966_2015] REBUILD PARTITION = 13 WITH (SORT_IN_TEMPDB = OFF, ONLINE = ON (WAIT_AT_LOW_PRIORITY (MAX_DURATION = 0 MINUTES, ABORT_AFTER_WAIT = NONE)), DATA_COMPRESSION = ROW)
GO

So it appears that at least on SQL Server 2016 SP1, it defaults to partition level rebuilds, whether you ask for them or not. It also appears to check the fragmentation level for each partition, and skip those beneath the specified fragmentation level.

If you want to skip any partitions all the time or rebuild some indexes entirely and not at the partition level, there’s no option for that.

5) There’s no good options for exclusion

I may have some large objects in my database that I don’t want to regularly maintain. One common example are logging tables, where data may be frequently inserted, but they are queried only rarely, if internal staff need to investigate a problem. You may have some indexes on the tables to help with these queries, but you don’t care much about their performance.

If you don’t have a lot of time for maintenance, you probably want to skip these tables most of the time, and only do the barest of maintenance once a month, or possibly even less frequently. This can let you focus on the objects you care about in limited time.

Maintenance plans don’t have a good way to do this. You can select specific objects, but not exclude specific objects, and you have to configure it on every task. This can lead to inconsistencies, or to new objects not being picked up by the maintenance plans.

What to do?

If you read all the way through this looooong post, I’m guessing that you’re not completely happy with your maintenance plan. You’d like to improve it.

Good news! There are lots of great ways you can do that. The SQL Server community has lots of clever folks who built them and want to share them with you.

Your best option is to look at the websites for Ola Hallengren’s free scripts and Minion Reindex. Check out the documentation a little. Decide which you’d like to test out, and grab the code and put it on a test server. Make a change plan to replace your maintenance plans with one of those instead.

What if you have a mandate from your management to use the built-in maintenance plans?

This would be my question: is it OK for us to run custom code in maintenance plans? Like for that situation where the built-in Update Statistics command doesn’t work, is it OK for me to run a script that’s smarter about statistics maintenance?

If there’s any leeway at all for that, then you’ve got a little bit of an open door to work with. You can start the process of getting external free scripts and their licenses reviewed.

If there isn’t any leeway at all for that, then I would make the best of what I have, and keep an eye out for places where we have problems due to the limitations in maintenance plans. If you come across incidents where using an improved script might work, then you have a good opportunity to bring it up again in a friendly way.

How Much Longer Does it Take To Update Statistics with FULLSCAN?

When I was recently working on the course, “Should Developers Manage Index Maintenance?” I explained that in my experience, statistics maintenance can make more of a difference to performance than index maintenance can.

I also noted that one of the big “maintenance goofs” that I’ve made in the past is to be overly eager to update statistics. And to update them with FULLSCAN.

Here’s some detail on why doing that can be so slow, and how it can eat up more resources than you might think. (This is a long one, so scroll on down to the end of the post for a list of spoilers, if you like.)

When you update statistics against a table, SQL Server may have to scan the table many, many times

When people manually update statistics, they generally don’t update just a single column stats, or stats for a single index. They identify a table that’s a problem, and create a command to update stats against the whole table.

The command to update statistics against an entire table looks something like this:

UPDATE STATISTICS dbo.FirstNameByBirthDate_1966_2015;
GO

But then, if they’ve gone to this trouble, they think, “I should try to make the updated statistics as accurate as possible!”

The obvious way to do that it to tell SQL Server to do more than just take a sample of the data: instead to do it with FULLSCAN.

So they use a command like this:

UPDATE STATISTICS dbo.FirstNameByBirthDate_1966_2015 WITH FULLSCAN;
GO

How much longer does FULLSCAN take?

On my test instance, the command that uses the default sampling takes 6 seconds to complete.

The command which adds “WITH FULLSCAN” takes just over five minutes to complete.

The reason is that those two little words can add a whole lot of extra IO to the work of updating statistics.

What is update statistics really doing?

My table is pretty narrow. It has only six statistics on it: I queried information about them with a query like this.

The three statistics with the funny names beginning in _WA_Sys are column statistics that SQL Server automatically created when I ran queries with joins or where clauses using those columns. The other three statistics were automatically created when I created indexes.

I ran a trace when I updated statistics with FULLSCAN, and here’s what I saw, stat by stat…

1) FakeBirthDateStamp and FirstNameByBirthDateId (cx_FirstNameByBirthDate_1966_2015)

The clustered index has a two-column statistic. Those two columns match up with this query in the trace:

SELECT StatMan([SC0], [SC1]) FROM 
(SELECT TOP 100 PERCENT [FirstNameByBirthDateId] AS [SC0], [FakeBirthDateStamp] AS [SC1] FROM [dbo].[FirstNameByBirthDate_1966_2015] WITH (READUNCOMMITTED)  ORDER BY [SC0], [SC1] ) 
AS _MS_UPDSTATS_TBL  OPTION (MAXDOP 4)

Here’s the plan I saw in the trace:

There are only two columns output from the clustered index scan: FirstNameByBirthDateId and FakeBirthDateStamp, the two columns needed to update the stat.

(Note: this plan does seem a bit odd to me, as it could have found these columns in a non-clustered index: these are the key columns in the clustered index, which will be present in each nonclustered index, whether we ask for it or not. But it chose to use the clustered index.)

2) FakeBirthDateStamp column stat (_WA_Sys_00000001_4DB4832C)

Next, SQL Server went to work on the FakeBirthDateStamp column statistic.

“But wait!” you might think. “We just got that info when updating the stat for the clustered index!”

Yes, we did. But we’re going to go scan something else, anyway, because that’s how we roll. We need to independently collect data for each statistic, even though you ran the command against the whole table.

Here’s the query that SQL Server runs next, which lines up with the column stat:

SELECT StatMan([SC0]) 
FROM (SELECT TOP 100 PERCENT [FakeBirthDateStamp] AS [SC0] 
FROM [dbo].[FirstNameByBirthDate_1966_2015] WITH (READUNCOMMITTED)  ORDER BY [SC0] ) 
AS _MS_UPDSTATS_TBL  OPTION (MAXDOP 4)

Here is the plan it used to get the data to update the column statistic:

This time, SQL Server chose to scan the nonclustered index on FirstNameId. Looking at the properties of the scan, it figured out that FakeBirthDateStamp would be there (because of the clustering key), and decided to scan this nonclustered index and output just that column:

But … oops! We didn’t allocate enough memory for our sort and had a little spill in tempdb.

3) BirthYear (_WA_Sys_00000003_4DB4832C)

If you’re still reading, here’s where things get crazier than I expected.

BirthYear is a computed column. SQL Server uses the following query to gather data to update my column statistic…

SELECT StatMan([SC0]) 
FROM (SELECT TOP 100 PERCENT [BirthYear] AS [SC0] 
FROM [dbo].[FirstNameByBirthDate_1966_2015] WITH (READUNCOMMITTED)  
ORDER BY [SC0] ) AS _MS_UPDSTATS_TBL  OPTION (MAXDOP 4)

And here’s the plan I got:

Here’s the play by play of what that plan is doing:

  • Scan the nonclustered index on FirstNameId, and output the “secret” FakeBirthDateStamp column (which is there because it’s part of the clustered index)
  • Use compute scalar operators to apply the formula for the BirthYear column and compute it for every row
  • Then we have our tempdb spill again

Computed columns can have statistics on them. That’s a good thing. I don’t have a non-clustered index on this column for SQL Server to scan, but I was surprised that it wanted to re-compute every single row for it, because I did mark this column as ‘persisted’ (I double-checked with a query).

But this time, SQL Server really didn’t want to scan that clustered index again (we just did it, after all), so it decided to recompute every. Single. Row.

We’re not done yet, though. We’re only halfway through the statistic!

4) FirstNameId (ix_FirstNameByBirthDate_1966_2015_FirstNameId)

We’ve scanned every row in the nonclustered index on FirstNameId twice already. But we haven’t actually updated its statistic yet, so… you guessed it, let’s scan it again!

The query to gather data to update this stat is…

SELECT StatMan([SC0], [SC1], [SC2]) 
FROM (SELECT TOP 100 PERCENT [FirstNameId] AS [SC0], [FirstNameByBirthDateId] AS [SC1], [FakeBirthDateStamp] AS [SC2] 
FROM [dbo].[FirstNameByBirthDate_1966_2015] WITH (READUNCOMMITTED)  
ORDER BY [SC0], [SC1], [SC2] ) AS _MS_UPDSTATS_TBL  OPTION (MAXDOP 4)

And the plan was…

Look on the bright side: it may be the third time we’ve scanned this nonclustered index, but at least this time we didn’t have any tempdb spills.

5) StateCode, Gender (ix_FirstNameByBirthDate_1966_2015_StateCode_Gender_INCLUDES)

We’ve got another nonclustered index, and it has two key columns. Those two columns are in the auto-generated index statistic. To gather information for them, SQL Server runs this query:

SELECT StatMan([SC0], [SC1], [SC2], [SC3]) 
FROM (SELECT TOP 100 PERCENT [StateCode] AS [SC0], [Gender] AS [SC1], [FirstNameByBirthDateId] AS [SC2], [FakeBirthDateStamp] AS [SC3] 
FROM [dbo].[FirstNameByBirthDate_1966_2015] WITH (READUNCOMMITTED)  ORDER BY [SC0], [SC1], [SC2], [SC3] ) 
AS _MS_UPDSTATS_TBL  OPTION (MAXDOP 4)

This gets a plan that scans the associated nonclustered index:

6) Gender (_WA_Sys_00000006_4DB4832C)

Whew, I’m glad this is a narrow table. We have a column statistic on Gender. This might look like a duplicate stat to the index statistic — but note that the index statistic leads on StateCode. That turns out to make it quite different (because only the leading column in a statistic gets information in the histogram). So the column statistic on Gender only is really quite different.

Again, SQL Server can’t re-use any of the information it previously scanned. It runs this query:

SELECT StatMan([SC0]) 
FROM (SELECT TOP 100 PERCENT [Gender] AS [SC0] 
FROM [dbo].[FirstNameByBirthDate_1966_2015] WITH (READUNCOMMITTED)  ORDER BY [SC0] ) 
AS _MS_UPDSTATS_TBL  OPTION (MAXDOP 4)

Which gets this execution plan:

Confession: this execution plan makes me a little sad. I knew I wasn’t going to get a happy ending for this post, but I was really rooting for it to at least scan the nonclustered index on StateCode and Gender (which is physically much smaller).

Nope. It decided to scan the whole clustered index, again, this time to output the Gender column. And it spilled 97K pages in tempdb.

What I used to see this info

I set up a quick and sloppy extended events trace to get the following events:

  • query_post_execution_showplan – this impacts performance when you collect it, so I also updated statistics without the trace running to measure basic timing
  • sp_statement_completed – this shows you information for each “SELECT StatMan” statement run behind the scenes
  • sql_statement_completed – this gives you overall information for the whole ‘UPDATE STATISTICS dbo.FirstNameByBirthDate_1966_2015 WITH FULLSCAN;’ when it finishes

The query execution plan screenshots are from the free SentryOne Plan Explorer, taken with Snagit.

Summing up and takeaways

Here’s what you need to know:

  • If you must manually update statistics, stick with the default sampling (unless you have a great reason to do otherwise)
  • If you must manually update statistics, update a specific column or index stat only (unless you have a great reason to do otherwise)
  • If you can come up with a better way to get stabile high performance, such as tuning queries or indexes, it removes this headache

This is because:

  • For every statistic you update (even as part of a command to update statistics on a whole table), SQL Server has to do a separate read operation
  • When you use FULLSCAN, this means scanning a table or an index
  • SQL Server may choose to recompute all the values in a computed column, even if it’s persisted, when you update that column statistic with FULLSCAN
  • SQL Server may not always choose to scan a nonclustered index instead of the clustered index, even when that seems like a viable option
  • Updating statistics can generate activity you might not expect, like spills in tempdb, if it underestimates how many resources it will need for things like SORT operators in the plan

Who knew that so much weirdness could come from such a simple command?!?!?

 

When a Nonclustered Index and Statistics Make a Query Slower

Nonclustered indexes are awesome in SQL Server: they can get you huge performance gains

But we can’t always create the perfect index for every query. And sometimes when SQL Server finds an index that isn’t quite perfect and decides to use it, it might make your query slower instead of faster.

Here’s an example of why that can happen, and some reasons why if you hit something like this, the FORCESCAN table hint may help (and also why you should be careful with that).

The setup

I’ve restored the large BabbyNames sample database and created a couple of indexes:

USE BabbyNames;
GO

CREATE INDEX ix_ref_FirstName_INCLUDES on ref.FirstName (FirstName) INCLUDE (FirstNameId);
GO

  
CREATE INDEX ix_dbo_FirstNameByBirthDate_1966_2015_FirstNameId on dbo.FirstNameByBirthDate_1966_2015 (FirstNameId)
GO

Meet my slow query

My query has a simple join, and a GROUP BY clause. We’re looking for the count of all the babies born named “Matthew”, grouped by Gender:

SELECT Gender, COUNT(*)
FROM dbo.FirstNameByBirthDate_1966_2015 AS fnbd
JOIN ref.FirstName as fn on
  fnbd.FirstNameId=fn.FirstNameId
where fn.FirstName='Matthew'
GROUP BY Gender;
GO

SQL Server thinks this query is going to be cheap

The query has cost of 19.75. SQL Server estimates that it’s going to get 6,016 rows back from dbo.FirstNameByBirthDate_1966_2015 for all those Matthews.

Since that’s not a lot of rows, it decides to use the narrow nonclustered index on FirstNameId to find the Matthews, and then go do a key lookup to pick up the Gender column in the clustered index of the table. Here is what the estimated execution plan looks like:

But SQL Server is underestimating the number of Matthews by a lot

This table has a row for every baby named Matthew in the United States between 1966 and 2015. There are a lot more Matthews than 6K. Our estimate is off by around 1.4 million.

When SQL Server chooses this plan, the query takes around 13 seconds to execute. Running all those nested loops with a single thread isn’t quick!

Spoiler: if this nonclustered index didn’t exist (or wasn’t selected), the query would execute in around 5 seconds (instead of 13 seconds), just scanning the clustered table.

Why is the estimate so low?

I’m using the new cardinality estimator in SQL Server, so I used a few trace flags (more on those from Paul White here) to get SQL Server to write out some information about which statistics it used to my ‘Messages’ tab when I compiled the query:

SELECT Gender, COUNT(*)
FROM dbo.FirstNameByBirthDate_1966_2015 AS fnbd
JOIN ref.FirstName as fn on
  fnbd.FirstNameId=fn.FirstNameId
where fn.FirstName='Matthew'
GROUP BY Gender
OPTION
(   RECOMPILE,
    QUERYTRACEON 3604,
    QUERYTRACEON 2363
)

Among lots of other details, it told me “Loaded histogram for column QCOL: [fnbd].FirstNameId from stats with id 4”

I used dynamic management views to confirm that my statistic with id 4 on FirstNameByBirthDate_1966_2015 is the stat for the index ix_dbo_FirstNameByBirthDate_1966_2015_FirstNameId.

So let’s look at that stat!

DBCC SHOW_STATISTICS (FirstNameByBirthDate_1966_2015, ix_dbo_FirstNameByBirthDate_1966_2015_FirstNameId);
GO

Here’s the statistic, with a few minor embellishments…

click for a larger image

Things to notice, by number:

  1. “Rows Sampled” in the header is equal to all the rows in the table. This statistic was created when the index was created, and so it automatically got “fullscan”. SQL Server sampled all of the rows when it created this stat.
  2. The blue highlighted row is in the section called the Histogram of the statistic. The FirstNameId for ‘Matthew’ is 28,073. This value happened to get its own row in the histogram, and SQL Server knows that there are approximately 1,451,969 rows in this table of Matthews. That’s way more than the estimate of  6,016! The histogram didn’t get used.
  3. The “All density” number for just the FirstNameID column is 3.774155E-05 — aka 0.00003774155. This number is in what’s called the “density vector” for the index, and it’s used for when SQL Server is going to guess how many rows come back for an “average” FirstNameId (not a specific FirstNameId). And looking at our query, we didn’t give it a specific FirstNameId. We put the predicate over on ref.FirstName, using the FirstName column.
  4. Rows = 159,405,121. This is the other ingredient to our formula to “guess how many rows exist for an average FirstNameId.”
    • All density * rows =  0.00003774155 * 159,405,121 = 6,016.19634447755

To be clear, the problem here is not the new cardinality estimator. The legacy cardinality estimator also chooses a nested loop plan and is just as slow for this query.

The slow performance is because of the way we wrote our query

We wrote our query putting a predicate on FirstName on the dimension table, ref.FirstName, then joined over to dbo.FirstNameByBirthDate_1966_2015. SQL Server has to generate the execution plan for the query before it runs. It can’t query ref.FirstName and find out what the FirstNameId is for Matthew, then use that to figure out what kind of join to use.

(At least not yet. Plans like this can potentially be fixed by the Adaptive Join feature in future releases, but it doesn’t cover this in the initial SQL Server 2017 release. They can’t tackle everything at once.)

Instead, SQL Server has to say, “Well, for any given name that I join on, what looks like the best bet?”

And that’s why it uses the density vector on the statistic multiplied by the rowcount. There’s nothing wrong with the statistic.

What if we wrote our query differently and specified the FirstNameId for Matthew?

SELECT Gender, COUNT(*)
FROM dbo.FirstNameByBirthDate_1966_2015 AS fnbd
JOIN ref.FirstName as fn on
  fnbd.FirstNameId=fn.FirstNameId
where fn.FirstNameId = 28073
GROUP BY Gender

In this case, SQL Server looks at the histogram for the FirstNameId and gives us a very different estimated plan:

But it can only do this because we gave it the FirstNameId.

Knowing about all those Matthews at the time of compilation, SQL Server decides it may as well just go ahead and scan the clustered index using parallelism. It also asks for a nonclustered index on Key (FirstNameId) INCLUDE (Gender).

This query takes around 5 seconds (instead of 13 seconds). Scanning that clustered index isn’t awesome, but it’s more than twice as fast as using the imperfect nonclustered index on FirstNameId and then going back to look up Gender.

Is there another way to fix this?

This is a pretty simple example. You can’t always just change the query to specify just the right predicate like FirstNameId = 28073.

Sometimes, the FORCESCAN table hint can help. Here’s what the documentation explains (in brief, click to read more detail including limitations):

Introduced in SQL Server 2008 R2 SP1, this hint specifies that the query optimizer use only an index scan operation as the access path to the referenced table or view. The FORCESCAN hint can be useful for queries in which the optimizer underestimates the number of affected rows and chooses a seek operation rather than a scan operation.

If we know that every time we want this query to run, we do not want it to do a seek on an index, we could add this hint. We don’t even have to specify an index with it, we can do this:

SELECT Gender, COUNT(*)
FROM dbo.FirstNameByBirthDate_1966_2015 AS fnbd with (FORCESCAN)
JOIN ref.FirstName as fn on
  fnbd.FirstNameId=fn.FirstNameId
where fn.FirstName='Matthew'
GROUP BY Gender;
GO

We get a slightly different estimated plan…

But again the query completes in around 5 seconds instead of 13.  And again it registers a request for an index on Key (FirstNameId) INCLUDE (Gender).

What if we use FORCESCAN and then someone adds the perfect index?

Let’s say that we found this slow query, and for whatever reason it didn’t make sense to give it the “perfect” index. So we added a FORCESCAN hint to make it faster.

But later on, we found other queries who would use the same index that this query wanted. And eventually we ditched the index on just FirstNameId and replaced it with:

CREATE INDEX ix_dbo_FirstNameByBirthDate_1966_2015_FirstNameId_Includes on dbo.FirstNameByBirthDate_1966_2015 (FirstNameId) INCLUDE (Gender)
GO

This index is great for this query! It makes our original query (with no hints) complete in less than half a second, using a simple index seek!

But if we have that FORCESCAN hint in place, SQL Server can use the new, perfect index but…. yeah, it’s forced to scan it. The hint name isn’t kidding.

Forcing the scan of the perfect index takes around six seconds.

Takeaways

When you add a non-clustered index, it’s possible for this to cause performance regressions in some of your queries.

When you see a low estimate in a query in SQL Server, the problem isn’t always that the statistics are “bad”. SQL Server may have very accurate data in the statistics, but may not be able to use it perfectly because of the way the query is written.

The FORCESCAN table hint can be used to change the behavior of a query without specifying an index  by name — but if you use this hint, you need to periodically check and make sure that the hint is still making the query faster, not slower.

Want to learn about more query hints?

Check out the course, Query Tuning with Hints & Optimizer Hotfixes!