Wednesday, September 29, 2010

Index Operations Showdown: Drop & Create vs. Create With Drop_Existing vs. Rebuild

Every now and then you may need to move nonclustered indexes between filegroups. There are two ways to do it: drop the existing indexes and then create new ones, or execute a CREATE INDEX statement with the DROP_EXISTING = ON option. At face value it may appear that SQL Server does the same work either way, but in reality one approach carries significantly higher overhead than the other. I'm going to show the differences between the two, and for fun I'll also look at what SQL Server is doing under the hood when you rebuild a nonclustered index.

(Don't care about how and what I did, trust me that I'm right, and just want to get to the endgame? Then skip to the conclusions. Otherwise, read on…)

The Setup
To get a real world example I used a production database that's been restored on a test server running SQL 2008 Standard SP1 CU 9 on Windows Server 2003 Standard x86. Data + Indexes take up ~20 GB, there are two filegroups - PRIMARY and INDEXES, and all clustered indexes have been rebuilt with a FILLFACTOR of 100. The SQL service was restarted in between each test to ensure that nothing hanging around in cache from a previous test influenced the next. The tests I ran were:

  1. Drop indexes in the PRIMARY filegroup and create new indexes in the INDEXES filegroup. Do the same thing back the other direction.
  2. CREATE INDEX with DROP_EXISTING = ON, indexes originating in the PRIMARY filegroup and created in the INDEXES filegroup. Again, do the same thing back the other direction.
  3. Index rebuild, once with the indexes in the PRIMARY filegroup and once with the indexes in the INDEXES filegroup.

In each test I specified PAD_INDEX = ON, FILLFACTOR = 80 to leave 20% free space in the intermediate and leaf pages of the nonclustered indexes. I used Profiler to capture reads, writes, CPU, and duration for the batch, and the sys.dm_io_virtual_file_stats DMV to see reads and writes specific to the files in the PRIMARY and INDEXES filegroups.
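In T-SQL terms the three operations look roughly like this. The table and index names below are made up for illustration; the real tests ran the equivalent statements against every nonclustered index in the database:

```sql
-- Test 1: drop, then create from scratch (two separate operations)
DROP INDEX IX_Orders_CustomerID ON dbo.Orders;
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID)
    WITH (PAD_INDEX = ON, FILLFACTOR = 80)
    ON [INDEXES];

-- Test 2: create with DROP_EXISTING = ON (a single operation that can
-- also move the index to another filegroup via the ON clause)
CREATE NONCLUSTERED INDEX IX_Orders_CustomerID
    ON dbo.Orders (CustomerID)
    WITH (DROP_EXISTING = ON, PAD_INDEX = ON, FILLFACTOR = 80)
    ON [INDEXES];

-- Test 3: rebuild in place (the index stays in its current filegroup)
ALTER INDEX IX_Orders_CustomerID ON dbo.Orders
    REBUILD WITH (PAD_INDEX = ON, FILLFACTOR = 80);
```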

Results
Here's the breakdown of each test run:

| Operation                | From    | To      | Profiler Reads | Profiler Writes | CPU       | Duration (ms) | Filegroup | File Reads | File Writes |
|--------------------------|---------|---------|----------------|-----------------|-----------|---------------|-----------|------------|-------------|
| Create w/ Drop Existing  | PRIMARY | INDEXES | 1,631,186      | 445,530         | 699,512   | 196,874       | PRIMARY   | 15,202     | 149         |
|                          |         |         |                |                 |           |               | INDEXES   | 9          | 17,297      |
| Create w/ Drop Existing  | INDEXES | PRIMARY | 1,634,950      | 445,563         | 712,171   | 197,892       | PRIMARY   | 189        | 21,844      |
|                          |         |         |                |                 |           |               | INDEXES   | 9,566      | 50          |
| Drop & Create            | PRIMARY | INDEXES | 3,639,243      | 445,796         | 1,787,090 | 518,424       | PRIMARY   | 106,298    | 248         |
|                          |         |         |                |                 |           |               | INDEXES   | 34,517     | 53,089      |
| Drop & Create            | INDEXES | PRIMARY | 3,662,077      | 445,976         | 1,780,435 | 518,943       | PRIMARY   | 140,967    | 56,415      |
|                          |         |         |                |                 |           |               | INDEXES   | 8          | 90          |
| Rebuild                  | PRIMARY | PRIMARY | 1,634,969      | 445,546         | 693,139   | 192,294       | PRIMARY   | 14,998     | 21,882      |
|                          |         |         |                |                 |           |               | INDEXES   | 0          | 0           |
| Rebuild                  | INDEXES | INDEXES | 1,635,322      | 445,513         | 707,967   | 195,565       | PRIMARY   | 171        | 79          |
|                          |         |         |                |                 |           |               | INDEXES   | 9,237      | 18,286      |

(Profiler columns are per batch; the File Reads and File Writes columns come from sys.dm_io_virtual_file_stats for the files in each filegroup.)

Observant eyes may notice the huge difference between the reads & writes reported by Profiler and those from sys.dm_io_virtual_file_stats. On the read side it's because Profiler reports logical reads while sys.dm_io_virtual_file_stats shows physical reads. Writes are a different story: Profiler reports physical writes, and BOL says only that sys.dm_io_virtual_file_stats shows the "Number of writes made on this file". Louis Davidson seems to think that means physical writes, and so does Dave Turpin, but clearly the Profiler and DMV numbers don't match up. Also interesting is how consistent the write counts from Profiler are across tests, while the writes from sys.dm_io_virtual_file_stats are all over the map. In any case, for this exercise the difference doesn't matter; I'm not comparing the capture methods against one another, I'm using them to corroborate each other for each index operation.
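For reference, the per-file counters came from a query along these lines, sampled before and after each test and diffed (this is the general shape, not the exact script I used):

```sql
-- Physical reads and writes per file of the current database,
-- cumulative since the SQL Server service last started
SELECT df.name AS [File],
       df.physical_name,
       fs.num_of_reads,
       fs.num_of_writes
FROM   sys.dm_io_virtual_file_stats(DB_ID(), NULL) AS fs
       INNER JOIN sys.database_files AS df
           ON fs.file_id = df.file_id;
```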

Conclusions
Based on the test results we can draw a few conclusions:

  • The reads & writes from sys.dm_io_virtual_file_stats show that when rebuilding an index, or creating it with DROP_EXISTING = ON, the SQL engine reads from the existing index pages rather than from the clustered index. When creating a brand-new nonclustered index, however, the engine reads from the clustered index (or the heap if no clustered index exists).
  • Rebuilding a nonclustered index and creating the index with DROP_EXISTING = ON have the same I/O and CPU cost and take roughly the same amount of time to complete.
  • Dropping an index and then creating it again averages 2-3 times more I/O, CPU, and duration than rebuilding or creating the index with DROP_EXISTING = ON.

So at the end of the day the lesson is this: if you need to move a nonclustered index to another filegroup, stick with the CREATE INDEX…WITH (DROP_EXISTING = ON) syntax. Now you know!

Wednesday, September 15, 2010

Presenting At Sarasota SQL Server Users Group On Sept. 21

On Tuesday, September 21, I'll be at the Sarasota SQL Server Users Group to present Paging DR Availability, You're Wanted in the Recovery Room. This is my first time at the Sarasota UG and I'm excited to meet new people, share something useful, and sneak in a word about the upcoming SQLRally. Here's what I'll be covering:

There are a lot of options when it comes to disaster recovery and high availability with SQL Server. Some, such as log shipping, replication, clustering, and mirroring, ship with SQL Server. Others, such as disk-level replication and virtualization, rely on third-party products. Most business owners (and many DBAs) lump disaster recovery (DR) and high availability (HA) together, and while they do share some pieces, they call for different strategies. Do you pick one or more than one of these options, and based on what? When does it make sense to move from SQL mirroring to SAN mirroring? Does it ever make sense to do log shipping and replication on the same database? Even if you haven't yet mastered all of these options, it's incredibly important that you understand the decision tree that helps you pick the right one(s) for your business, and that you can explain the choices clearly to the stakeholders. It's not as complicated as it sounds, but it is complicated - and in this hour-long presentation we'll give you a high-level understanding of the options, the costs, complexities, and reasons for using each of them.

The meeting starts at 6 PM and is being held at the Community Foundation of Sarasota County, 2635 Fruitville Rd., Sarasota, FL 34237.

I hope to see you there!

Friday, September 10, 2010

PASS Summit Logo Design Contest

It appears our success with the SQLRally Logo Design Contest has struck a chord with PASS HQ. This morning Rick Heiges (Blog | Twitter) made an announcement on Twitter:

Summit Logo Contest Announcement

I'm absolutely thrilled that PASS thought highly enough of our SQLRally contest results to run the same kind of contest for THE top SQL conference in the world. I'm equally excited to see what kind of logos we get. One of the lessons the SQLRally team learned from the previous contest was that some designers may hold back their best work from open contests in an attempt to keep other people from copying their ideas. In that vein, this new contest is "blind" - designers can only see their own entries and submitted designs won't be revealed until the contest ends.

Interested in entering? Go check out the design brief. Hurry though, the contest is only open for one week!

Thursday, September 9, 2010

More On Overlapping Statistics

In my last post I provided a script to identify overlapping statistics but realized afterwards that I left out a few points that complete the big picture about why overlapping statistics matter.

What Causes Overlapping Statistics
Boiling this down to basics, here's how it happens: AUTO_CREATE_STATISTICS is set to ON, SQL Server creates column statistics based on the queries that get executed, and then an index is created on the same column, producing an overlapping set of statistics. One real-world example is applying a missing index suggested by SQL Server to help improve performance. Looking at missing indexes is a very common performance tuning practice, and I suspect that many people have unknowingly created overlapping statistics by applying missing index suggestions.
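A minimal sketch of that sequence, with invented table and column names:

```sql
-- With AUTO_CREATE_STATISTICS ON, this query prompts SQL Server to
-- auto create a column statistic (named _WA_Sys_...) on Status
SELECT OrderID FROM dbo.Orders WHERE Status = 'Open';

-- Later, applying a suggested missing index creates index statistics
-- with the same leading column -- the two sets now overlap
CREATE NONCLUSTERED INDEX IX_Orders_Status ON dbo.Orders (Status);
```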

Why Overlapping Statistics Can Cause Problems
Paul White (Blog | Twitter) pointed me at Batch Compilation, Recompilation, and Plan Caching Issues in SQL Server 2005 and Plan Caching in SQL Server 2008, two Microsoft whitepapers which contain an (easily overlooked?) snippet about how the query optimizer decides which set of statistics to use when there are overlapping statistics on a column. The answer, in a nutshell, is "it's complicated". There are guidelines, e.g. preference is given to statistics computed with FULLSCAN, but neither resource goes into any level of depth about how it really works. I'm not a Microsoft insider and I don't have access to the specific rules so my way of thinking is why leave it to chance? If I know I've got only one set of statistics on a column and the query plan goes wonky because of something statistics related I have more control over dealing with it. Otherwise leaving overlapping statistics in place can lead to the odd behavior I've previously blogged about.

I hope the additional information helps.

Wednesday, September 8, 2010

Tuning Tip: Identifying Overlapping Statistics

Statistics are used by SQL Server's query optimizer to help determine the most efficient execution plan for a query. When the option to automatically create statistics is enabled (which it is by default) SQL Server will create statistics on columns used in a query's predicate as necessary, which usually means when statistics don't already exist for the column in question.

Statistics are also created on the key columns of an index when the index is created. SQL Server understands the difference between auto created column statistics and index statistics and maintains both - you can see this by querying the sys.stats system view. As I found out firsthand not too long ago, having both auto created statistics and index statistics on the same column caused the query optimizer to choose a different - and less than optimal - execution plan than when only the index statistics existed.
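You can see both kinds side by side with a quick look at sys.stats (the object name here is a placeholder):

```sql
SELECT name, auto_created
FROM   sys.stats
WHERE  object_id = OBJECT_ID('dbo.Orders');
-- auto_created = 1 rows are the _WA_Sys_... column statistics;
-- auto_created = 0 rows include the statistics created with each index
```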

According to the MSDN article Statistics Used by the Query Optimizer in Microsoft SQL Server 2000 auto created statistics are automatically dropped over time if they are not used but that can take an undetermined amount of time. What if we're experiencing the kind of problem I previously wrote about? In most of the cases I've seen it makes sense to help SQL Server out by manually dropping the auto created statistics in favor of index statistics that exist for the same column.
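Dropping one is a single statement; the statistic name below is a made-up example of the _WA_Sys_ pattern SQL Server uses when naming auto created statistics:

```sql
-- Removes only the auto created column statistic; the index
-- statistics on the same column stay in place
DROP STATISTICS dbo.Orders.[_WA_Sys_00000003_0519C6AF];
```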

Fortunately SQL Server contains everything we need to figure out when column statistics are overlapped by index statistics. The following query identifies overlapped/overlapping statistics and generates the statements you can use to drop the overlapping auto created statistics. All the usual warnings apply here - although this has not caused any problems for me, your mileage may vary, so wield it with an appropriate degree of caution:

WITH autostats ( object_id, stats_id, name, column_id )
AS ( SELECT sys.stats.object_id ,
            sys.stats.stats_id ,
            sys.stats.name ,
            sys.stats_columns.column_id
     FROM   sys.stats
            INNER JOIN sys.stats_columns ON sys.stats.object_id = sys.stats_columns.object_id
                                            AND sys.stats.stats_id = sys.stats_columns.stats_id
     WHERE  sys.stats.auto_created = 1
            AND sys.stats_columns.stats_column_id = 1
   )
SELECT  OBJECT_NAME(sys.stats.object_id) AS [Table] ,
        sys.columns.name AS [Column] ,
        sys.stats.name AS [Overlapped] ,
        autostats.name AS [Overlapping] ,
        'DROP STATISTICS [' + OBJECT_SCHEMA_NAME(sys.stats.object_id)
        + '].[' + OBJECT_NAME(sys.stats.object_id) + '].['
        + autostats.name + ']' AS [DropStatement]
FROM    sys.stats
        INNER JOIN sys.stats_columns ON sys.stats.object_id = sys.stats_columns.object_id
                                        AND sys.stats.stats_id = sys.stats_columns.stats_id
        INNER JOIN autostats ON sys.stats_columns.object_id = autostats.object_id
                                AND sys.stats_columns.column_id = autostats.column_id
        INNER JOIN sys.columns ON sys.stats.object_id = sys.columns.object_id
                                  AND sys.stats_columns.column_id = sys.columns.column_id
WHERE   sys.stats.auto_created = 0
        AND sys.stats_columns.stats_column_id = 1
        AND sys.stats_columns.stats_id != autostats.stats_id
        AND OBJECTPROPERTY(sys.stats.object_id, 'IsMsShipped') = 0

For more information on statistics see Using Statistics to Improve Query Performance in SQL Server Books Online. I also recommend watching Introduction to SQL Server Statistics, a presentation my friend Andy Warren (Blog | Twitter) recently gave for the PASS Performance Virtual Chapter which is available in their Presentation Archive.