A Tale Of Two Execution Plans - Solved!

<div>Earlier this week I wrote about a perplexing problem I was having where identical servers were producing different execution plans for the same ad-hoc query. I got a lot of great feedback, and Wes Brown (Blog | Twitter) even contacted Conor Cunningham (Blog), query optimizer team lead, for me. But it was this suggestion from Paul Randal (Blog | Twitter) and an IM from Jonathan Kehayias (Blog | Twitter) that helped me uncover what was going on: <img alt="image" border="0" height="98" src="../../_nNmzpgOs8Bg/S3V-OhaVKsI/AAAAAAAABn4/6SF69C61dxI/image[41]_imgmax-800.png" style="border-bottom: 0px; border-left: 0px; display: block; float: none; margin-left: auto; border-top: 0px; margin-right: auto; border-right: 0px" title="image" width="244"/> Quick Recap Recall that my problem started out like this: A simple ad-hoc query executed against 9 servers with (more or less) identical hardware, identical schema, identical data, and identical index statistics produced different execution plans. I looked at a variety of things to try and figure it out: SET options on the connection, collations, compatibility levels, server configuration options, build versions, and index statistics. Ultimately everyone who looked at it figured that the root cause of the problem was probably related to this: In Plan "A" the estimated number of rows for an index seek differed from Plan "B", and that in turn lead the query optimizer to choose an index seek + key lookup vs. an index scan on a different table later in the execution plan. <img alt="Plan "A"" border="0" height="491" src="../../_nNmzpgOs8Bg/S3V-OyvvV1I/AAAAAAAABn8/nww1t55DhOc/image[34]_imgmax-800.png" style="border-bottom: 0px; border-left: 0px; display: inline; border-top: 0px; border-right: 0px" title="Plan "A"" width="315"/> <img alt="Plan "B"" border="0" height="491" src="../../_nNmzpgOs8Bg/S3V-POwI9gI/AAAAAAAABoA/AxKxDBa_XjY/image[31]_imgmax-800.png" style="border-bottom: 0px; border-left: 0px; display: inline; border-top: 0px; border-right: 0px" title="Plan "B"" width="314"/> Doesn't That Look Like A Problem With Statistics? Yes, you'd think that…except that I updated statistics for the index in question (well, for every index on the table actually) with FULLSCAN specified. That should have taken care of the problem, right? Not exactly. One thing I didn't think about was that AUTO_CREATE_STATISTICS was set to ON for every database (read more about this option <a href="http://msdn.microsoft.com/en-us/library/ms190397.aspx" target="_blank">here</a>). With this option on the query optimizer will create statistics on individual columns, as necessary, to improve estimates for query plans. You can see which columns have had statistics created for them by using sp_helpstats (<a href="http://msdn.microsoft.com/en-us/library/ms173416.aspx" target="_blank">BOL entry here</a>).  A Eureka Moment A check against the 9 servers revealed that the 4 servers currently with plan "B" had statistics auto created for the column used in the query's join and the 5 servers with plan "A" did not. I picked one of the servers using plan "B", dropped the auto created statistics, and…the query optimizer was now picking plan "A"! I did the same against every server using plan "B", and every server switched to using plan "A". Ahh, I love repeatability! (Note that dropping auto created statistics isn't something you want to casually do - the query optimizer created them for a reason, after all) So The Answer Is… In the end, it had nothing to do with build versions, slight variations in CPU models, fragmentation, SET options, server options, or any of the multitude of things I looked at…it was the auto created column statistics that affected which plan the query optimizer choose. The takeaway? Add that as one of the things to check when a query isn't behaving like you think it should. Thanks again to Paul Randal and Jonathan Kehayias for pointing me in the right direction. I can once again sleep at night now knowing that there's a reasonable explanation for what was happening. </div>

7 comments

Stephen said...: So then why did some servers create the stats on that column, while others didn't? Doesn't tracking this down become your next issue?; February 12, 2010 at 11:42 AM
Merrill Aldrich said...: Great sleuthing, great post. Thanks for sharing.; February 12, 2010 at 2:00 PM
Ralph B said...: Nice find Kendall!; February 12, 2010 at 6:23 PM
オテモヤン said...: This comment has been removed by a blog administrator.; February 13, 2010 at 9:33 AM
Anonymous said...: Nice post. Glad to hear you came to some explanation. Curious, did the comment on your original post about the query optimizer terminating early and not completing its optimization end up having anything to do with it?; February 16, 2010 at 10:20 AM
Andy Warren said...: Good follow up, though seems like still a piece or two of mystery remaining. And didn't Jack and I ask if the stats match??!; February 16, 2010 at 6:33 PM
Phil said...: Well, I've tried to reproduce this numerous times, and get very inconsistent results. It's not difficult to get the optimiser to choose a non-optimal plan with both sets of stats in place, but dropping the column stats doesn't always make it revert to the good plan. Also, running UPDATE STATISTICS almost always seems to resolve the "baad-plan" issue too.

Just once, I saw behaviour that seemed to mirror what you describe in your article, but I've only managed it once, even when running a single test script for repeatability and consistency.

Furthermore, I see very inconsistent results when trying to get the column stats to update automatically. Even following large table updates (5,000 rows - 50% of the table), the auto stats remain unchanged; September 17, 2011 at 8:24 PM

Kendal Van Dyke

7 comments

Worth checking out

Popular Posts

Labels

What I'm Saying On Twitter

Kendal Van Dyke

Share This

Related Posts

7 comments

Worth checking out

Popular Posts

Labels

What I'm Saying On Twitter