Friday, February 12, 2010

A Tale Of Two Execution Plans – Solved!

Earlier this week I wrote about a perplexing problem I was having where identical servers were producing different execution plans for the same ad-hoc query. I got a lot of great feedback, and Wes Brown (Blog | Twitter) even contacted Conor Cunningham (Blog), query optimizer team lead, for me. But it was this suggestion from Paul Randal (Blog | Twitter) and an IM from Jonathan Kehayias (Blog | Twitter) that helped me uncover what was going on:


Quick Recap
Recall that my problem started out like this: A simple ad-hoc query executed against 9 servers with (more or less) identical hardware, identical schema, identical data, and identical index statistics produced different execution plans. I looked at a variety of things to try and figure it out: SET options on the connection, collations, compatibility levels, server configuration options, build versions, and index statistics. Ultimately everyone who looked at it figured that the root cause of the problem was probably related to this: In Plan "A" the estimated number of rows for an index seek differed from Plan "B", and that in turn lead the query optimizer to choose an index seek + key lookup vs. an index scan on a different table later in the execution plan.

Plan "A" Plan "B"

Doesn't That Look Like A Problem With Statistics?
Yes, you'd think that…except that I updated statistics for the index in question (well, for every index on the table actually) with FULLSCAN specified. That should have taken care of the problem, right? Not exactly. One thing I didn't think about was that AUTO_CREATE_STATISTICS was set to ON for every database (read more about this option here). With this option on the query optimizer will create statistics on individual columns, as necessary, to improve estimates for query plans. You can see which columns have had statistics created for them by using sp_helpstats (BOL entry here). 

A Eureka Moment
A check against the 9 servers revealed that the 4 servers currently with plan "B" had statistics auto created for the column used in the query's join and the 5 servers with plan "A" did not. I picked one of the servers using plan "B", dropped the auto created statistics, and…the query optimizer was now picking plan "A"! I did the same against every server using plan "B", and every server switched to using plan "A". Ahh, I love repeatability! (Note that dropping auto created statistics isn't something you want to casually do – the query optimizer created them for a reason, after all)

So The Answer Is…
In the end, it had nothing to do with build versions, slight variations in CPU models, fragmentation, SET options, server options, or any of the multitude of things I looked at…it was the auto created column statistics that affected which plan the query optimizer choose. The takeaway? Add that as one of the things to check when a query isn't behaving like you think it should.

Thanks again to Paul Randal and Jonathan Kehayias for pointing me in the right direction. I can once again sleep at night now knowing that there's a reasonable explanation for what was happening.

Post a Comment