Like a lot of other companies, we use imaging software to get new\repurposed machines up and running quickly. We recently switched to a new imaging product and it seemed to work well. We used it on a couple of SQL Servers that made their way into production and shortly after they went live we noticed intermittent errors with clients trying to connect using Windows authentication that looked like this:
Login failed for user ''. The user is not associated with a trusted SQL Server connection.
This usually happened a few times a day for a couple of minutes and then the system self-corrected. We also found that right before the login failures occurred was this entry in the system event log:
Event Type: Error
Event Source: NETLOGON
Event ID: 5719
This computer was not able to set up a secure session with a domain controller in domain [OUR DOMAIN] due to the following:
Not enough storage is available to process this command.
This may lead to authentication problems. Make sure that this computer is connected to the network. If the problem persists, please contact your domain administrator.
So the NETLOGON service was failing and killing SQL’s ability to authenticate users. While digging through the logs from each image looking for a possible clue one of my coworkers noticed a single line buried deep within that said something to the effect of “SID change failed”.
SID stands for “Security Identifier” and is a unique alpha character string that’s used in Windows and Active Directory as the basis for identifying computers and objects in the operating system (read more about SIDs here). Usually when you apply an image you change the SID to make your server unique, thereby avoiding potential security problems (see Microsoft KB article 162001), only in our case the imaging software failed to do so and as a result we had several machines on the network that had duplicate SIDs.
One Solution…And The Problem It Causes
One solution to the dupe SID problem is to use the NewSID utility from Sysinternals to create a new SID for the affected servers. But here’s the catch: Local accounts, groups, object names, etc. in the OS are based on the machine SID. Change the machine SID and every other SID gets changed too. During installation SQL Server creates a set of local machine groups (e.g. MACHINE\SQLServer2005MSSQLUser$MACHINE$MSSQLSERVER) that are granted rights throughout the OS and in SQL Server so that the services can run. SQL Server stores the logins - and SID - for each of those groups in
master.dbo.syslogins. NewSID wasn’t written with SQL awareness so after using it to change the machine SID (which in turn changes the SID for each local group) you’re going to be left with the old group SIDs in SQL, new group SIDs in the OS, a SQL service that won’t start, and probably a lot of people wondering “WTF when is the server going to be back up?”
A Solution To The Problem Caused By The Solution
I’ve been through the SID change exercise a few times now so I’ll share with you what to do if you need to change the SID on one of your SQL severs.
0. Find out what Service Pack and Cumulative Update your SQL service is running using the statement below. Get a copy of it and keep it handy because you’ll need it later.
SELECT SERVERPROPERTY('productversion'), SERVERPROPERTY ('productlevel'), SERVERPROPERTY ('edition')
1. Change SQL service and SQL Agent service to run as Local Service (or Local System) and manual startup.
2. Remove server from domain. Reboot.
3. Change SID using NewSID (note: record your old SID before changing it in case you have to roll back!). Reboot.
4. Your SQL service is now broken by the SID change. Here’s where it gets fun. The SQL service will start but the SQL Agent service won’t and a couple of other components like DB Mail will throw permissions errors that show up in the SQL error log. To get things back on track resynchronize what’s in
master.dbo.syslogins with the new local group SIDs by dropping and recreating the SQL logins (and associated permissions) for the groups. First, using Object Explorer script out the drop and create statements for each local group login. Then use this script I posted previously to script the statements which grant all the necessary server level permissions to each group. Cut and paste the statements that apply to the local groups into your working script. What you’re going to execute for each group will look something like this:
DROP LOGIN [MACHINE\SQLServer2005MSSQLUser$MACHINE$MSSQLSERVER]
CREATE LOGIN [MACHINE\SQLServer2005MSSQLUser$MACHINE$MSSQLSERVER] FROM WINDOWS WITH DEFAULT_DATABASE=[master], DEFAULT_LANGUAGE=[us_english]
EXEC sp_addsrvrolemember @rolename = 'sysadmin', @loginame = 'SQL2300\SQLServer2005MSSQLUser$MACHINE$MSSQLSERVER'
GRANT CONNECT SQL TO [MACHINE\SQLServer2005MSSQLUser$MACHINE$MSSQLSERVER]
Execute your script and restart the SQL and SQL Agent services to get them back up and running.
5. Join the server back into the domain. Reboot.
6. When you use Configuration Manager to try and change your services to run under a domain account again you’ll get this WMI error:
To fix the WMI error simply reapply a Service Pack\Cumulative Update that is the same or newer than the SQL service is already running at. But…when you try to do the install it will fail. Dig through the install log in C:\Program Files\Microsoft SQL Server\90\Setup Bootstrap\LOG\Hotfix and you’ll find this error:
MSP Error: 29528 The setup has encountered an unexpected error while Setting Internal Properties. The error is: Fatal error during installation.
Basically the problem is that old SIDs are stored in the registry and need to be deleted. Check out Microsoft KB article 925976 for details on what to do. Once you’ve deleted the registry entries you can proceed with reinstalling the current SP\CU. Reboot. Now you should be able to use Configuration Manager to change your services to run under a different account.
At this point everything should be running again; the problems caused by changing the SID are resolved and the login errors from the dupe SID should also go away. *PHEW*!
Although I was able to reproduce and fix the problem several times your installation might have some nuances that mine doesn’t, so consider these steps as more of a guide than a hard and fast rule. Although I used NewSID there are other SID changing utilities out there that might create the same problem. As always, test this out in a non-production environment first!