Monday, November 24, 2008

Reasons To Change CommitBatchSize And CommitBatchThreshold

In my last post I showed how CommitBatchSize and CommitBatchThreshold work. Now the question is why would you want to change them? The simple answer is that usually you don’t need to – the defaults work just fine most of the time. But there are a few cases where you may consider adjusting them:

Why you would lower the values

  • Your subscriber experiences a consistently high volume of activity and you want to minimize locking. Think SQL Servers sitting behind public facing web servers. Remember, replication delivers commands to subscribers inside transactions, which take row locks that can lead to blocking. Reducing the number of commands in each transaction will shorten the duration of the locks, but be careful – there’s a fixed overhead to committing a transaction, so the tradeoff of lowering the values is that your subscribers will have to process more transactions.
  • The network between your distributor and subscriber is slow and/or unreliable. Lowering the values results in smaller transactions at the subscriber, so if a network failure occurs there will be fewer commands to roll back and re-apply.

Why you would raise the values

  • You want to increase replication throughput. One example is when you’re pushing changes to a publishing subscriber over a WAN connection and you don’t care about blocking at the subscriber. Raising the values means more commands are included in each transaction at the subscriber, and fewer transactions mean less overhead. Microsoft suggests that “increasing the values twofold to tenfold improved performance by 5 percent for INSERT commands, 10-15 percent for UPDATE commands, and 30 percent for DELETE commands” (take this with a grain of salt though – it was written back in the SQL 2000 days). The thing to watch out for is that at some point system resources at the subscriber (e.g. disk I/O) will limit the benefits of raising the values. Also consider that more commands per transaction means that any failure at the subscriber will take longer to roll back and re-apply.

How much you raise or lower the values depends on a number of factors including: hardware horsepower, bandwidth, and volume of changes being replicated. There’s no one good answer that applies to all scenarios. The best thing to do is change them a small amount at a time and observe the impact – positive or negative. Eventually you’ll find the sweet spot for your environment.
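If you’d rather script the change than click through Replication Monitor, the agent profile stored procedures on the distributor can do it. Treat the following as a rough sketch under a few assumptions: you’ve already created a custom distribution agent profile and assigned it to your agent, the profile_id of 2 is a placeholder you’d look up first, and the distribution agent has to be restarted to pick up the new values. If a parameter isn’t already part of the profile, sp_add_agent_parameter adds it rather than sp_change_agent_parameter changing it.

-- Run on the distributor. Agent type 3 = Distribution Agent; find your
-- custom profile's profile_id in the output and plug it in below.
EXEC msdb.dbo.sp_help_agent_profile @agent_type = 3

-- profile_id = 2 is just a placeholder for your custom profile
EXEC msdb.dbo.sp_change_agent_parameter
    @profile_id = 2,
    @parameter_name = N'CommitBatchSize',
    @parameter_value = N'200'

EXEC msdb.dbo.sp_change_agent_parameter
    @profile_id = 2,
    @parameter_name = N'CommitBatchThreshold',
    @parameter_value = N'2000'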

Thursday, November 20, 2008

How CommitBatchSize And CommitBatchThreshold Affect Replication

Note: This is a long post. I hope it’s worth your while to read.

A common suggestion for optimizing transactional replication performance is to adjust the values in your distribution agent profile for CommitBatchSize and CommitBatchThreshold. Unfortunately, what these two values really do isn’t documented very well anywhere. According to Books Online:

  • CommitBatchSize “Is the number of transactions to be issued to the Subscriber before a COMMIT statement is issued. The default is 100.”
  • CommitBatchThreshold “Is the number of replication commands to be issued to the Subscriber before a COMMIT statement is issued. The default is 1000.”

When you read this you might wonder what the difference is between a transaction and a command. After all, isn’t a command just an autocommit transaction? Does this mean that the commands in a transaction could get broken up into smaller transactions at the subscriber? Won’t that violate ACID? Microsoft went out of their way to make these two distinct values so each one has to influence delivery of commands in some way, right?

I went the cheap route to find out and posted to the forums at SQLServerCentral and to microsoft.public.sqlserver.replication. While I waited for an answer I set up some simple tests to try and figure it out for myself.

Before reading on, I want to test your replication mettle and give you a little quiz. If you want to cut to the chase and see if you’re right, jump to the conclusions at the end of this post to view the answers.

  1. Based on the defaults above, what happens if I issue 1,500 insert\update\delete statements which each affect a single row all contained within an explicit transaction (e.g. BEGIN TRANSACTION, 1,500 updates, then COMMIT)?
  2. What if those 1,500 statements are not part of a transaction?
  3. What happens if I issue three update statements: the first affecting 600 rows, the second affecting 500 rows, and the third affecting 400 rows?

 

Setup

I started by creating a simple table and populating it with 1,000 rows (representative of a large and random sample set):

SET NOCOUNT ON
GO
CREATE TABLE TransReplTest (
    [ID] INT IDENTITY(1,1),
    [SomeDate] DATETIME,
    [SomeValue] VARCHAR(20),
    CONSTRAINT [PK_TransReplTest] PRIMARY KEY CLUSTERED (
        [ID]
    )
)
GO
INSERT INTO TransReplTest (SomeDate, SomeValue)
SELECT GETDATE(), CONVERT(VARCHAR(20), RAND() * 1000)
GO 1000

I created a publication, added a table article to it, and set up a subscription. I created a new distribution agent profile based on the default agent profile and changed only the values for CommitBatchSize and CommitBatchThreshold. Each time I changed their values I stopped and started the distribution agent to ensure I was using the new value. Finally, I set up a profiler session capturing the following events on the subscriber: “RPC:Completed”, “SQL:BatchCompleted”, and “SQL:BatchStarting”. I captured some extra noise so for the sake of readability I saved my profiler sessions and filtered them for each test afterwards.
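For the record, the publication setup was nothing fancy. Something along these lines gets you there – treat it as a minimal sketch only, since the publication, subscriber, and database names are placeholders and it leaves out the distributor setup and the snapshot agent:

-- Run in the publication database; all names below are placeholders
EXEC sp_addpublication
    @publication = N'TransReplTestPub',
    @status = N'active'

EXEC sp_addarticle
    @publication = N'TransReplTestPub',
    @article = N'TransReplTest',
    @source_object = N'TransReplTest'

-- Push subscription delivered by the distribution agent we're profiling
EXEC sp_addsubscription
    @publication = N'TransReplTestPub',
    @subscriber = N'SUBSCRIBER01',
    @destination_db = N'TransReplTestSub',
    @subscription_type = N'push'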

Test 1

To make things easier to understand I set the values low, like this:

CommitBatchSize: 3
CommitBatchThreshold: 5

First I ran this statement which randomly updates 10 rows, one at a time, all contained within a transaction:

BEGIN TRANSACTION

DECLARE @rownum TINYINT
SET @rownum = 0

WHILE @rownum < 10
BEGIN

    UPDATE TransReplTest
    SET SomeDate = GETDATE()
    WHERE ID IN (
        SELECT TOP 1 ID
        FROM TransReplTest
        ORDER BY NEWID()
    )

    SET @rownum = @rownum + 1
END

COMMIT

Profiler shows 1 transaction with 10 updates. CommitBatchThreshold didn’t appear to affect anything here.


When I issue the same statement without the explicit transaction Profiler shows 4 transactions of 3 updates, 3 updates, 3 updates, and 1 update. CommitBatchSize causes every 3 statements (each of which is an autocommit transaction) to be grouped together in a transaction on the subscriber.


Next I issued a single update statement affecting 10 rows:

UPDATE TransReplTest
SET SomeDate = GETDATE()
WHERE ID IN (
    SELECT TOP 10 ID
    FROM TransReplTest
    ORDER BY NEWID()
)

Profiler shows 1 transaction with 10 updates at the subscriber. Again CommitBatchThreshold did not affect anything here.


Finally, I issued an update statement affecting 3 rows 3 times:

UPDATE TransReplTest
SET SomeDate = GETDATE()
WHERE ID IN (
    SELECT TOP 3 ID
    FROM TransReplTest
    ORDER BY NEWID()
)
GO 3

This time Profiler shows 2 transactions; the first contains 6 updates and the second contains 3 updates. Even though CommitBatchSize was set to 3, the number of rows affected by the first two statements exceeded CommitBatchThreshold (set to 5) and so the third statement was put into its own transaction.


Test 2

Let’s see what happens if we switch the values around and run through the same statements as Test 1:

CommitBatchSize: 5
CommitBatchThreshold: 3

Randomly update 10 rows in a single transaction. We still see 1 transaction with 10 updates.


Next, the same statement without the transaction. This time we see 4 transactions with updates in batches of 4, 1, 4, and 1. Here CommitBatchThreshold set to 3 was the reason we saw the smaller batches. But wait – shouldn’t a CommitBatchThreshold of 3 give us 3 transactions with 3 updates each and 1 transaction with 1 update?


Next, the single update statement affecting 10 rows. Profiler shows 1 transaction with 10 updates. CommitBatchThreshold made no difference here.


Finally, updating 3 rows 3 times. Profiler shows 1 transaction with 6 updates and 1 transaction with 3 updates. Just like in Test 1, even though CommitBatchThreshold was set to 3, the number of rows affected by the first two statements exceeded CommitBatchSize (set to 5) and so the third statement was put into its own transaction.


Conclusions

Based on the profiler traces, here’s what we can conclude:

  • Replication delivers changes to subscribers one row at a time; a single update statement which changes 1,000 rows at the publisher will result in 1,000 statements at the subscriber.
  • Transactions at the publisher are kept intact at the subscriber. If a million rows are affected inside a transaction – either explicit or autocommit – on the publisher, you can count on a million statements being delivered to the subscriber in one explicit transaction. We saw this in both the statement which updated 10 rows one at a time in an explicit transaction and the statement which updated 10 rows in one shot.
  • The lower of the two values for CommitBatchSize and CommitBatchThreshold is the primary driver for how statements are delivered to subscribers. When the number of transactions on the publisher reaches that lower value the distribution agent will apply them in a single explicit transaction at the subscriber. We saw this in the statement which updated 10 rows one at a time (not in a transaction).
  • The higher of the two values can cause commands to be delivered before the lower value is met. If the number of rows affected at the publisher reaches the higher value the distribution agent will immediately apply the changes in a single explicit transaction at the subscriber. We saw this in the statements which updated 3 rows 3 times.

OK, now that we know how CommitBatchSize and CommitBatchThreshold work, let’s look at the answers to the quiz:

  1. Based on the defaults above, what happens if I issue 1,500 insert\update\delete statements which each affect a single row all contained within an explicit transaction (e.g. BEGIN TRANSACTION, 1,500 updates, then COMMIT)?
    Answer: Transactions at the publisher are honored at the subscriber so we’ll see one explicit transaction applied at the subscriber which contains 1,500 updates.
     
  2. What if those 1,500 statements are not part of a transaction?
    Answer: Because CommitBatchSize is set to 100 we’ll see 15 transactions at the subscriber, each containing 100 statements which affect one row per statement.
     
  3. What happens if I issue three update statements: the first affecting 600 rows, the second affecting 500 rows, and the third affecting 400 rows?
    Answer: Because CommitBatchThreshold is set to 1,000 we’ll see 2 transactions at the subscriber. The first transaction will contain 1,100 statements and the second transaction will contain 400 statements.

BTW, remember that one weird case where CommitBatchSize was set to 5, CommitBatchThreshold was set to 3, and when we updated 10 rows one at a time we saw 4 transactions with updates in batches of 4, 1, 4, and 1? I think that’s a bug in SQL Server. It looks like the distribution agent alternates between placing (CommitBatchThreshold + 1) and (CommitBatchThreshold - 2) commands into each transaction delivered to the subscriber. Since this only appears to happen when CommitBatchSize is higher than CommitBatchThreshold – and most people don’t change the values to work that way – this seems relatively insignificant…but still not the behavior that I expected to see.

P.S. – I did get a response to my usenet posting from MVPs Hilary Cotter and Paul Ibison. You can read it here. Gopal Ashok, a program manager in the SQL Replication team, confirmed that what I thought was a bug was indeed a bug. Yay me!

Monday, November 17, 2008

Two Mistakes

Two Mistakes was a post started by Chris Shaw that’s been making the rounds of SQL bloggers as each poster challenges others to write about their mistakes. There’s a Chinese proverb that says “Even a wise man sometimes makes a mistake”. Andy Warren seems to think I’ve made a lot of mistakes and tagged me in his post. He must think I’m a pretty wise man.

Mistake #1:

I inherited this mistake, but since I didn’t correct it before the fit hit the shan it’s on me too. About a year and a half into my first job out of college I inherited the role of IT Manager after the person who had the job was let go (it was a small company, obviously). One of the first things I did was to make sure we were backing up our databases. We were, but there was something looming that I was too naive to know about at the time. We only had one SQL Server – a Dell connected to a PowerVault DASD array. Problem #1 was that it was configured as RAID 5 and for about six months there was a failed drive that no one noticed. Problem #2 was that the backups were written to a partition on the PowerVault. I think you can see where this one’s going – about two weeks after I took over we lost a 2nd drive in the RAID 5 array and the server took a nose dive. Worse yet, it happened while I was out of town and the junior guy was on call (if I was a year and a half out of college at the time, imagine how junior the junior guy was!). It cost us a LOT of money to work with a drive recovery company to get some of the data back, plus a week of lost sleep and red-eye flights to the datacenter to rebuild the hardware. I also learned for the first time what it’s like to spend a night on a concrete bench in the Atlanta airport. Oh, and we had to explain to all of our clients why they were down for a week.

What I learned: Don’t store your backups on the device you’re backing up. Monitor your hardware for drive failures. If you inherit something, don’t ever assume the person before you did it right; look for the skeletons in the closet and fix them ASAP. Bring a travel pillow with you whenever you go out of town.

Mistake #2:

At one of my previous jobs I was on a team of people responsible for the daily care and feeding of our software in production. Our responsibilities included database administration, release management, manual intervention for tasks that hadn’t been written into the software yet, and occasionally fighting fires when something unexpected came up. We used MSMQ pretty extensively, and during a release our release manager pushed a configuration file that pointed the software at a non-existent MSMQ server. It only took us about an hour to realize something was up because there were dollars tied to those messages and the numbers were down from where they should have been. We took the divide and conquer approach to fixing things and somewhere along the way two servers were left with the bad config. This was in the days of Windows 2000 which does this funny thing with MSMQ where an outbound queue will accept messages without error but after the outbound queue size hits 2 GB it will just send those messages into the ether. No errors, no warnings, nada. So sure enough, those two servers with the bad config hit their 2 GB limit and happily threw away everything after that. We never realized a thing because it was two servers out of eighty and we were busy looking at the big picture rather than paying attention to the details, and of course we didn’t have anything monitoring MSMQ outbound message counts. When revenue came back to around what we expected no one double checked that every server was contributing an equal amount. Needless to say, it was some time before we figured out that those two servers still had a problem, and by the time we did it was too late to recover the messages – and the dollars – that were lost.

What I learned: Good communication is vital for success. Pay attention to the details! If you fix a problem, be skeptical - check, double-check, and triple-check that you really fixed it. Never underestimate the importance of monitoring, or the costs associated with not doing so.

 

Time to pass it along…Jon Kehayias, you’re it!

Tuesday, November 11, 2008

Delegation: More On Service Principal Names

I’m adding to my previous post about delegation to provide more information about Service Principal Names. Recall that you can add SPNs using setspn with the -A switch. Once you’ve run the command to add the SPN you’ll want a way to verify that it took. Normally you can do this by calling setspn with the -L switch and a target computer, which will list all the registered SPNs for that computer, but in my experience I have seen custom SPNs excluded from this list. Fortunately there’s a handy VBScript available on TechNet that will query Active Directory and does return a complete list (link below).

If you have verified that your SPN exists but are experiencing problems using it, you can reset all SPNs for a computer using the -R switch. Finally, if you need to delete a SPN you can use the -D switch. Running setspn without any arguments will output a list of all the available switches and the syntax for using them.
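To put the switches together, the commands look roughly like this. Treat these as a sketch – HQ-DBADMIN01 is a placeholder computer name, the SPN and account are the ones from my previous post, and -D expects the SPN followed by the account it was registered to:

setspn -L HQ-DBADMIN01
setspn -R HQ-DBADMIN01
setspn -D MSSQLSvc/hq-dbadmin01.mydomain.com:1433 MYDOMAIN\sql.service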

DOWNLOAD: SPN Query utility from TechNet

Monday, November 10, 2008

Delegation: What It Is And How To Set It Up

The Problem

You’re a security conscious DBA so you follow best practices; you run your SQL services as a domain account and have all your users and applications using Windows authentication. Then one of your developers comes to you and says he’s getting this error running a query that uses a linked server:

Msg 18456, Level 14, State 1, Line 1
Login failed for user 'NT AUTHORITY\ANONYMOUS LOGON'.

What they’re likely seeing is the result of something known as the “double-hop”: a connection from Machine A to Machine B asks Machine B to access something on Machine C using the credentials originally provided by Machine A. In this scenario, Machine B using the credentials from Machine A is known as impersonation, and Machine B’s ability to use impersonation to connect to Machine C is called delegation. The errors occur when delegation has not been properly configured. There are a couple of ways this may manifest itself (a quick sketch of the linked server case follows the list):

  • Bulk inserting a file from a UNC share on another computer (e.g. \\servername\share\file.bcp)
  • An Ad-hoc distributed query using a trusted connection
  • Using a linked server that’s been configured with a trusted connection
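To make that last case concrete, here’s a rough sketch of a query that trips over the double-hop. The linked server, database, and table names are made up; the assumption is that LINKEDSRV on Machine B points at Machine C and passes along the caller’s Windows credentials:

-- Run from Machine A while connected to Machine B
-- LINKEDSRV points at Machine C and uses a trusted connection
SELECT TOP 10 *
FROM LINKEDSRV.SomeDatabase.dbo.SomeTable
-- Without delegation configured this fails with:
-- Login failed for user 'NT AUTHORITY\ANONYMOUS LOGON'.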

 

The Solution

There are two ways to solve this problem – either use SQL authentication or configure delegation. Since this post is about fixing the double hop problem in a Windows auth environment I’ll save the debate on authentication methods for another time and focus on what it takes to get delegation working. Microsoft has a nice write-up on how to configure delegation on MSDN. I’ve added to it by providing screenshots and my own comments.

To use delegation, all computers must use Active Directory and have Kerberos support enabled (Kerberos is the default authentication method in Windows 2000 and up, so this usually isn’t an issue). The following options in Active Directory must also be specified (which you get to through the Active Directory Users and Computers MMC snapin):

  • The Account is sensitive and cannot be delegated check box must not be selected for the user requesting delegation.


 

  • The Account is trusted for delegation check box must be selected for the service account of SQL Server. This is a bit outdated; for Windows Server 2003 there’s a Delegation tab where you can choose the different levels of trust for delegation (note that the tab doesn’t show up until SQL Server is started using that account). For those ultra-tight on security you can specify services on a computer-by-computer basis…for the slightly more relaxed there’s the “any service” option.


  • The Computer is trusted for delegation check box must be selected for the server running an instance of SQL Server. Same deal as the previous setting – there’s a tab for this with different options depending on how tight you need to control security.


 

Next, your SQL Server service must have a Service Principal Name (SPN). This is essentially a unique name in Active Directory for an instance of a service on a computer. When SQL is run under a domain account a SPN needs to be manually created using the setspn utility found in the Windows 2003 Support Tools. The command for this utility amounts to something like this:

setspn -A MSSQLSvc/[FULLY QUALIFIED DOMAIN NAME]:[PORT] [DOMAIN]\[LOGIN]

For example:

setspn -A MSSQLSvc/hq-dbadmin01.mydomain.com:1433 MYDOMAIN\sql.service

 

Finally, there are a few things to do to make it all work:

  • You have to restart your SQL Service after creating the SPN.
  • For distributed queries or queries using a linked server where both servers are running as a domain account, you must create a SPN for both servers.
  • You must use Kerberos for all of this to work. To check if you’re using Kerberos, after you’ve gone through all the steps above, make a new connection and run the query below. It should return “KERBEROS” if everything’s set up right:

SELECT auth_scheme FROM sys.dm_exec_connections WHERE session_id = @@SPID

 

VISIT: Administering SQL Server (SQL Server 2000): Security Account Delegation
VISIT: MSDN Library: Service Principal Names
VISIT: Wikipedia: Kerberos
DOWNLOAD: Windows Server 2003 Service Pack 1 32-bit Support Tools