Friday, November 30, 2007

War Stories: Bringing Down the Server With a Simple Select Statement

What follows is one of my more horrific war stories. There were many casualties, and I was almost killed in action, but my indomitable fighting spirit spurred me on. That, and I took two days off that resulted in a five-day vacation, since it fell during the week of Thanksgiving.

I was writing a SQL query that would merge users. Since users are connected to numerous tables throughout this particular database, this requires a lot of updating of different tables. Also, since the user table is the table with all the, you know, user data like UserName and UserPassword, that's the table that's queried when a user logs into the website. Easy stuff, right? Obviously, this user table is an extremely important table in the database.

Since I'm having to update all these tables with new user information and later remove old user information, I wisely decided that all of these UPDATE queries need to be wrapped in a transaction. For the uninitiated, when a group of queries is wrapped in a transaction, all of the queries function as one atomic unit. This means that if any one query in the transaction fails, all changes made up to that point in the transaction are rolled back. To the programmer and the end user, it appears as if nothing has happened. Maintaining such data integrity is absolutely imperative where I work, since we don't use test databases and instead directly query the production databases. At the end of my transaction, just before the COMMIT statement, I select from the user table to grab some pertinent information.

So I'm writing my transaction and feeling great about life. I put the transaction in a Try block and put code in the Catch block to rollback the transaction, thus guaranteeing that my transaction will behave correctly. Or so I thought.

As I was testing my shiny new transaction, it fails. No sweat, I think, my catch block just rolls everything back and we're good as gold. Then I notice that I'm receiving email notifications indicating that individuals are not able to login to the website. At the time, I don't make any connection with what I'm working on and this login problem, so I continue working.

For the next forty-five minutes, these login errors keep popping up. No one has said anything to me, nor do the errors indicate that any problems were occurring in the user table, so I continue merrily along, testing my transaction and watching it fail, but thinking nothing of the failure.

Then my boss calls me into his office. I immediately recognize the tone as the "You just seriously f__ked up tone." I walk in and he has the activity monitor pulled up, which shows a stalled transaction that's locked the user table. He asks me what I'm doing, I tell him, and he immediately lays into me, asking why on God's green earth I'm SELECTing in a transaction, and if I absolutely have to, why I'm not using WITH(NOLOCK), which keeps the table from being locked during a read. Not having known that it was possible that my transactions were not being rolled back and further not having known that my SELECT statement was locking the user table, I proceeded to put on my best deer-in-the-headlights look and stammer stupidly. He sent me back to my office and had one of my coworkers join me to explain what had happened.

He was so mad, he did not even chew me out after the initial questioning. It was kind of surreal, as up until that moment in my life, I had never felt so ashamed at not being chewed out. I had messed up that badly. I was so absolutely mortified that I wanted to go hide underneath a rock. This happened around 2 o'clock in the afternoon, and I gave serious thought to just leaving. In retrospect, I'm surprised that my boss didn't send me home. I even began to question whether or not software development was truly my thing.

So what lessons did I glean from this? Several, listed below and in no particular order:

  1. If you're selecting from a heavily-used or mission-critical table, always use WITH(NOLOCK) to keep the read from locking the table.
  2. Do not rely in SQL Server 2005's Try and Catch blocks to work correctly. I still have no idea why my Catch block was never executed. Instead, if using a transaction, you should precede it with SET XACT_ABORT ON and end it with SET XACT_ABORT OFF just after the COMMIT statement. SET ARITHABORT ON ensures that the current transaction, and only the current transaction, is rolled back if any of its contained queries fails.
  3. Use named transactions. If you don't use named transactions and they become nested, say by having a stored procedure that contains a transaction which calls another stored procedure that contains a transaction, then it's possible that either the wrong transaction will be rolled back, or none of them at all.
  4. The activity monitor is your friend! I now keep it open at all times and check it often during testing.
So that was one of my most embarrassing moments in my career. However, it did teach me a lot, and the lessons I learned about transactions will probably never be forgotten.

No comments: