Wednesday, July 16, 2008

Troubleshooting 101

Troubleshooting and/or debugging is a large part of our job as geeks.  Bad code, misbehaving apps, finicky servers: we have to juggle all that and more.  Trying to wade through complex systems is never fun, particularly when you're under stress for a delivery or trying to get a production server back up.

I've had a number of years' experience troubleshooting different systems, both hardware and software.  I do not claim to be a troubleshooting ninja, but I've picked up a number of solid lessons over time.  This post won't be an in-depth treatise on using the debugger or some obscure tools; rather, it's about higher-level approaches to help cut your churn.

First off, get serious and go buy a copy of David Agans' Debugging. This book distills Agans' amazing skills into a highly readable, awesome guide.  Everything I have to say pales in comparison.  Go, buy it right now.  I'll wait.

Now that you've ordered that book (and gotten me at least $0.07 in referral fees, thanks) here are some things I find highly useful for my approach to troubleshooting.  The items aren't listed in any particular order, just a stream of consciousness.  Let's start off with two things which require an incredible amount of self-discipline.


Time box your efforts, right from the start.  Time boxing is perhaps the hardest thing for us geeks to deal with.  "Just five more minutes and I know I'll find the answer!"  Yeah, you said that yesterday morning, dumbass, and now it's 4:30pm the day before delivery. (That would be me, talking to me.)

It sounds silly, but get yourself an egg timer. Honestly. The first step you take before doing anything on a non-trivial problem should be to set yourself a time box.  "I will work on this data transfer issue for no more than one hour before stepping back and reassessing."   Work until that timer goes off and then stop!  Step back, re-evaluate the problem, and look for someone to bounce your assumptions and theories off of.


Sometimes you can get yourself wrapped around too many red herrings and lose sight of the forest for the trees, to mix metaphors in a really ugly way.  You have to discipline yourself to take a break from the issue, then come back to it and look at things again.  Our office has a hallway that runs in a loop around it.  Now folks who work in the office with me may know why I wander that hallway on frequent occasions mumbling like some deranged bag lady.

When you get back from that break, look over the forensic evidence you've gathered, the assumptions you've made, and any bits you've managed to winkle out about the problem.

Asking for help is uncool.  Folks will begin to know how big a putz I really am if I ask for help.  No, they won't, at least they won't if you've made some basic effort before reaching out.  You may not need anything more than a body to repeat your problem statement and assumptions to. 

Some book I read at some point had a funny story about a high-level dev in a shop who was plagued by folks busting into his office to get his advice on problems they'd obviously not thought out first.  The constant interruptions crushed his productivity on tasks he was responsible for, and the other devs didn't grow their own skills.  The lead took to putting a stuffed bear in a chair in his office.  Devs would come into his office and his first comment to them would be "Tell it to the bear."  The devs, simply by verbally stating the problem, their assumptions, and theories, would often solve the problem themselves.  Not that I'm telling you to go get a bear and start talking to it at work, but you get the idea.  Bounce ideas off someone, even if it's someone outside your domain.


M as in Messages. Error messages. Dialog boxes. Log files. Console output. Read them carefully. If it's late at night or after a long session, consider re-writing the messages out on paper. Seriously.  I can't count the number of hours I've lost because I blew over a message and let my (bad) assumptions con me into reading something the message didn't say.

A good buddy told me he made one of his mentees read error boxes to him aloud and verbatim.  Prior to that the mentee had a bad habit of not taking initial steps himself, or not catching important details which were clearly displayed on the screen in front of him.  A couple weeks of reading those messages aloud finally kicked the mentee into taking more initial action himself.  (My pal's much more patient than I.  I'd have lost it after a couple days of trying to prod the guy into doing that...)


Look at your system as a pipe of data.  Break that system into halves.  Probe at the halfway point and figure out if your data is good or bad at that point.  Continue breaking the remainder in half until you isolate something you can get your hands around.
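
Here's a rough sketch of that halving idea in Python.  Everything in it is made up for illustration: the three stages (`parse`, `price`, `label`), the `is_good` check, and the planted bug.  It also assumes the input starts out good and that once data goes bad it stays bad downstream, which is what makes the halving valid.

```python
from typing import Any, Callable, List, Optional

Stage = Callable[[Any], Any]


def run_prefix(stages: List[Stage], data: Any, count: int) -> Any:
    """Push data through the first `count` stages and return what comes out."""
    for stage in stages[:count]:
        data = stage(data)
    return data


def first_bad_stage(stages: List[Stage], is_good: Callable[[Any], bool],
                    data: Any) -> Optional[int]:
    """Return the index of the first stage whose output fails is_good, or None."""
    lo, hi = 0, len(stages)  # the first bad stage, if any, lies in [lo, hi)
    while lo < hi:
        mid = (lo + hi) // 2
        if is_good(run_prefix(stages, data, mid + 1)):
            lo = mid + 1  # good at the probe point: the fault is downstream
        else:
            hi = mid      # bad at the probe point: the fault is here or upstream
    return lo if lo < len(stages) else None


# Hypothetical pipeline with a planted bug in the middle stage.
def parse(rec: dict) -> dict:
    return {**rec, "qty": int(rec["qty"])}


def price(rec: dict) -> dict:
    return {**rec, "total": rec["qty"] * -2.5}  # deliberate bug: negative price


def label(rec: dict) -> dict:
    return {**rec, "label": f"order {rec['id']}"}


def is_good(rec: dict) -> bool:
    # A record is "good" as long as it has no negative total.
    return rec.get("total", 0) >= 0


stages = [parse, price, label]
culprit = first_bad_stage(stages, is_good, {"id": 7, "qty": "3"})
```

With the planted bug, `culprit` comes back as `1`, the index of `price`.  In a real system the "probe" is more likely a log line, a breakpoint, or a packet capture than a function call, but the narrowing works the same way.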


It doesn't matter if you do test driven development or not.  Use tests (integration, unit, whatever) to drive your system as you're narrowing down the problem.  Don't waste your time trying to input data to a web form and try to catch data on the other side.  Write an integration or unit test to stimulate parts of your system and work from there.  You're locking down valid areas of the system while beginning to isolate the part that's borked.  You get the added benefit of deepening your coverage of the system.  (I'm not talking just code coverage here, but effective testing of your code.  Two different things.)
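
A minimal sketch of what I mean, using Python's built-in `unittest`.  The `normalize_email` function is a hypothetical stand-in for whatever small piece of your system is under suspicion; the point is that the test pokes it directly instead of going through the UI.

```python
import unittest


def normalize_email(raw: str) -> str:
    """Hypothetical stand-in for one small piece of the system under suspicion."""
    return raw.strip().lower()


class NormalizeEmailTests(unittest.TestCase):
    def test_strips_whitespace_and_lowercases(self):
        self.assertEqual(normalize_email("  Jim@Example.COM \n"), "jim@example.com")

    def test_already_clean_value_is_unchanged(self):
        self.assertEqual(normalize_email("jim@example.com"), "jim@example.com")


def run_tests() -> unittest.TestResult:
    """Load and run the test case, returning the result object for inspection."""
    suite = unittest.defaultTestLoader.loadTestsFromTestCase(NormalizeEmailTests)
    return unittest.TextTestRunner(verbosity=0).run(suite)
```

Each test that passes fences off a part of the system you can stop worrying about, and the tests stick around after the bug is gone.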


I use the Jot function of SlickRun as a scratch pad for writing down bits and pieces of more problematic problems I'm having problems with.  Use a whiteboard.  Lay out the pipeline of the system and break it in half.  Note your inputs, outputs, and areas of concern.  Visualizing a problematic system can be a great help.


This is last, but it's likely the third-most important tip next to time boxing and stepping back.  Don't just jump into code or whip open a telnet session to your misbehaving server.  Take a moment to look at your assumptions once more and figure out a plan of attack.  Write it down or verbalize it to see if your plan's a solid one:

"OK, so the object coming across the wire from the server to the client is borked when it's saved.  It's showing client local time instead of server time.  I will cut the data pipe in half and start looking at the data on the client side of the web service.  I'll check to see if the correct server time is coming in the data transfer object.  If that's correct then I've isolated it to client side.  If it's not I'll go look at the server side and go from there.  I'll do that first check by writing an integration test to call the web service and get one of the DTOs and validate its timestamp."

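That first check might look something like the sketch below.  It's Python rather than the .NET stack from the story, and it leans on assumptions: `fetch_order_dto` is a hypothetical stand-in for the real web service call, the `saved_at` field name is invented, and "server time" is assumed to mean UTC.

```python
from datetime import datetime, timedelta


def fetch_order_dto() -> dict:
    """Hypothetical stand-in for the web service call; swap in the real client."""
    # Simulates the server sending its time as an ISO 8601 string in UTC.
    return {"id": 42, "saved_at": "2008-07-16T14:30:00+00:00"}


def timestamp_is_server_time(dto: dict) -> bool:
    """True when the DTO timestamp carries the server's (assumed UTC) offset."""
    saved_at = datetime.fromisoformat(dto["saved_at"])
    # A naive timestamp has no offset at all (utcoffset() is None), so it fails too.
    return saved_at.utcoffset() == timedelta(0)
```

If the check passes on what comes off the wire, the server half is off the hook and the client side is where to dig next.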

Assumptions are a tricky thing.  You need to make some assumptions as you begin chasing the issue, but you may have made wrong ones and may end up chasing a red herring.  Look to the simple answers before diving into the deep end.  The simple solutions are the right fix 87.682% of the time.

If you are spending hours discussing minutiae of timestamp comparisons and getting into the bowels of a widely used framework like NHibernate then you are likely on the wrong path.  If your first thought is that your OutOfMemoryException errors are caused by a service pack bug instead of a code change, maybe you should look back over the history of that module you just updated.  (That would have been me, chasing a herring for an hour last week instead of noticing that a colleague had updated how a namespace was handled in a call to XmlDocument.CreateElement -- at my vague direction.  Yeesh.)


Time boxing is so important I've listed it twice. Avoid letting yourself get too frustrated over a problem. Take a walk, go get some fresh air. Shoot nerf bullets at your co-workers.  If you're working next to a colleague who's been churning for some time then show them some love and get them to take a break.

Simple, but critical.


None of these things I've written down here are rocket science.  You're likely doing all these and more, so please feel free to comment with things you find helpful, or resources you've found useful when trying to improve your troubleshooting skills.


Jeff Hunsaker pointed out one of the most elementary things which I'm shocked I missed:


Take a few minutes and look over what in your system changed since the bug/problem/issue appeared.  Look over the code history. Look over the data.  Look over the environment your system's running in.  Something's likely changed, but you need to be realistic and careful about going too far down a rabbit hole.  See that part about "Look to the Simplest Stuff First" above.
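
For the environment side of that, one cheap trick is to diff a snapshot of settings from when things worked against one from now.  The sketch below is only an illustration: the setting names and values are invented, and how you capture the snapshots (config files, environment variables, installed package lists) is up to you.

```python
from typing import Any, Dict, Tuple

MISSING = "<missing>"  # marker for a key present in only one snapshot


def what_changed(before: Dict[str, Any],
                 after: Dict[str, Any]) -> Dict[str, Tuple[Any, Any]]:
    """Map each key that differs to its (before, after) pair."""
    changed = {}
    for key in sorted(set(before) | set(after)):
        old, new = before.get(key, MISSING), after.get(key, MISSING)
        if old != new:
            changed[key] = (old, new)
    return changed


# Hypothetical snapshots of the same system on two different days.
yesterday = {"framework": "3.5", "conn_timeout": 30, "log_level": "INFO"}
today = {"framework": "3.5 SP1", "conn_timeout": 30, "feature_x": True}

diff = what_changed(yesterday, today)
```

Anything that shows up in `diff` is a suspect worth a few minutes before you go spelunking in the code.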


Arnulfo Wing said...

Egg timer? Hummm. I *WILL* start using that tip. Sometimes I find myself just asking for one more hour. Even though I know well that NOTHING takes an hour.

When debugging, I like to white-board a lot!

I also use Hypersnap and take snapshots of the different states of the data/application/errors/etc. Then I print them out and lay them on the desk. This is the time I use to step away from the computer and use good old pen/paper/highlighter.

BTW, is that debugging book on the Jimazon collection at QSI HQ? ;)

Good post!

Steve Horn said...

I like everything you pointed out in your post...and you're right they're all seemingly obvious, but it's good to see these points brought out by someone else to solidify them.

With regards to telling your problems "to the bear"... I can't tell you how many times I've opened an IM chat window to ask a question or get a second opinion and found my error. I found my problem simply because I had to stop and formulate my question in a coherent manner so that the guy I'm talking to can understand. And usually when I read back that question, my problem is obvious.

I checked out Agans' book from the Jimazon library. Thanks for the tip.

Jim Holmes said...

@arnulfo Nothing beats a whiteboard! Good idea on making screenshots of the app. I like that a lot! (And will steal^H^H^H^H^Hborrow the idea.)

@steve Amazing how much time of our own (and others') we waste by not taking the very simple, powerful step of forming a question in a coherent fashion /first/!!

Jeff Hunsaker said...

Great post! For operational / production systems I always like to ask, "What has changed?" 9 times out of 10, if it worked yesterday and it doesn't work today, someone changed something.

Justin Kohnen said...


Great borking post. I'm going to be sure to pass this on.
