Problem-Solving Techniques: The Sniper vs. the General
What is the best approach to follow when dealing with a critical incident in production?
It is 10:00 a.m. and the team receives an incident: “The product selector on the first page of our e-commerce site is showing only 50% of the products.” Problem-solving starts.
The team owning the e-commerce site starts the investigation. One team member says that the problem is most likely in the ETL that builds the product cache feeding this selector, which runs at 9:00 a.m., so he is going to rebuild the cache manually. It will take 30–40 minutes to complete and show the results on the page.
At 11:00 a.m., after the cache rebuild, the issue persists. The same team member arrives with another idea: “It’s probably the sharding in the cache management. One or more nodes are probably misbehaving. Let’s remove the sharding; it will increase the cost temporarily, but the issue will be solved.”
Sharding removed, cache rebuilt, and at 12:30 p.m. the team realizes that still only 50% of the products are available. New idea: “Last week we updated the front-end component of the product, and probably some of them are not being displayed.” The team rolls back the component release with the same result. The day is finishing, and the issue persists.
I’ve seen the previous approach thousands of times. There is an issue in a very complex system with a lot of moving parts, and someone immediately thinks of a very likely root cause and applies the solution, only to discover that this was not the cause. This cycle can repeat several times, stretching the problem-solving to several hours or even days or weeks.
This is the guessing and sniping strategy. The engineer acts as a sniper: he selects a target (guessing) and removes it (sniping). If intuition is good, this technique can deliver a solution in record time. But if not, it can end in a whirlpool of attempts, wasted time, and frustration.
There is another approach to a complex issue: the divide and conquer strategy. It consists of looking at the system and all the parts involved and thinking of a check that can rule out a large share of those parts (divide). We keep dividing until the remaining area is so small that we can identify the problem (conquer).
For instance, in the initial example, another way to handle the case is to call the product API used by the web page and check whether all the products are there. If the API returns only 50% of the products, I can rule out the whole front end, the JavaScript, and any recent browser updates. If it returns all of them, the fault is on the front-end side instead. Either way, we would have divided the problem, discarded half of it, and made the remaining search half as large.
The main difference between sniping and dividing is the objective of the investigation. When sniping, we look for something wrong. When dividing, we look for something that will determine if one side or the other is affected.
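The dividing idea can be sketched as a bisection over the stages the product data flows through. This is only an illustration: the stage names, the observed product shares, and the `find_first_broken_stage` helper are all hypothetical, and the sketch assumes that once one stage outputs incomplete data, every later stage is incomplete too.

```python
def find_first_broken_stage(stages, is_healthy):
    """Bisect an ordered pipeline and return (first broken stage, checks made).

    Assumes the last stage is known to be broken (that is the incident) and
    that a fault propagates to every stage downstream of it.
    """
    lo, hi = 0, len(stages) - 1
    checks = 0
    while lo < hi:
        mid = (lo + hi) // 2
        checks += 1
        if is_healthy(stages[mid]):
            lo = mid + 1  # data is complete here, so the fault is downstream
        else:
            hi = mid      # data is already incomplete, fault is here or upstream
    return stages[lo], checks


# Hypothetical stages, in the order the data flows.
stages = ["product database", "ETL job", "product cache",
          "product API", "front-end component"]

# Invented measurements: share of the expected products seen at each stage.
observed_share = {
    "product database": 1.0,
    "ETL job": 1.0,
    "product cache": 1.0,
    "product API": 0.5,
    "front-end component": 0.5,
}

culprit, checks = find_first_broken_stage(
    stages, lambda stage: observed_share[stage] >= 1.0
)
print(f"First incomplete stage: {culprit} (found with {checks} checks)")
```

Each check asks which side of the pipeline is affected, not whether a specific suspect is guilty, so every answer shrinks the search area regardless of the result.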
Now we have two approaches to the same problem: the sniper approach that can eliminate the possible root causes one by one or the general approach that likes to see the big picture and play a strategy to divide and conquer the enemy. Let’s compare both techniques and see their pros and cons.
The sniper strategy can be very fast if the guesses are correct. How likely the guesses are to be correct depends on several factors:
The more senior and experienced in the systems our people are, the more likely the guess is correct.
The more complex the system, the less likely the guess is correct.
The more they iterate in trying solutions, the less likely the next guess is correct. That is, the first guess is much more likely to be correct than the second, and each guess that proves wrong leaves candidates that are less likely to be the root cause.
The first two points are quite obvious, but the third one deserves a deeper explanation. The team looks at the problem and thinks of some possible root causes. Unconsciously, they pick the most likely option first. As they try and fail, each new guess comes from further down that list, so it is statistically less likely to be the root cause.
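A small worked example may make the third point concrete. The probabilities below are invented purely for illustration, and `guess_curve` is a hypothetical helper: it assumes the team ranks its candidate causes by how likely each one seems up front and tries them in that order.

```python
# Hypothetical up-front likelihood of each candidate root cause.
# The numbers are invented; the long tail stands for causes nobody thought of.
priors = {
    "ETL cache build": 0.30,
    "cache sharding": 0.15,
    "front-end release": 0.10,
    "long tail of other causes": 0.45,
}


def guess_curve(priors, guesses):
    """For each guess in order, return (name, chance it is the cause, chance solved so far)."""
    rows = []
    solved = 0.0
    for name in guesses:
        p = priors[name]
        solved += p
        rows.append((name, p, round(solved, 2)))
    return rows


# The team tries its named suspects from most to least likely.
named = [name for name in priors if name != "long tail of other causes"]
ordered = sorted(named, key=priors.get, reverse=True)

rows = guess_curve(priors, ordered)
for name, p, solved in rows:
    print(f"{name:<20} hit chance {p:.0%}, solved so far {solved:.0%}")
```

Under these assumed numbers, each successive guess carries a smaller chance of being the cause, and after all three named suspects have been tried there is still a large share of cases left unsolved.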
On the other hand, the general approach is safer. It can take some time to get the situation under control, but the more time spent, the closer we are to the solution. It’s the opposite of the sniper approach, where the more time spent, the farther we are.
The sniper can be fast; the general is safer. What to choose? Both.
The best practice is to see if there is a clear candidate for sniping. The first options are highly likely to be the correct ones. Especially when dealing with systems that our people master. But as soon as we see that the first options are proved wrong, we need to sit down, analyse the big picture, and change the approach to the general’s one.
If you are a manager, this may be of interest to you. One situation I’ve experienced is a team sniping while time went by, so I decided to intervene and ask the team to change the approach. The first reaction of the team was: “We are the technical experts here; let us work!” And they were right. At the same time, I knew that the approach was not the right one. What to do in this kind of situation?
My lesson learned from those incidents and the one that I want to share with you is that teaching in the middle of the fire is not a good thing to do. When people are stressed, it is not the moment to start lecturing your team.
If the team is not confident and lacks leadership or coordination during the issue, they are going to be happy if you step into the situation. You are their manager, and most likely they are expecting it from you. That’s the support they need. In those cases, you have more options to interact. Don’t give lectures at this time, but agree on decisions with them that are in the right direction.
On the other hand, if the team is confident and has a clear direction, you should never step into it. Let them work. Your role at this time should be as a stakeholder asking how to help and what’s the status, expectations, and confidence.
Your role should be more in line with protecting them, removing blockers, keeping a log, and keeping communications with the wider organization. At some moment, they could be showing signs of lack of confidence, despair, and asking for help. That’s the door opening for you to enter and help them.
The moment for analyzing how we dealt with the incident, talking about the approach, and seeing what works better will come later, at the post-mortem or lessons-learnt stage. The teaching and lecturing on those best practices need to happen outside the crisis.