Thomas Riboulet, Lead consultant
This article is aimed at Senior and Lead software engineers. DevOps engineers and SREs could also make use of it, but they are generally already aware of the practice. Junior engineers might also find this content useful to get a general understanding before jumping into their first on-call rotation or incident resolution.
Let’s be direct and honest: incidents will happen. It’s not a question of “if”, it’s a question of “when”. So, our usual strategy at Imfiny is to prepare for them and train people to be ready. If you are ready, there is a better chance you will stay cool in the face of the fire.
Incidents range from minor to critical, just like fires: it might be a little grease fire, or it might be a huge blaze tearing through a whole storage unit. And what starts as a grease fire can end with a whole building destroyed. It all depends on how the fire, or incident, is handled.
So, any team should have, at the very least, a defined way to structure the response to an incident. It should be clear what to do when one breaks out: gather a response team, figure out what’s happening, figure out and prepare a mitigation or resolution step, communicate to stakeholders along the way, and queue follow-up work if need be.
That is the most basic way to structure the response to an incident. Most guides on the topic go into more detail, but that’s the gist of it. So, here follows one example of how to do it.
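The basic steps above can be sketched as a minimal, ordered checklist. This is a purely illustrative sketch — the class and step names are our own invention, not a prescribed tool:

```ruby
# The basic incident response steps from the text, as an ordered checklist.
# All names here are illustrative assumptions, not a real library.
RESPONSE_STEPS = [
  "Gather a response team",
  "Figure out what is happening",
  "Prepare a mitigation or resolution step",
  "Communicate to stakeholders",
  "Queue follow-up work"
].freeze

class IncidentRunbook
  def initialize
    @done = []
  end

  # Mark a step as done; unknown steps are rejected.
  def complete(step)
    raise ArgumentError, "unknown step: #{step}" unless RESPONSE_STEPS.include?(step)
    @done << step unless @done.include?(step)
  end

  # The next step still to be done, in order.
  def next_step
    (RESPONSE_STEPS - @done).first
  end

  def finished?
    (RESPONSE_STEPS - @done).empty?
  end
end
```

In practice the checklist lives in a shared document rather than in code, but even this toy version makes the point: the order is known in advance, so nobody has to improvise it at 3 a.m.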
First of all, we can’t start such a piece without referencing a couple of great resources on the topic, mostly out of the Google SRE books. They are great, and our usual coaching runs along similar lines, with some variations based on our experience.
When an incident starts, a team of at least two people should gather (depending on the team size); a few more from the start can be helpful, but it’s not a must.
Those first two people should focus on two things: communicating about the incident, and investigating it. Two things to do, two people: one should be the communication person, and the other should be in charge of figuring out and solving the issue.
Once that is started, it might be time to call up more people to help. If that happens, it’s important to properly structure the team around four roles: incident command, operational work, communication, and planning.
The reason behind this is to keep responsibilities limited, allowing each individual to focus on their own responsibility and do it well.
At first, “operational work” and “incident command” might be merged into one role, while “communication” and “planning” are merged into another. But as the number of people involved grows, it’s important to split the roles accordingly.
If several people are doing operational work, it’s important to have them work in an organized way: that’s the role of incident command. Having people stepping on each other’s toes while solving an incident is a very dangerous thing that can have heavy consequences.
Similarly, having someone take care of communication allows the rest of the team to stay focused on their work. That person not only updates the rest of the company, but can also keep a log of what has been seen, decided and done throughout the incident response.
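That incident log can be as simple as timestamped entries tagged by what they record. Here is a minimal sketch of that idea — the class, entry kinds and method names are assumptions of ours, not a standard:

```ruby
require "time"

# A minimal incident log sketch: the communication role appends timestamped
# entries recording what was seen, decided and done. Illustrative only.
class IncidentLog
  Entry = Struct.new(:at, :kind, :text)

  KINDS = %i[seen decided done].freeze

  def initialize
    @entries = []
  end

  def record(kind, text, at: Time.now.utc)
    raise ArgumentError, "kind must be one of #{KINDS}" unless KINDS.include?(kind)
    @entries << Entry.new(at, kind, text)
  end

  # Chronological plain-text dump, ready to paste into the debrief document.
  def to_report
    @entries.sort_by(&:at)
            .map { |e| "#{e.at.iso8601} [#{e.kind}] #{e.text}" }
            .join("\n")
  end
end
```

A shared text document or a chat channel works just as well; what matters is that every observation and decision gets a timestamp, because the debrief later relies on that timeline.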
It can also be useful to have someone taking care of filing tickets for tasks to be done later, ordering food or drinks and organizing hand-offs when needed.
As said before, the first step is to figure out what’s happening and then communicate about it.
This can include grading the incident on a severity scale, from minor to critical.
The severity of an incident can also evolve while it is being responded to. As the fire comparison points out, a small grease fire can grow into a full-blown blaze engulfing a building. Likewise, a minor incident can become a critical one, either because consequences pile up over time or because a bad idea makes things worse.
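One way to picture such a grading is an ordered scale where severity can be escalated but never quietly downgraded mid-incident. The exact labels below are an assumption (the text only names “minor” and “critical”); this is a sketch, not a prescribed scale:

```ruby
# A hedged sketch of incident grading: an ordered severity scale with
# escalation allowed, but no silent de-escalation while the incident is live.
# The intermediate :major level is our own assumption.
SEVERITIES = %i[minor major critical].freeze

class Incident
  attr_reader :severity

  def initialize(severity = :minor)
    raise ArgumentError, "unknown severity" unless SEVERITIES.include?(severity)
    @severity = severity
  end

  # Severity only moves up, mirroring how a grease fire can grow into a
  # blaze; lowering the grade is a deliberate, post-incident decision.
  def escalate_to(new_severity)
    raise ArgumentError, "unknown severity" unless SEVERITIES.include?(new_severity)
    return if SEVERITIES.index(new_severity) <= SEVERITIES.index(@severity)

    @severity = new_severity
  end
end
```

The one-way ratchet is the design choice worth keeping even if the labels differ: downgrading an incident in the heat of the moment is exactly the kind of “bad idea” that makes things worse.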
Once the severity is figured out, the impact of the problem usually follows. This might require some additional investigative work, and that’s OK.
In any case, the team should then turn toward figuring out whether the problem can be solved quickly, or whether a mitigation is possible or needed before reaching full resolution.
Then the people in charge of the operational work should get to work as directed by Incident Command.
During that process, communication to stakeholders should continue, and the incident log should be kept up to date with what has been seen, decided and done.
Once a solution has been found and put in place, the team should go through a debrief. If the log has been kept properly during the incident response, the incident document should already be enough to walk through what happened. At that point the aim moves from “solving” to finally putting down why the incident happened, what was seen, what was done, what the consequences are of both the incident itself and the resolution steps taken, and what we learn from all this for the future.
At the end of the debrief, and in the days following it, a final report should be prepared and presented to the team and stakeholders. This is what we call a “post mortem”. It should state clearly the severity of the issue, how and why the incident happened, how it was tackled, and the learnings we can take from all this.
“Practice makes perfect”, as the saying goes. So don’t hesitate to prepare for incidents by faking some, or by treating support issues as incidents and solving them as if they were real. You can also do some role playing, with one lead or senior engineer playing the part of the eyes and ears of the team during a prepared incident scenario.
This can go as far as deliberately creating incidents, with tools such as the famous Chaos Monkey from Netflix.
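In spirit, such tools inject failures on purpose so the team practices responding to them. Here is a toy sketch of that idea in miniature — a wrapper that makes a call fail some fraction of the time. It is purely illustrative and far simpler than Chaos Monkey, which terminates real instances:

```ruby
# A toy chaos-style fault injector: wrap a call and make it fail a given
# fraction of the time during drills. Illustrative sketch only.
class ChaosWrapper
  class InjectedFailure < StandardError; end

  def initialize(failure_rate:, rng: Random.new)
    @failure_rate = failure_rate # 0.0 = never fail, 1.0 = always fail
    @rng = rng
  end

  # Runs the given block, or raises an InjectedFailure instead.
  def call
    if @rng.rand < @failure_rate
      raise InjectedFailure, "chaos drill: injected failure"
    end

    yield
  end
end

# Example: wrap a flaky-by-design dependency call during a drill.
# drill = ChaosWrapper.new(failure_rate: 0.1)
# drill.call { fetch_orders } # fetch_orders is a hypothetical method
```

Only run something like this in environments where the team has agreed to drill, and make sure the injected failures are clearly labelled so they are never mistaken for a real outage during the debrief.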
Finally, incident response isn’t the responsibility of only a select few in the team, nor are the roles within the incident response team fixed. Everyone should be familiar with each role and facet of incident response. So don’t be afraid of taking up a role you have not done before, or of asking someone to take a role they are not used to. The aim is not to set someone up for failure, but to give them the opportunity to learn and hone their skills.
We have helped several teams set up or improve their incident response strategies. This outline is but a base for such a strategy, and we would be happy to chat with you about how to put one in place in your team. Contact us, let’s have a chat.
A RubyOnRails consultancy based in the EU, we build your applications and services all over the world!