2 November 2021

Beyond the usual: incident management 101

Thomas Riboulet

Lead consultant

This article is aimed at Senior and Lead software engineers. DevOps engineers and SREs could also make use of it, but they are generally already aware of the practice. Junior people might also find this content useful to get a general understanding before jumping into their first on-call rotation or incident resolution.

Let’s be direct and honest: incidents will happen. It’s not a question of ‘if’, it’s a question of ‘when’. So, our usual strategy at Imfiny is to prepare for them and train people to be ready. If you are ready, there is a better chance you will stay cool in the face of the fire.

Why structure the response

Incidents range from minor to critical, just like fires. It might be just a little grease fire, or it might be a huge blaze going through a whole storage unit. But what starts as a grease fire might end with a complete building being destroyed. It all depends on how the fire, or incident, is handled.

  • Being prepared means, first, having a way to be aware of the fire (monitoring and alerting); a minimal health-check sketch follows this list.
  • Being prepared means knowing what to do when the symptoms of the fire are spotted (smoke, flames).
  • Being prepared means knowing how to organise the team to tackle the fire.
  • Being prepared means knowing how and where to communicate while the team is responding to the incident.
  • Being prepared means knowing how to organise the subsequent work needed to complement the incident resolution.
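On the first point, monitoring and alerting deserve a proper stack (Prometheus, Datadog, PagerDuty, …), but the core idea fits in a few lines. The following Python sketch is purely illustrative: the health endpoint, the interval and the threshold are assumptions, and the “alert” is just a print where a real setup would page the on-call person.

    import time
    import urllib.request

    # Hypothetical health endpoint exposed by the service we want to watch.
    HEALTH_URL = "https://example.com/health"
    CHECK_INTERVAL_SECONDS = 30
    FAILURES_BEFORE_ALERT = 3

    def is_healthy(url: str) -> bool:
        """Return True when the endpoint answers with an HTTP 2xx status."""
        try:
            with urllib.request.urlopen(url, timeout=5) as response:
                return 200 <= response.status < 300
        except OSError:  # URLError, timeouts and connection errors all derive from OSError
            return False

    def watch() -> None:
        consecutive_failures = 0
        while True:
            if is_healthy(HEALTH_URL):
                consecutive_failures = 0
            else:
                consecutive_failures += 1
                if consecutive_failures >= FAILURES_BEFORE_ALERT:
                    # A real setup would page the on-call person here.
                    print(f"ALERT: {HEALTH_URL} failed {consecutive_failures} checks in a row")
            time.sleep(CHECK_INTERVAL_SECONDS)

    if __name__ == "__main__":
        watch()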

So, any team should have, at the very least, a defined way to structure the response to an incident. It should be clear what to do whenever a fire breaks out: gather a response team, figure out what’s happening, figure out and prepare a mitigation step or a resolution step, communicate with stakeholders along the way, and queue up future work if need be.

Those are the very basics of structuring the response to an incident. Most guides out there on the topic go into more detail, but that’s the gist of it. So, here follows one example of how to do that.

Some great references

First of all, we can’t start such a piece without referencing a few great resources on the topic, mostly out of the Google SRE books:

  • https://sre.google/sre-book/managing-incidents/
  • https://sre.google/workbook/incident-response/
  • https://sre.google/sre-book/emergency-response/

They are great, and our usual coaching runs along similar lines with some variations based on our experience.

A basic response process

When an incident starts, a team of at least two people should gather (depending on the team size); a few more from the start can be helpful, but it is not a must.

Those two people:

  • gather in a “situation room”: physical or virtual, that’s their choice, but they should be able to talk to each other (audio at least, video works nicely too).
  • have a way to keep track of what’s being done, found and tried: a shared document of sorts (the incident response log). A small sketch of such a log follows this list.
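What that shared document looks like matters less than the discipline of keeping it up to date. As a purely illustrative sketch (the file name and the entry fields are assumptions, and in practice a shared Google Doc or wiki page does the job just as well), even an append-only log of timestamped entries is enough:

    from datetime import datetime, timezone
    from pathlib import Path

    # Hypothetical location for the shared incident response log.
    LOG_FILE = Path("incident-2021-11-02.log")

    def log_entry(author: str, kind: str, message: str) -> None:
        """Append a timestamped entry; kind is e.g. 'observed', 'decided' or 'done'."""
        timestamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        with LOG_FILE.open("a", encoding="utf-8") as log:
            log.write(f"{timestamp} [{kind}] {author}: {message}\n")

    # Example entries made during an incident:
    log_entry("alice", "observed", "Error rate on the checkout API is above 20%")
    log_entry("bob", "decided", "Roll back the 14:05 deploy")
    log_entry("bob", "done", "Rollback finished, error rate back to normal")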

Then those first two people should focus on two things:

  • Communication to stakeholders: “this is happening”, at least to the engineering team but ideally to the whole company. Updates, even brief ones just stating “we are still figuring out the issue”, should be issued regularly. A specific place to do that should be identified and known across the whole company (a specific Slack channel, a specific page on the internal tech website, …). A short sketch of posting such an update follows this list.
  • Identify the issue: once the incident is spotted, it’s important to know how big it is in order to figure out a plan and call up help if need be.
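For the communication side, the updates themselves can even be posted with a couple of lines of code. The sketch below uses a Slack incoming webhook; the webhook URL is a placeholder to replace with your own, and any chat tool with a similar webhook would work the same way.

    import json
    import urllib.request

    # Placeholder: a real incoming webhook URL is provisioned in your Slack workspace.
    SLACK_WEBHOOK_URL = "https://hooks.slack.com/services/XXX/YYY/ZZZ"

    def post_status_update(text: str) -> None:
        """Send a short incident status update to the agreed channel."""
        payload = json.dumps({"text": text}).encode("utf-8")
        request = urllib.request.Request(
            SLACK_WEBHOOK_URL,
            data=payload,
            headers={"Content-Type": "application/json"},
        )
        with urllib.request.urlopen(request, timeout=5) as response:
            response.read()  # Slack answers with a plain "ok" body on success

    # Even a brief update counts, as long as it is regular:
    post_status_update("Incident update, 14:30 UTC: checkout errors still under investigation.")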

Two things to do, two people: it should be obvious that one person handles the communication while the other is in charge of figuring out and solving the issue.

Once that is started, it might be time to call in more people to help. If that happens, it’s important to properly structure the team around the following roles:

  • Communication
  • Operational work
  • Incident Command
  • Planning

The reason behind that is to keep responsibilities narrow, allowing each individual to focus on their own responsibility and do it well.

At first, “operational work” and “incident command” might be merged into one role while “communication” and “planning” are merged into another. But as the number of people involved grows, it’s important to split the roles out accordingly.

If several people start doing operational work, it’s important to have them do it in an organized way: that’s the role of incident command. Having people stepping on each other’s toes while solving an incident is a very dangerous thing that can have heavy consequences.

In a similar way, having someone take care of the communication allows the rest of the team to stay focused on their work. That person not only updates the rest of the company but might also take care of keeping track of what’s been seen, decided and done in the incident response log.

It can also be useful to have someone take care of filing tickets for tasks to be done later, ordering food or drinks, and organizing hand-offs when needed.

What should be done?

As said before, the first step is to figure out what’s happening and then communicate about it.

This can include grading the incident (one way of encoding these levels is sketched after the list):

  • Minor: there is no impact on the core features of the product.
  • Average: product use is degraded.
  • Critical: product use is impossible or severely degraded.
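How these levels are named and assessed is up to each team. The small Python sketch below is just one way of making the grading explicit; the level names match the list above, while the grading rule itself is a deliberately simplified assumption.

    from enum import Enum

    class Severity(Enum):
        MINOR = "minor"        # no impact on the core features of the product
        AVERAGE = "average"    # product use is degraded
        CRITICAL = "critical"  # product use is impossible or severely degraded

    def grade(core_features_down: bool, degraded: bool) -> Severity:
        """A deliberately simple grading rule; real criteria are team specific."""
        if core_features_down:
            return Severity.CRITICAL
        if degraded:
            return Severity.AVERAGE
        return Severity.MINOR

    # The grade can (and should) be re-evaluated as the incident evolves:
    current = grade(core_features_down=False, degraded=True)   # Severity.AVERAGE
    current = grade(core_features_down=True, degraded=True)    # Severity.CRITICAL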

The severity of an incident can also evolve while it’s being responded to. As pointed out in the comparison with fire: a small grease fire can grow into a full-blown blaze engulfing a building. A minor incident can evolve into a critical one, either because consequences pile up over time or because a bad idea makes things worse.

With the severity figured out, the impact of the problem often becomes clear too. This might require some additional investigative work, though, and that’s ok.

In any case, the team should then turn toward figuring out whether the problem can be solved quickly, or whether a mitigation is possible or needed before reaching full resolution.

Then the people in charge of the operational work should get to work as directed by Incident Command.

During that process:

  • Communication should be done regularly: keep the stakeholders and the team up to date.
  • Introspection should happen: are you ok? Are you panicking? Are you feeling overwhelmed?
  • Consider alternatives: you might have settled on a path to resolution at some point, but there is nothing wrong with changing course if you find new facts. Don’t hesitate to reassess the plan with your team from time to time.

Once a solution has been found and put in place, the team should go through a debrief. If the log has been kept properly during the incident response, the incident document should already be enough to go through what happened and what has been seen, decided and done. At that point the aim moves from “solving” to finally writing down why the incident happened, what was seen, what was done, what the consequences are of both the incident itself and of the resolution steps taken, and what we learn from all this for the future.

At the end of the debrief, and in the days following it, a final report should be prepared and presented to the team and stakeholders. This is what we call a “post mortem”. It should state clearly the severity of the issue, how and why the incident happened, how it was tackled, and what we can learn from all this.

Staying ready

“Practice makes perfect”, as the saying goes. So don’t hesitate to prepare for incidents by faking some, or by using support issues as incidents and solving them as if they were real incidents. You can also do some role playing, with one lead or senior engineer playing the part of the eyes and ears of the team during a prepared incident scenario.

This can go as far as actually creating incidents voluntarily with tools such as the famous “Chaos Monkey” from Netflix.
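As a very rough illustration of that idea, and nothing more, the sketch below picks one EC2 instance among those explicitly opted in through a tag and terminates it. The tag name is an assumption, the real Chaos Monkey is far more careful about scheduling and safety, and something like this should only ever run against infrastructure designed to survive it.

    import random

    import boto3  # AWS SDK for Python

    def terminate_random_opted_in_instance() -> None:
        """Terminate one running instance tagged 'chaos-opt-in=true', if any."""
        ec2 = boto3.client("ec2")
        reservations = ec2.describe_instances(
            Filters=[
                {"Name": "tag:chaos-opt-in", "Values": ["true"]},
                {"Name": "instance-state-name", "Values": ["running"]},
            ]
        )["Reservations"]
        instance_ids = [
            instance["InstanceId"]
            for reservation in reservations
            for instance in reservation["Instances"]
        ]
        if not instance_ids:
            print("No opted-in instances found, nothing to do.")
            return
        victim = random.choice(instance_ids)
        print(f"Terminating {victim} to exercise the team's incident response.")
        ec2.terminate_instances(InstanceIds=[victim])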

Finally, incident response isn’t the sole responsibility of a select few on the team, nor are the roles within the incident response team reserved for specific people. Everyone should be familiar with each role and facet of incident response. So don’t be afraid of taking up a role you have not done before, or of getting someone to do a role they are not used to. The aim is not to put someone at risk of failure but to give them the opportunity to learn and hone their skills.

We can help

We have helped several teams set up or improve their incident response strategies. This outline is but a base for such a strategy, and we would be happy to chat with you to see how to put one in place for your team. Contact us, let’s have a chat.

A RubyOnRails consultancy based in the EU, we build your applications and services all over the world!

Contact

contact+rails@imfiny.com
