Patterns of incident remediation

This post is a summary of a talk presented on 27 February 2018 at the Stockholm DevOps Meetup about patterns of behavior when remediating incidents.

Recently, John Allspaw proposed an interesting thought experiment in his talk at the DevOps Enterprise Summit:

“Imagine your organization. What would happen if today at six o’clock all of your employees took their hands off the keyboard? They don’t answer any pages. They don’t look at any alerts. They do not touch any part of it, application code or networks or any of it. Are you confident that your service will be up and running after a day?”

I asked the same question at the start of the talk, and the consensus in the room was that most systems would stop running in less than 24 hours.

Our engineers and other team members do a lot to keep systems running. Part of that work is remediating incidents, large and small, in running systems.

In my previous role as Program Manager for Spotify’s cloud migration, I had a unique opportunity to observe a large number of teams remediate incidents.

Spotify had 1000+ backend services in 2017, all of which needed to be migrated to Google Cloud Platform. To help engineering teams lift and shift their backend services from our on-premises data centers to Google's cloud, we staffed a small engineering team and ran joint migration sprints with the teams that owned those services.

Each sprint ended with reliability and failure exercises to build trust in the new cloud environment and to practice remediating problems on the new platform. During these exercises we became a small team of human chaos monkeys, deliberately injecting errors and failures into the newly migrated services while they took (limited) production traffic.

The observations behind these patterns of incident remediation come from this migration process across approximately 100 engineering teams. Based on that experience, we identified five patterns that help engineers remediate incidents.

Pattern I: Discuss business impact as a trade-off

Most engineering teams have a good gut feeling about the impact of an incident. Engineering teams that were good at remediation focused on business impact rather than engineering pain, and had an active conversation with “the business” about impact and reliability goals.

They framed these discussions as trade-offs, which is important: ask anyone what level of reliability they would prefer and they will almost always expect at least 99.999%. Success in these conversations depends on everyone understanding what these trade-offs cost.
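To make that cost concrete, here is a rough back-of-the-envelope calculation (not from the talk) of how much downtime per year each additional “nine” actually allows:

```python
# Back-of-the-envelope: allowed downtime per year for common availability targets.
SECONDS_PER_YEAR = 365 * 24 * 60 * 60

for target in (0.99, 0.999, 0.9999, 0.99999):
    downtime_s = SECONDS_PER_YEAR * (1 - target)
    print(f"{target * 100:.3f}% availability -> "
          f"~{downtime_s / 3600:.1f} hours ({downtime_s / 60:.0f} minutes) of downtime per year")
```

Going from 99.9% to 99.999% means shrinking the yearly downtime budget from roughly nine hours to about five minutes, and that difference is what the conversation with the business should be about.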

Pattern II: Move from MTBF to MTTD / MTTR

Early in the life of a system, Mean Time Between Failure (MTBF) for system components is important: failure modes are unknown, and most of your incidents will be caused by unambiguous failures such as a full disk, a broken server or instance, a crashed application, a melted switch or another common error. A typical pattern is that as soon as you find one of these causes, you build a metric graph for the relevant component. Successful teams recognize the point where their investment in MTBF is no longer paying off and shift to Mean Time To Detect (MTTD) and Mean Time To Repair (MTTR).

This is partly because well-working components do not guarantee a well-working system: failures emerge from the interactions between components. And partly because some failure is inevitable, so improving reliability means focusing on your ability to rapidly detect and respond to those failures.
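As a minimal sketch of what tracking these metrics could look like, assuming incidents are recorded with timestamps for when they started, were detected and were resolved (the data structure below is illustrative, not a tool from the talk):

```python
from dataclasses import dataclass
from datetime import datetime
from statistics import mean

@dataclass
class Incident:
    started: datetime    # when the failure actually began
    detected: datetime   # when an alert fired or a human noticed
    resolved: datetime   # when service was restored

def mttd_minutes(incidents):
    """Mean Time To Detect: average gap between failure start and detection."""
    return mean((i.detected - i.started).total_seconds() / 60 for i in incidents)

def mttr_minutes(incidents):
    """Mean Time To Repair: average gap between detection and resolution.
    (Definitions vary; some teams measure from failure start instead.)"""
    return mean((i.resolved - i.detected).total_seconds() / 60 for i in incidents)
```

The point is less about the exact formula and more about what you optimize: driving these two numbers down pays off long after chasing individual component failures stops doing so.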

Pattern III: Practice investigation and remediation

Investigating and resolving incidents takes skill. The worst time to practice these skills is at 2 AM when your pager goes off, you haven’t had enough sleep, and you are stressed because of the incident.

Like any skill, incident investigation and remediation improve with practice, and more experienced teams know this. Practicing is also a great way for members of an engineering team to learn from each other. Beyond that, going through an incident simulation lets you exercise and improve the overall incident response process: communication, escalation, and the PR and security aspects.

Pattern IV: Breadth-first search followed by a depth dive

So how do teams actually investigate ongoing incidents? A common behavior pattern we observed is a breadth-first search across possible hypotheses for an incident, picking the most likely candidate and then doing a depth dive into that hypothesis. Rinse and repeat. This is a really effective strategy to quickly cycle through a set of possible contributors to an incident and find your options to resolve it, and it lets a team parallelise the work on the most likely hypothesis. This strategy is also described in the STELLA report and in Allspaw’s talk at the DevOps Enterprise Summit.
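A rough sketch of that loop, with the hypothesis list and check functions as illustrative placeholders rather than anything we actually ran:

```python
def investigate(hypotheses, deep_dive, explains_incident):
    """Breadth-first over hypotheses, then a depth dive on the best candidate.

    hypotheses        -- list of (description, quick_check) pairs, where quick_check()
                         returns a rough likelihood score from a cheap test
                         (a log grep, a glance at a graph, ...)
    deep_dive         -- expensive investigation the team swarms on for one candidate
    explains_incident -- does what the deep dive found actually explain the incident?
    """
    remaining = list(hypotheses)
    while remaining:
        # Breadth first: run the cheap check on every open hypothesis.
        remaining.sort(key=lambda h: h[1](), reverse=True)
        description, _ = remaining.pop(0)
        # Depth dive: the team parallelises work on the most likely candidate.
        findings = deep_dive(description)
        if explains_incident(findings):
            return description, findings
        # Rinse and repeat with the hypotheses that are left.
    return None, None
```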

Pattern V: Disproving correlations is easier than proving causality

Often a hypothesis about a contributing factor to the incident under investigation starts with a correlation. The natural question to ask once you have found a potential correlation between your incident and some system behavior or metric is: “is this true?”

However, it is often much easier to test and thereby answer the question: “is this false?”

Disproving a correlation is often much easier than proving that the correlation reflects causation.

Comic by xkcd.com
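To illustrate why falsification is quicker, here is a deliberately crude sketch (the event names and data are made up): a single counter-example, for instance the suspected factor occurring without the symptom, is enough to drop a strict “X always comes with Y” hypothesis, while confirming causation would take far more work.

```python
def always_co_occur(observations, suspect, symptom):
    """Crude falsification check for the hypothesis 'suspect always comes with symptom'.

    observations -- list of dicts such as {"cron_ran": True, "errors_spiked": False}
    Returns False as soon as a single counter-example disproves the hypothesis.
    """
    for obs in observations:
        if obs[suspect] and not obs[symptom]:
            return False   # suspect was present without the symptom: hypothesis disproved
    return True            # no counter-example found; still only a correlation, not causation

# Illustrative usage: did a nightly cron job cause the error spikes?
observations = [
    {"cron_ran": True, "errors_spiked": True},
    {"cron_ran": True, "errors_spiked": False},  # cron ran, no spike
]
print(always_co_occur(observations, "cron_ran", "errors_spiked"))  # False
```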

The unanswered pattern: Is this “normal” behavior?

The final pattern is one that lacks strong tooling support or a clear path for improvement: understanding “normal” system behavior. When evaluating hypotheses about contributing factors to an incident, a really common question is “does this behavior happen normally, or is something different?”

Most dashboards and analysis tools target incident detection, but do a poor job of showing and explaining the normal, repeating behavior of a system.

Answering this question usually relies on the experience of the people involved in incident remediation, but a rough estimate is that up to 70% of the time spent remediating an incident actually goes into sorting out the answer.
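One rough way to approximate an answer, sketched here with made-up metric values and an arbitrary threshold, is to compare the current value of a metric against the same window in a previous week:

```python
def looks_normal(current, baseline, tolerance=0.25):
    """Crude 'is this normal?' check: compare the current value of a metric
    against a historical baseline (e.g. the same hour one week earlier) and
    flag anything that deviates by more than `tolerance` (25% by default)."""
    if baseline == 0:
        return current == 0
    return abs(current - baseline) / baseline <= tolerance

# Illustrative usage: request rate now vs. the same hour last week
print(looks_normal(current=1180, baseline=1000))  # True  (18% above baseline)
print(looks_normal(current=2600, baseline=1000))  # False (160% above baseline)
```

This kind of baseline comparison only scratches the surface; it says nothing about behavior that is periodic, seasonal, or normal only under certain load patterns, which is exactly where the tooling gap sits.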

Ask yourself: if a new member joined your team today, how would you show them what the system you operate looks like in its normal state? If anyone has good tools or options for solving this problem, we would love to hear from you.

The complete slides from the talk are available for download, or if you want to connect with us, join our Slack community.

Written by Ramon van Alteren
on March 15, 2019