Heading Towards Zero Bugs in Production

Any team responsible for building and running software will have an intimate understanding of the trade-offs between deploying new code and fixing bugs.

Where there is code, there will be bugs. Where there is change, there will be new bugs. And there is a limit to how much time can be spent making sure code is bug-free. It’s impossible to prevent bugs completely.

The challenge is that unknown, latent bugs and failures are the most prevalent cause of outages in otherwise reliable systems. Latent failures, which some refer to as dark debt, show up clearly in many incident post-mortems that describe how systems crashed in surprising ways, how the systems’ behavior was not clearly understood beforehand, and how, despite hours of research, the root cause is still unknown. This is likely because when complex systems fail there is no single root cause. Failures happen because unknown, latent flaws act together to form an outage.

Latent flaws and bugs can be argued to be a side effect of complexity and of resilience itself. In the pursuit of building systems that run despite failure, we minimize the impact and urgency of individual flaws. For example, in high-availability setups we make sure that a system can keep operating during a single node failure. While this is good in the short term, it can lead us to treat the local failure as not important enough to prioritize. When we don’t fix problems, the list of latent flaws grows and systems become brittle over time.

Moving towards zero bugs in production will reduce vulnerability and make systems less surprising, simpler to operate, and easier to change.

Paving the path to zero bugs

Prevention of bugs eventually fails. Fortunately, there are multiple tactics we can use to form a strategy.

1. Run less code

Code that doesn’t run is code that won’t have bugs. Practicing code reuse and relying on third-party software-, platform- and infrastructure-as-a-service alternatives are all great strategies for having less code that you are responsible for. More time can then be spent caring for the code that you do run.

2. Build on standardized technology patterns

For software and services you do choose to run, it is good to choose technologies and patterns that others have had success with. By picking software where failures have already been rooted out in production, you will likely run with fewer latent flaws than you would with custom code. Also, by using best practices for deployment and maintenance, you reduce the risk of making mistakes.

3. Write tests to verify the successful path

You should know how your application is supposed to work. Writing tests that verify the logic and constraints of your code is a great way to make sure obvious bugs don’t make it into production. How much time should be spent on this is a source of debate, but I think we can all agree that there is value in producing such tests for your code.
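As a minimal sketch of what such a test can look like, the example below uses Python’s standard unittest module. The function apply_discount and its rules are hypothetical, invented purely for illustration; the point is only that the successful path, the boundaries, and the constraints each get an explicit check.

```python
import unittest


def apply_discount(price: float, percent: float) -> float:
    """Hypothetical business rule: reduce price by percent (0 to 100)."""
    if price < 0:
        raise ValueError("price must be non-negative")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return round(price * (1 - percent / 100), 2)


class ApplyDiscountTest(unittest.TestCase):
    def test_successful_path(self):
        # The behavior we expect for ordinary, valid input.
        self.assertEqual(apply_discount(200.0, 25), 150.0)

    def test_boundaries(self):
        # The edges of the allowed range are valid input too.
        self.assertEqual(apply_discount(80.0, 0), 80.0)
        self.assertEqual(apply_discount(80.0, 100), 0.0)

    def test_constraints_are_enforced(self):
        # Invalid input must be rejected rather than silently accepted.
        with self.assertRaises(ValueError):
            apply_discount(-1.0, 10)
        with self.assertRaises(ValueError):
            apply_discount(10.0, 101)
```

Saved in a file, these tests can be run with python -m unittest. They only cover what we already know to expect, which is why the remaining tactics matter.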

3. Use anomaly detection to surface surprises

For flaws that are already in the code, or that make their way into it with code changes, anomaly detection is the best way to surface them. Non-impacting failures and failures in the interactions between software are notoriously hard to test for and hard to detect with conventional observability tools, but they can often be made obvious by identifying outliers and new, previously unseen events in the data.
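To make the idea concrete, here is a deliberately small sketch of both kinds of signal, written with the Python standard library only. It is an illustration under simple assumptions, not how any particular anomaly detection product works: find_outliers flags numeric samples using a modified z-score based on the median and the median absolute deviation, and find_new_events reports events that never occurred in a baseline. The function names, the 3.5 cutoff, and the sample data are all choices made for this example.

```python
from statistics import median


def find_outliers(values, threshold=3.5):
    """Return (index, value) pairs whose modified z-score exceeds threshold.

    The score is based on the median and the median absolute deviation
    (MAD), which are distorted less by the outliers themselves than the
    mean and standard deviation would be.
    """
    med = median(values)
    mad = median(abs(v - med) for v in values)
    if mad == 0:
        # No spread to measure against, so this simple sketch flags nothing.
        return []
    return [
        (i, v)
        for i, v in enumerate(values)
        if 0.6745 * abs(v - med) / mad > threshold
    ]


def find_new_events(baseline, current):
    """Return events in current that never appeared in baseline, in order."""
    seen = set(baseline)
    new = []
    for event in current:
        if event not in seen:
            seen.add(event)  # report each new event only once
            new.append(event)
    return new


latencies_ms = [102, 98, 101, 97, 103, 99, 100, 950, 101, 98]
baseline_events = ["connection opened", "connection closed", "cache hit"]
current_events = ["cache hit", "connection opened",
                  "replica lag detected", "replica lag detected"]

print(find_outliers(latencies_ms))
print(find_new_events(baseline_events, current_events))
```

With this sample data, only the 950 ms request is flagged as an outlier, and "replica lag detected" is the only new event. Neither necessarily caused an error for a user, which is exactly why such signals tend to go unnoticed without this kind of check.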

4. Introduce chaos to expand runtime patterns

Latent failures are latent until something triggers them. If you can find ways of triggering bugs, then you have an opportunity to fix them on your own schedule. And there are plenty of ways to do it. Fuzzing inputs to applications, running load tests in staging environments, or letting a chaos monkey randomly turn things off will likely reveal the boundaries of success faster.
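Of these, input fuzzing is the easiest to show in a few lines. The sketch below is a toy, not a replacement for a real fuzzing tool: parse_percentage is a hypothetical function written for this example with a deliberately planted flaw, and fuzz simply throws short random strings at it. It treats ValueError as the documented way to reject bad input and records any other exception as a finding.

```python
import random
import string


def parse_percentage(text: str) -> int:
    """Hypothetical function under test: parse strings such as '42%'.

    It is documented to raise ValueError for bad input, but it contains a
    planted latent flaw: an empty string raises IndexError instead.
    """
    if text[-1] != "%":  # latent flaw: fails on "" before any validation
        raise ValueError("missing % sign")
    value = int(text[:-1])  # raises ValueError for non-numeric text
    if not 0 <= value <= 100:
        raise ValueError("percentage out of range")
    return value


def fuzz(target, runs=1000, seed=0):
    """Call target with random short strings and collect unexpected errors.

    Returns a dict that maps each unexpected exception type name to the
    first input that triggered it.
    """
    rng = random.Random(seed)  # fixed seed keeps the run reproducible
    alphabet = string.digits + "%- a"
    findings = {}
    for _ in range(runs):
        length = rng.randint(0, 4)
        candidate = "".join(rng.choice(alphabet) for _ in range(length))
        try:
            target(candidate)
        except ValueError:
            continue  # documented rejection of bad input, not a bug
        except Exception as exc:
            findings.setdefault(type(exc).__name__, candidate)
    return findings


print(fuzz(parse_percentage))
```

The planted flaw is that an empty string raises IndexError, a case the successful-path tests would never exercise, and the fuzzer stumbles on it because it also generates zero-length inputs. Load tests and chaos experiments follow the same principle at the level of whole systems rather than single functions.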

5. Let developers own the full life cycle of code

Knowing that a bug exists is only the beginning; the real value is in fixing it. Make sure that the developers pushing code changes are also responsible for that code as it runs in production. This will lead to greater care, and fixes will likely come quicker because problems in production directly affect the developers’ life at work.

6. Make small changes frequently

If you deploy changes every day, you can deploy fixes every day, too. Smaller changes lead to smaller issues and smaller fixes. And by practicing frequent deploys you will likely add bugs frequently, too, and so become great at fixing and even preventing them. This is why high-performing DevOps teams can have 46 times more frequent deployments, a 7 times lower failure rate, and 2604 times faster recovery time, according to the Accelerate report by DORA.

Will we ever achieve zero bugs?

Probably not, but actually achieving zero bugs isn’t the point. Setting this goal puts us on the path towards building systems that are not only available, but also cheaper to operate, hold fewer surprises, can be run by less experienced operators, and are easier to change without fear.

With the rapid improvement of technology and practices around operating software, there are far greater opportunities than ever before to care for our code. It starts with setting the vision.

Written by Göran Sandahl
on April 18, 2019