Life is good and the morning sky is blue. It’s going to be a productive, stress-free day. Then suddenly, an IT incident strikes – out of the blue. Every gauge, graph, and alert screams. You’ve been hit by an IT tidal wave. You firefight. What the? Where did it start? You hunt for the root cause, but your environment is complex. You try not to panic. You tinker and adjust for hours – but it’s just trial and error. You feel helpless and frustrated. At last, everything seems to be back to normal. For now. More luck than judgement. It could have been much worse. And you think to yourself: next time, no more nasty surprises. But how? What could we have done differently? There must be another way.
Situations like the one above are all too familiar. Modern systems – from web applications to industrial components and business systems – can fail without being noticed, then develop into an IT tidal wave.
The black box problem
Systems fail, change and get hacked without anyone noticing until it’s too late
How will you detect an unknown incident?
In the black box, incidents are spotted only when failure materialises: you can’t reach a service, the experience is frustratingly slow, or an important transaction is lost. This is what to expect of your black box, since you can’t see what it is up to.
In the black box, manipulations and security incidents go unnoticed. Intruders are free to act without restriction; their presence is detected only if a significant system failure occurs, or their own actions disclose it. This is what to expect of your black box, since visibility has always been out of reach.
In the black box, you can’t see how the system actually responds to change. The only way to detect a negative change is if the system fails, or causes others to. Verifying a change means waiting for an indefinite period. This is what to expect of your black box, since you can’t see its inner workings.
In the black box, grasping the cause of failure is incredibly difficult. Understanding an incident means delving into an undefined number of systems, looking for scattered clues and trying to find the root cause. It is a process largely based on guesswork, and without experience it is almost impossible. This is what to expect of your black box. As its internal behaviour is unknown, ignorance reigns.
What does this mean?
There will be over 26 billion units by 2020 – every single one a black box
IT is trusted to be the platform from which business and government services thrive. IT services store information and provide resources on which we rely daily. When they fail, there is a risk of losing data, financial instruments, brand identity and strength. As more and more of our services become digitalised, this risk increases. The impact of IT failure is no longer just an ‘inconvenience’.
As businesses and organisations face these failures, they invest heavily in the response – to cover damages and implement controls. Incidents demand resources and time. This all comes at a high cost. Ultimately, these costs are passed on to IT products and services as an (unnecessary) cost of doing business. As a result, IT products and services become overpriced, limiting widespread use.
As failure occurs, organisations become insular and defensive. In the aftermath of incidents, they sense a lack of control that can only be mitigated by implementing yet more controls. Ultimately, this stifles creativity and innovation. In essence, the impact and fallout of failure begets poor quality services.
Why did we get here?
The problem is growing quicker than the capability to improve
Rules are traditionally the primary means of detecting incidents. But rules have a major drawback: they require a definition – which means the incident type must already be known, and so must already have happened. Relying on rules means that only previously seen incidents are visible. New ones go undetected.
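As a toy illustration of this blind spot (the metric names, thresholds, and rule are invented for the sketch, not drawn from any particular monitoring product), a rule only fires for the failure mode it was written to describe:

```python
# Hypothetical sketch: a threshold rule detects only the incident type
# it was defined for. Names and thresholds are illustrative assumptions.

def cpu_rule(metrics):
    """Fires only when CPU usage crosses a predefined threshold."""
    return metrics.get("cpu_percent", 0) > 90

def detect(metrics, rules):
    """Return the names of all rules that fire for this metric sample."""
    return [rule.__name__ for rule in rules if rule(metrics)]

rules = [cpu_rule]

# A previously seen incident type: CPU saturation. The rule fires.
print(detect({"cpu_percent": 97}, rules))      # ['cpu_rule']

# A novel incident: memory exhaustion. No rule defines it, so nothing fires.
print(detect({"memory_percent": 99}, rules))   # []
```

The second sample is just as much an incident as the first, but because no rule was ever written for it, the black box stays silent.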
Searching, querying and investigating log data is the traditional way of investigating incidents. Just as with rules, it has a constraint: you need to know what to look for. For unknown incidents, finding the root cause involves costly guesswork and time. If the wrong cause is identified, the wrong fix is applied; if the root cause is never found, the incident may happen again. Both cases consume hours, if not days.
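The same constraint can be sketched in miniature (the service names, log lines, and search terms below are invented for illustration): a log search surfaces only what you already know to query for.

```python
# Hypothetical sketch: log search is only as good as the query.
# All log lines and search terms here are illustrative assumptions.

logs = [
    "2023-01-01T10:00:01 payment-svc INFO request ok",
    "2023-01-01T10:00:02 auth-svc ERROR token expired",
    "2023-01-01T10:00:03 payment-svc WARN retry queue growing",
]

def search(logs, term):
    """Return every log line containing the search term."""
    return [line for line in logs if term in line]

# If you already know the symptom, the query finds it immediately...
print(search(logs, "ERROR"))

# ...but a root cause you never thought to search for stays invisible,
# even while the real problem (a growing retry queue) sits in the logs.
print(search(logs, "disk full"))  # [] -- no hits, problem unnoticed
```

The data needed to diagnose the incident is present the whole time; the investigator simply has no way to know which term would have exposed it.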
Much can be done to prevent incidents and failure from occurring. Implementing redundancy and having thorough incident response plans and processes all help. But most incidents can’t be foreseen, and if the possibility of prevention failing hasn’t been accounted for, the impact will be even more drastic.
The modern way of building systems focuses on abstracting parts. Cloud computing, virtualisation, platforms and similar computing infrastructures obscure parts of the system. The difficulty is that it becomes harder to instrument, harder to investigate and harder to respond when things fail.
How can we change?
It’s not a question of money – it’s already being spent.
There are often excuses after an incident. Blaming complexity and dismissing the event as extremely ‘rare’ are common responses. Talk of investment often follows – but investment can only mitigate short-term damage. Unless the problem is addressed with openness and honesty, failure will continue to occur.
No one likes to wait. But building applications, services and systems without fully understanding how they work is short-sighted. This lack of insight is exposed when errors occur, whether the culprit is internal or external. Trust in technology, software and IT is effectively traded for short-term gains.
No doubt you’re investing in expensive tools, processes and experts. But it’s a problem of scale: it’s outgrowing your improvements. Everyone who provides software, apps and systems has the responsibility to think about ‘a different way’ rather than ‘more of the same’. And remember, it’s not an issue of cost – the money is already being spent, both directly and indirectly.