Life is good and the morning sky is blue. It’s going to be a productive, stress-free day. Then suddenly, an IT incident strikes – out of the blue. Every gauge, graph, and alert screams. You’ve been hit by an IT tidal wave. You firefight. What the? Where did it start? You hunt for the root cause, but your environment is complex. You try not to panic. You tinker and adjust for hours – but it’s just trial and error. You feel helpless and frustrated. At last, everything seems to be back to normal. For now. More luck than judgement. It could have been much worse. And you think to yourself: next time, no more nasty surprises. But how? What could we have done differently? There must be another way.
Situations like the one above are all too familiar. Modern systems – from web applications to industrial components and business systems – can fail without being noticed, then develop into an IT tidal wave.
The black box problem
Systems fail, change and get hacked without anyone noticing until it’s too late
How will you detect an unknown incident?
In the black box, incidents are spotted only when failure materialises: you can’t reach a service, the experience is frustratingly slow, or an important transaction is lost. This is what to expect of your black box, since you can’t see what it is up to.
In the black box, manipulations and security incidents go unnoticed. Intruders are free to act without restriction; their presence is detected only if a significant system failure occurs, or their own actions disclose it. This is what to expect of your black box, since visibility has always been out of reach.
In the black box, you can’t see how the system actually responds to change. The only way to detect a negative change is if the system fails, or causes others to. Verifying a change means waiting for an indefinite period. This is what to expect of your black box, since you can’t see its inner workings.
In the black box, grasping the cause of failure is incredibly difficult. Understanding an incident means delving into an undefined number of systems, looking for scattered clues and trying to find the root cause. It is a process largely based on guesswork, and without experience it is almost impossible. This is what to expect of your black box. As its internal behaviour is unknown, ignorance reigns.
What does this mean?
There will be over 26 billion units by 2020 – every single one a black box
IT is trusted to be the platform from which business and government services thrive. IT services store information and provide resources on which we rely daily. When they fail, there is a risk of losing data, financial instruments, brand identity and strength. As more and more of our services become digitalised, this risk increases. The impact of IT failure is no longer just an ‘inconvenience’.
As businesses and organisations face these failures, they invest heavily in the response – to cover damages and implement controls. Incidents demand resources and time. This all comes at a high cost. Ultimately, these costs are passed on to IT products and services as an (unnecessary) cost of doing business. As a result, IT products and services become overpriced, limiting widespread use.
As failure occurs, organisations become insular and defensive. In the aftermath of incidents, they sense a lack of control that can only be mitigated by implementing yet more controls. Ultimately, this stifles creativity and innovation. In essence, the impact and fallout of failure begets poor quality services.
Why did we get here?
The problem is growing quicker than the capability to improve
Rules are traditionally the primary means of detecting incidents. But rules have a major drawback: they require a definition – which means the incident type must already be known, and so must already have happened. Relying on rules means that only previously seen incidents are visible. New ones go undetected.
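As a toy illustration of this blind spot (the metric names, thresholds, and rule are invented for the sketch, not drawn from any particular monitoring product), a rule only fires for the failure mode it was written to describe:

```python
# Hypothetical sketch: a threshold rule detects only the incident type
# it was defined for. Names and thresholds are illustrative assumptions.

def cpu_rule(metrics):
    """Fires only when CPU usage crosses a predefined threshold."""
    return metrics.get("cpu_percent", 0) > 90

def detect(metrics, rules):
    """Return the names of all rules that fire for this metric sample."""
    return [rule.__name__ for rule in rules if rule(metrics)]

rules = [cpu_rule]

# A previously seen incident type: CPU saturation. The rule fires.
print(detect({"cpu_percent": 97}, rules))      # ['cpu_rule']

# A novel incident: memory exhaustion. No rule defines it, so nothing fires.
print(detect({"memory_percent": 99}, rules))   # []
```

The second sample is just as much an incident as the first, but because no rule was ever written for it, the black box stays silent.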
Searching, querying and investigating log data is the traditional way of investigating incidents. Just as with rules, it has a constraint: you need to know what to look for. For unknown incidents, finding the root cause involves costly guesswork and time. If the wrong cause is identified, the wrong fix is applied; if the root cause is never found, the incident may happen again. Both cases consume hours, if not days.
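The same constraint can be sketched in miniature (the service names, log lines, and search terms below are invented for illustration): a log search surfaces only what you already know to query for.

```python
# Hypothetical sketch: log search is only as good as the query.
# All log lines and search terms here are illustrative assumptions.

logs = [
    "2023-01-01T10:00:01 payment-svc INFO request ok",
    "2023-01-01T10:00:02 auth-svc ERROR token expired",
    "2023-01-01T10:00:03 payment-svc WARN retry queue growing",
]

def search(logs, term):
    """Return every log line containing the search term."""
    return [line for line in logs if term in line]

# If you already know the symptom, the query finds it immediately...
print(search(logs, "ERROR"))

# ...but a root cause you never thought to search for stays invisible,
# even while the real problem (a growing retry queue) sits in the logs.
print(search(logs, "disk full"))  # [] -- no hits, problem unnoticed
```

The data needed to diagnose the incident is present the whole time; the investigator simply has no way to know which term would have exposed it.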
Much can be done to prevent incidents and failure from occurring. Implementing redundancy and having thorough incident response plans and processes all help. But most incidents can’t be foreseen, and if the possibility of prevention failing hasn’t been accounted for, the impact will be even more drastic.
The modern way of building systems focuses on abstracting parts. Cloud computing, virtualisation, platforms and similar computing infrastructures obscure parts of the system. The difficulty is that it becomes harder to instrument, harder to investigate and harder to respond when things fail.
How can we change?
It’s not a question of money – it’s already being spent.
There are often excuses after an incident. Blaming complexity and dismissing the event as extremely ‘rare’ are common responses. Talk of investment often follows – but investment can only mitigate short-term damage. Unless the problem is addressed with openness and honesty, failure will continue to occur.
No one likes to wait. But building applications, services and systems without fully understanding how they work is short-sighted. This lack of insight is exposed when errors occur, whether the culprit is internal or external. Trust in technology, software and IT is effectively traded for short-term gains.
No doubt you’re investing in expensive tools, processes and experts. But it’s a problem of scale: it’s outgrowing your improvements. Everyone who provides software, apps and systems has the responsibility to think about ‘a different way’ rather than ‘more of the same’. And remember, it’s not an issue of cost – the money is already being spent, both directly and indirectly.