Why We Will Always Need Incident Response (no matter how technology advances)
- Ky Nichol
- Nov 18, 2021
- 5 min read

Technology continues to evolve at a rapid rate, and with it the expectations for increased agility, accelerated delivery, and total resilience. There’s no doubt that we’re seeing sophistication on a new level, but we also need to acknowledge that the complexity of managing any business’s technology estate is increasing significantly across applications, infrastructure, teams and systems of record. So even as technology evolves and new challenges arise, things can still go wrong, and within that possibility lives the requirement for incident response - the need to respond to the unintended consequences of these increasingly complex systems. There will always be unintended consequences and events that weren’t planned for; the only difference is that the balance of technology and humans needed to plan for, detect and react to them has shifted.
We know that automation is accelerating change, but with change comes complexity, and that complexity means complete automation is near-enough impossible for a good while yet. We cannot account for every possible outcome in a given system, so we’ll still need humans to step in and manage those unintended consequences. I take some mathematical guidance on this from the wonderful attempt by Bertrand Russell and Alfred North Whitehead to bound everything in formal logic. Then Gödel’s incompleteness theorem arrived, if my vague recollections are correct, to show that any sufficiently powerful formal system contains truths it cannot prove - so it’s logically impossible to model the entirety of any complex system. What does that mean? These theorems help us understand that the formal systems we rely on are not complete - there’s still room for surprise - and, back in the context of business operations and incident response, we need to plan for that, with the technology to support it.

So there are going to be rough edges, and we know we can’t capture in its entirety the algorithm for a system to cope with every unintended consequence. Organizations today need to continue providing business services even in the face of adverse operational events, so we need to find the best way to identify and respond to disruptions. For now, that requires humans to step in and make important, nuanced decisions during this ‘unplanned’ work, where they may only be able to pave the next 15 minutes of steps until more data arrives and further sensible steps become clear. Together, humans and technology are accelerating that capability, but within the understanding that we will only ever know so much, and that this work is peculiar in that you rarely know the route to a successful conclusion at the start - things are revealed and progressed iteratively. For example, when responding to an incident where we don’t yet know the root cause, we can fire off cheap, fast automated health checks to gather as much rich data as possible and power the right decisions - something like the sketch below.
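To make that more concrete, here is a minimal sketch of firing off cheap, parallel health checks during triage. The service names and /healthz endpoints are hypothetical placeholders for illustration, not any particular product’s API.

```python
# Minimal sketch: fan out cheap, fast health checks in parallel during incident triage.
# Service names and endpoints below are hypothetical placeholders.
import concurrent.futures
import urllib.request

HEALTH_CHECKS = {
    "payments-api": "https://payments.internal.example.com/healthz",
    "orders-db-proxy": "https://orders-db.internal.example.com/healthz",
    "auth-service": "https://auth.internal.example.com/healthz",
}

def probe(name: str, url: str, timeout: float = 3.0) -> tuple[str, str]:
    """Return (service, status) without raising, so one failure never blocks the rest."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return name, f"HTTP {resp.status}"
    except Exception as exc:  # during triage we want the error recorded, not a crash
        return name, f"UNREACHABLE ({exc})"

def run_health_checks() -> dict[str, str]:
    # Run all probes concurrently: broad, low-cost signal gathered quickly.
    with concurrent.futures.ThreadPoolExecutor(max_workers=len(HEALTH_CHECKS)) as pool:
        futures = [pool.submit(probe, name, url) for name, url in HEALTH_CHECKS.items()]
        return dict(f.result() for f in concurrent.futures.as_completed(futures))

if __name__ == "__main__":
    for service, status in sorted(run_health_checks().items()):
        print(f"{service:20s} {status}")
```

The point is not these specific checks but the pattern: gather broad, low-cost data quickly and in parallel, so humans can make the next nuanced decision with evidence rather than guesswork.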
Good progress has been made on alerting when very good automation and advanced technical architectures go off the rails in edge cases, providing an early indication rather than waiting for a customer to call. This has progressed further with AIOps and other technologies distilling huge numbers of alerts to separate signal from noise and trigger action from the right people in the organization. The tricky thing, though, is that amongst this layer of complexity it’s very unlikely you can simply automate the response in its entirety, as no two incidents will ever be similar enough to trigger a canned response. A rough illustration of that distillation step follows.
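As an illustration of the distillation idea (not a description of any specific AIOps product), here is a rough sketch that collapses a noisy alert stream into a handful of grouped signals worth routing to the right people. The field names and thresholds are assumptions made for the example.

```python
# Illustrative sketch: collapse a noisy stream of alerts into grouped signals.
# Field names and thresholds are assumptions for the example, not a real AIOps API.
from collections import defaultdict
from dataclasses import dataclass

@dataclass
class Alert:
    service: str
    symptom: str    # e.g. "high_latency", "5xx_spike"
    severity: int   # 1 (info) .. 5 (critical)

def distil(alerts: list[Alert], min_count: int = 3, min_severity: int = 3) -> list[dict]:
    """Group alerts by (service, symptom) and keep only groups that look like real signal."""
    groups: dict[tuple[str, str], list[Alert]] = defaultdict(list)
    for alert in alerts:
        groups[(alert.service, alert.symptom)].append(alert)

    signals = []
    for (service, symptom), members in groups.items():
        worst = max(a.severity for a in members)
        # Noise filter: a one-off, low-severity alert never pages anyone.
        if len(members) >= min_count or worst >= min_severity:
            signals.append({"service": service, "symptom": symptom,
                            "count": len(members), "worst_severity": worst})
    # Surface the most severe, most repeated groups first.
    return sorted(signals, key=lambda s: (-s["worst_severity"], -s["count"]))
```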
Each incident will need something a little more bespoke or nuanced, requiring human intervention. That means you arrive at a situation where the right people are aware - but what do they do next? Too often they are left on an open bridge call with little or no ability to orchestrate across humans and technology in a nuanced way to resolve things quickly.

Looking back after you’ve solved the incident, accountability becomes a political football, and that’s where the new definition and significance of observability comes in. It has historically been very difficult and time-consuming to get hold of a full audit trail of a particular incident: what went on in solving it, the stakeholders involved, and the information exchanged across multiple communication channels, shared documents, command line interactions, the works. What’s essential, especially when senior stakeholders and often the regulator come knocking, is a fully observable way of seeing the incident ‘in flight’ and a full audit trail afterwards, in one place. When we encounter an unprovable truth, per Gödel’s theorem, we want to capture how we tackled it and move to a state where any future encounter can be handled from a canned playbook rather than as unplanned, ad hoc work.
On that basis, building out a documented, indelible audit trail that automatically captures all actions during an event means that responses and decisions are planned, documented and supported; teams can answer regulatory and executive questions; and everyone builds a clearer understanding of what happened, which feeds back into the playbook for future occasions. A minimal sketch of such a trail is below.
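As a sketch of what ‘indelible’ could mean in practice, here is a minimal append-only, hash-chained audit log. The field names, file layout and example entries are assumptions for illustration, not a description of any particular tool.

```python
# Minimal sketch of an indelible (append-only, hash-chained) incident audit trail.
# File layout, field names and example entries are illustrative assumptions.
import hashlib
import json
import time

class AuditTrail:
    def __init__(self, path: str):
        self.path = path
        self._last_hash = "0" * 64  # genesis hash for the first entry

    def record(self, actor: str, action: str, detail: str) -> None:
        """Append one event; each entry embeds the hash of the previous one,
        so quietly editing any earlier line breaks the chain."""
        entry = {
            "ts": time.time(),
            "actor": actor,
            "action": action,
            "detail": detail,
            "prev": self._last_hash,
        }
        payload = json.dumps(entry, sort_keys=True)
        self._last_hash = hashlib.sha256(payload.encode()).hexdigest()
        with open(self.path, "a", encoding="utf-8") as log:
            log.write(payload + "\n")

# Example: capture what responders actually did, as they do it.
trail = AuditTrail("incident-2021-11-18.log")
trail.record("j.smith", "health_check", "payments-api returned HTTP 503")
trail.record("j.smith", "escalation", "paged database on-call via bridge")
trail.record("a.jones", "remediation", "restarted orders-db-proxy, latency recovered")
```

Because each entry carries the hash of the one before it, tampering with the record after the fact is detectable - which is exactly the property you want when regulators and executives come asking what happened and when.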
Operational resilience is at the centre of board discussions across every organization at the moment. As we continue to advance the technology of our systems and increase our pace to deliver increasingly competitive solutions for customers, we see more and more of these events occurring, bringing with them reputational, economic and regulatory implications.

Incident response has always resided in the land of a very few experienced, well-respected practitioners who could handle the complexity, the lack of visibility, the pressure, the accountability and the second guessing. The people who run incidents are the ones I have the most respect for, because you have to maintain composure, working step by step to make sure you never get burnt twice. Having worked in the face of adversity up until this point, these first responders of the IT world deserve better equipment and more executive support.

When an incident happens, the faster you respond, the less likely it is to have a negative impact, and to do that the balance of teams and technology is essential. We won’t be 100% automating the emergency services of IT for some time yet; it will be a great hybrid of human and machine. But we can use that automation to help dynamically orchestrate and capture our response to major disruption. From discovery and reaction, to solving the problem in hand, to remediating everything that went wrong, we can use technology to augment our incident response. It’s a hybrid game, and that’s paving the way forward to even quicker and smarter recovery.