Global IT outage: All it took was a few lines of code and millions of machines were dead - the risks of complexity
This wasn't supposed to happen.
We were told that as the internet matured, this kind of thing - a single error causing a domino effect that takes out millions of machines - would become less and less likely. There would be more and more servers and cables distributed in more and more places, making a single point of failure increasingly unlikely.
Instead, today's episode - in which an update from a company called CrowdStrike essentially broke the Windows operating system on its customers' computers around the world - has underlined that the more complex a system becomes, the more vulnerable it often is to collapse.
The irony at the centre of the chaos
The great irony, of course, is that CrowdStrike's raison d'être is to prevent moments like this from happening. The company's "Falcon Sensor" is a product used to prevent cyber attacks - a complex programme best thought of as a kind of super anti-virus package, which, in order to do its job, gets privileged access to more parts of your machine than regular software.
But it so happens the latest update to Falcon Sensor, uploaded overnight to computers around the world, had a dodgy bit of code in it, which caused Windows machines to crash.
How can it be resolved?
Right now, it looks as if the only way it can be resolved is by technicians rebooting each machine and manually deleting a particular file (C-00000291*.sys since you asked). In other words, spare a thought for your company's technicians, because they're about to have a long weekend.
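For the curious, that manual clean-up can be sketched in a few lines of Python. This is only an illustration, not CrowdStrike's or Microsoft's official procedure: the file pattern is the one quoted above, while the function name `remove_faulty_channel_files`, the `dry_run` flag and the folder you point it at are assumptions made for the example. It also assumes the machine has already been rebooted into a state where the file can be reached.

```python
from pathlib import Path

# File pattern quoted in the article. The folder holding the file is NOT
# named in the article, so the caller must supply it (an assumption here).
FAULTY_PATTERN = "C-00000291*.sys"


def remove_faulty_channel_files(driver_dir, dry_run=True):
    """Find files matching FAULTY_PATTERN in driver_dir.

    With dry_run=True (the default) nothing is deleted; the matches are
    only returned so a technician can check them first.
    """
    matches = sorted(Path(driver_dir).glob(FAULTY_PATTERN))
    for path in matches:
        if not dry_run:
            path.unlink()  # delete the offending file
    return matches
```

A technician might first call `remove_faulty_channel_files(folder)` to list what would be removed, then repeat the call with `dry_run=False`. Defaulting to a dry run is a deliberate choice: deleting the wrong driver file on a machine that is already struggling would only make the weekend longer.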
But perhaps the most striking lesson from the episode is a more ancient one, laid out by historian Joseph Tainter in his 1988 book The Collapse of Complex Societies.
The more complex societies and systems become, the more vulnerable they are to collapse. Tainter was referring to examples like the fall of Rome or the collapse of ancient Mesopotamian civilisation, but one could just as easily apply the logic to modern examples.
Society's complexity is making us vulnerable
Lurking beneath Tainter's thesis was the point that, in a complex society or organisation, actors often make decisions which seem sensible but which, because the system is too complex for them to understand, actually make it more vulnerable.
Consider the subprime crisis which triggered the financial crisis of 2008. Mortgages were packaged and repackaged into assets sold, eventually, on to banks which had little understanding of their actual value and their risks. The more complex the system became, the less able people were to comprehend how exposed they were to a catastrophic failure, and the more vulnerable the entire edifice was to collapse.
Now let's ponder the current IT malaise. Let's ask ourselves: how did it come to be that so many companies around the world had the very same bit of software installed on their systems, making them vulnerable to the very same lines of duff code?
After all, the vast majority of people working at the companies affected will never have heard of CrowdStrike. Like the bankers presiding over the financial crisis, they had no idea of the potential vulnerabilities lying within their systems.
But in recent years, as businesses have become more and more concerned about the risk of cyber attacks, they have begun to implement cyber security checks and regulations. These often take the form of a checklist some poor operative has to fill out: how many computers have you got? What operating system? Are they all online? What forms of cyber protection do they have? And so on.
Now, this might sound like frustrating red tape to many of you, but the reality is that these days some companies stipulate that anyone doing business with them must have fulfilled all the items on the checklist.
So all of a sudden, salespeople trying to do a deal would discover that they couldn't do it without complying with the checklist. The company's financial survival depended on being able to tick the boxes!
How one company became so powerful
And invariably one of the boxes in those checklists was: do you have an endpoint detection and response (EDR) solution? And if you didn't have an EDR solution (or, more likely, didn't know what one was) then invariably you googled EDR and looked for the world's biggest provider, which just so happened to be… CrowdStrike.
Perhaps you spoke to your IT provider and insisted that you needed an EDR. Perhaps they said: "Oh, I wouldn't do that if I were you" - but then… no EDR, no sale.
This is a stylised example, of course, but you see how this kind of thing can happen.
And hence, gradually and imperceptibly, a large proportion of the world's companies came - mostly unbeknownst to their leaders - to be running the very same piece of software with direct access to the most privileged parts of their computers. And then all it took was a few lines of code and all of those machines were instantly dead - or rather, they faced the "Blue Screen of Death".
So there's a reminder here about the risks of complexity.
Too early to tell extent of disruption and economic damage
It's way too early to put a figure on how much disruption this episode has caused and how much economic damage it has wrought. The short answer is almost certainly: a lot. Millions of people around the world have been unable to travel, to communicate, to transact. It may well transpire that it has put lives at risk, given it has affected many doctors' ability to do their job.
Perhaps the best thing that can be taken from today's chaos is that it might just serve as a cautionary tale which could make our computers that bit safer and more stable in the future. It might remind bosses that cyber security decisions are more than box-ticking exercises - and sometimes installing cyber security software can backfire.
It reminds us how dangerous it is if everyone in the world is relying on the same provider. It reminds us about the need for redundancy - to have backup systems. It reminds us of the dangers of complexity.
This probably won't come as much consolation if you're one of those people whose holiday plans have been disrupted or whose business has been messed around by the IT outage today. But it's something.