Instagram, WhatsApp and Facebook outage: What caused the crash?

Thumb down: the Facebook corporate headquarters in Menlo Park, California (AFP via Getty)

Facebook has finally explained something of how its main app, Instagram, WhatsApp and many more services besides went offline in one of the biggest outages in its history.

The company says it was an internal problem, rather than a cyber attack from outside. And it says that there is no indication any data was compromised.

But some things still remain unclear, including how it was able to happen and what Facebook did to fix it.

Facebook’s explanation was cryptic, as it usually is after its failures. Its blog post was a mere four paragraphs long, only two of which were devoted to explaining what happened and how.

“Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centers caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centers communicate, bringing our services to a halt,” wrote Santosh Janardhan, Facebook’s vice-president for engineering and infrastructure.

“Our services are now back online and we’re actively working to fully return them to regular operations. We want to make clear at this time we believe the root cause of this outage was a faulty configuration change. We also have no evidence that user data was compromised as a result of this downtime.”

The other paragraphs apologised to the “people and businesses around the world who depend on us”, and said that Facebook understands “the impact outages like these have on people’s lives, and our responsibility to keep people informed about disruptions to our services”. The post also committed to learn more – but not necessarily share more – about what had happened and how it can be avoided in the future.

Facebook also confirmed a rumour that had been circulating, saying that the failure had hit the “internal tools and systems we use in our day-to-day operations, complicating our attempts to quickly diagnose and resolve the problem”. It didn’t specify the extent of that, but it meant that Facebook staff were unable to communicate with each other about the outage or to access the systems required to bring it to an end.

From Facebook’s explanation, as well as information gathered during and since from outside the company, it seems clear that the problem was to do with the way that internet traffic is routed around the world, which relies on two important technologies: the Domain Name System (DNS) and the Border Gateway Protocol (BGP).

Though they do different things, the fundamental role of both of those technologies is basically the same: they are something like address books or route maps for traffic to know where it is going. When the records were changed or deleted, just before the stoppage began, apps and browsers couldn’t find their way to Facebook’s content and the whole thing fell over.
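The address-book comparison can be made concrete with a toy model. The short Python sketch below is purely illustrative and is not how Facebook's systems work: the domain `facebook.example`, the address `192.0.2.10`, the router name `backbone-router-1` and the function names are all made up for the example. It shows that traffic needs both a name record (the DNS-like step) and a route (the BGP-like step), and that removing either one leaves it with nowhere to go.

```python
import ipaddress


def resolve(name, dns_records):
    """DNS-like step: look up a name in the address book."""
    return dns_records.get(name)


def next_hop(address, routes):
    """BGP-like step: pick the most specific route that covers the address."""
    ip = ipaddress.ip_address(address)
    matches = [net for net in routes if ip in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[best]


def reach(name, dns_records, routes):
    """Try to reach a site: first find its address, then find a route to it."""
    address = resolve(name, dns_records)
    if address is None:
        return "name not found"
    hop = next_hop(address, routes)
    if hop is None:
        return "no route to " + address
    return "deliver via " + hop


# Hypothetical records, using addresses reserved for documentation.
dns = {"facebook.example": "192.0.2.10"}
routes = {ipaddress.ip_network("192.0.2.0/24"): "backbone-router-1"}

print(reach("facebook.example", dns, routes))  # deliver via backbone-router-1
print(reach("facebook.example", dns, {}))      # no route to 192.0.2.10
print(reach("facebook.example", {}, routes))   # name not found
```

In this sketch the site itself never changes; only the records that point to it do, which is why an app or browser can fail to find a service that is otherwise intact.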

But other things remain unclear. Facebook has given no indication of how the records were able to be changed without sufficient safeguards in the first place, given they are so vital to the running of the company.

These gaps in information are dangerous, not just for Facebook’s reputation but also for the public understanding of the truth. Without clear, verifiable updates on what caused the outage, conspiracy theories were able to flourish. A number of people suggested that Mark Zuckerberg had intentionally taken the site down to deflect attention from the whistleblower scandal, for instance, but there is absolutely no indication that happened.

Some things will never be known, such as the true cost. While there are various ways of estimating it – Facebook is estimated to lose $10m (£7.3m) per hour of downtime, for instance, which over a six-hour outage comes to about $60m, and Mark Zuckerberg’s share holdings lost around $7bn – the true extent of the loss around the world will be very difficult to calculate.

Some of the spending that would otherwise have happened during the outage will come back straight away, since advertisers are likely to simply use the same budget now the site is back up. But some of it might be gone forever: Mark Zuckerberg has previously expressed worry that when users leave Facebook’s services during an outage, especially its private chat apps, they might never come back.

In its apologies, Facebook looked to highlight the vast array of companies that rely on its services, a move that has also been central to its defence against both regulatory pressure and changes by Apple that limit how much Facebook can track its users. “To the huge community of people and businesses around the world who depend on us: we’re sorry,” read one representative message, part of the apology issued on Facebook’s Twitter account.
