Why did Facebook, Instagram and WhatsApp go down and will it happen again?

Facebook, WhatsApp and Instagram are back online after a global outage took them offline for several hours on Monday. More than three billion users were affected by the disruption, which began shortly before 5pm. Facebook and Instagram became accessible again late on Monday evening, while WhatsApp said its services were “back and running at 100%” as of 3.30am on Tuesday.

It's extremely rare for online giants to go offline for such a long period, so what caused the problem, why did it take so long to fix and what impact will it have on the companies in the long term?

What went wrong?

Facebook, which owns WhatsApp and Instagram, has blamed a “faulty configuration change” for the outage.

It said in a statement: “Our engineering teams have learned that configuration changes on the backbone routers that coordinate network traffic between our data centres caused issues that interrupted this communication. This disruption to network traffic had a cascading effect on the way our data centres communicate, bringing our services to a halt.”

However, web infrastructure and security firm Cloudflare has provided a detailed breakdown of the incident as it saw it unfold, and said it revolved around two key mechanisms which make the internet work – Domain Name System (DNS) and Border Gateway Protocol (BGP).

In essence, DNS is the internet's address book and BGP its roadmap. DNS helps people find the address of the website they want, and BGP then works out the best available route to it across the vast mesh of connected networks that make up the internet.

Cloudflare said that, through a series of updates on Monday, Facebook had seemingly accidentally told the rest of the internet via BGP that the routes to everything it runs were no longer there, meaning people could no longer find a way to the social network.
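For readers who want to see the idea in miniature, the sketch below models the address book and the roadmap as two simple lookups. It is a toy illustration only, not how Facebook's or Cloudflare's systems work: the hostname `social.example`, the address `192.0.2.10`, the path names and every function name are made up for this example.

```python
import ipaddress

# Toy "address book" (DNS): hostname -> IP address.
# Both the name and the address are made-up documentation values.
dns_records = {"social.example": "192.0.2.10"}

# Toy "roadmap" (BGP): advertised network prefix -> path used to reach it.
routes = {
    ipaddress.ip_network("192.0.2.0/24"): ["isp", "backbone", "social-network"],
}


def resolve(hostname):
    """Look up a hostname in the address book; returns None if unknown."""
    return dns_records.get(hostname)


def find_route(address):
    """Return the path for the most specific prefix containing the address."""
    ip = ipaddress.ip_address(address)
    matches = [net for net in routes if ip in net]
    if not matches:
        return None
    best = max(matches, key=lambda net: net.prefixlen)
    return routes[best]


def withdraw(prefix):
    """Simulate a route withdrawal: the prefix is no longer advertised."""
    routes.pop(ipaddress.ip_network(prefix), None)


def reach(hostname):
    """Try to reach a site: first find its address, then a route to it."""
    address = resolve(hostname)
    if address is None:
        return "name not found"
    path = find_route(address)
    if path is None:
        return "no route to " + address
    return "reached via " + " -> ".join(path)
```

In this toy model, `reach("social.example")` finds a path while the route is advertised. After `withdraw("192.0.2.0/24")`, the same call reports that there is no route, even though the address book entry is still there.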

Experts have said this is most likely to have been caused by a software bug in the updates or human error.

As of Monday afternoon, there was no evidence that malicious activity was involved.

Why did it take so long to be fixed?

Facebook said that, as well as taking its platforms offline, the configuration change also hit its internal systems, "complicating our attempts to quickly diagnose and resolve the problem".

While much of Facebook’s workforce is still working remotely, there were reports that employees at work on the company’s California campus had trouble entering buildings because the outage had rendered their security badges useless. It is not clear whether these employees were needed to fix the problem.

One expert noted that ongoing social distancing measures and remote working because of the pandemic may also have played a part.

Software testing expert Adam Leon Smith, of BCS, The Chartered Institute for IT, said: “It is unlikely the issues were directly caused by people working from home, however it is quite possible that it took so long to restore the service because of reduced staffing within the data centre.

“This would compound the problem because the nature of the failure meant that remote access to the data centre was also unavailable.”

Was my personal data at risk?

According to Facebook, there is “no evidence that user data was compromised” by the outage.

How will the companies be impacted in the long term?

Facebook’s share price plummeted 4.9% amid the outage and, according to Bloomberg, CEO Mark Zuckerberg's personal wealth took a £4.4 billion ($6 billion) hit.

The disruption could further wound Facebook, coming after a whistleblower claimed that the company places profits before people. Just 24 hours before the technical issues, Frances Haugen, a former Facebook product manager, alleged in a US interview that the company knowingly uses algorithms to spread false information.

How can an outage of this scale be avoided in the future?

This latest incident, after the major outages linked to Cloudflare in 2020 and Fastly earlier this year, again highlights the potential problems with having large portions of the internet reliant on just a handful of large companies.

There are currently no obvious solutions to this, but this latest outage is likely to reignite the debate around internet infrastructure.

Facebook itself acknowledged that the outage impacted people and businesses globally, adding it was further reviewing what happened "so we can continue to make our infrastructure more resilient".