The Internet Was Supposed to be Decentralized and Resilient. So why all these Massive Outages?

While everyone (except the especially nerdy) was sleeping, an oxymoronic consolidation has crept into the systems underlying our websites, commerce sites, and business applications. That consolidation has effectively created mini anti-internets woven throughout, and when those anti-internets have large enough problems, they end up crashing systems globally.

Such a disruption happened this past Thursday, taking out a wide range of major corporate websites and back-office systems, including FedEx, Bank of Montreal, British Airways, Royal Bank, HSBC, and Airbnb, all the way to the PlayStation and Steam game platforms.

The outages were determined to be caused by service disruptions at Akamai and Oracle, two key providers of internet "cloud" infrastructure services, and ultimately turned out to be entirely the result of Akamai's service disruption.

Akamai's Edge DNS service helps route web browsers to their correct destinations and, in doing so, provides redundancy, a measure of failover, and security services.
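To see that kind of dependency for yourself, here's a minimal sketch, using the third-party dnspython package and a purely illustrative domain name, that lists which nameservers, and therefore which DNS provider, a domain relies on:

```python
# Minimal sketch: list the authoritative nameservers a domain depends on.
# Requires the third-party dnspython package (pip install dnspython);
# "example.com" is purely illustrative.
import dns.resolver

def nameservers_for(domain: str) -> list[str]:
    """Return the authoritative nameserver hostnames for a domain."""
    answer = dns.resolver.resolve(domain, "NS")
    return sorted(str(record.target).rstrip(".") for record in answer)

if __name__ == "__main__":
    for nameserver in nameservers_for("example.com"):
        print(nameserver)
```

If every entry points at the same provider, that provider is, by definition, a single point of failure for anyone trying to reach you.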

A little less than an hour after the outages had started, Akamai released a statement: "We have implemented a fix for this issue, and based on current observations, the service is resuming normal operations. We will continue to monitor to ensure that the impact has been fully mitigated."

The determination was that a "software configuration update triggered a bug in the DNS system", that the outage lasted "up to an hour", and that it was not the result of a cyberattack.

... this is not about severing your ties with cloud infrastructure providers, rather it's about how you rely on them

Most affected sites and system services were restored in less than an hour. But the damage was done. The question is: how can the very decentralized genius of the internet be brought down by such a routine (and boring) issue? The answer: because scaled-up systems such as Akamai's Edge DNS effectively circumvent that very decentralized genius of the internet -- an anti-internet.

At this point it's worth remembering what the internet was invented for: to create a self-directed, multi-routing network that, by its very architecture, can find a route to its destination server no matter what -- almost. Even if parts of the internet's network connections are interrupted or unavailable, the DNS and routing infrastructure (among other systems) knows how to find a viable path, even if it's less than ideal. Hence the genius of its resilience.
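To make that principle concrete at the name-lookup level, here's a minimal sketch of the same idea applied to resolvers: rather than depending on one, try several independent ones in turn. It uses the third-party dnspython package, the addresses shown are well-known public resolvers chosen purely for illustration, and it works at the recursive-resolver level rather than the authoritative level that Thursday's outage hit:

```python
# Minimal sketch of the fallback principle: try independent resolvers in turn
# rather than relying on a single one. Requires dnspython; the resolver
# addresses are illustrative public resolvers.
import dns.resolver

PUBLIC_RESOLVERS = ["1.1.1.1", "8.8.8.8", "9.9.9.9"]

def resolve_with_fallback(hostname: str) -> str:
    """Return an A record for hostname, trying each resolver until one answers."""
    last_error = None
    for resolver_ip in PUBLIC_RESOLVERS:
        resolver = dns.resolver.Resolver(configure=False)
        resolver.nameservers = [resolver_ip]
        resolver.lifetime = 2.0  # fail fast and move on to the next resolver
        try:
            answer = resolver.resolve(hostname, "A")
            return answer[0].address
        except Exception as exc:  # timeout, SERVFAIL, NXDOMAIN, ...
            last_error = exc
    raise RuntimeError(f"all resolvers failed for {hostname}") from last_error
```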

The moment you pull a critical service such as DNS out of that decentralized fabric and centralize it, or put too many dependent technologies under one platform or infrastructure, you've created an all-too-important consolidation point that acts as the primary source of truth. The moment that centralized service becomes unstable, you've taken on so much responsibility that what's left on the real internet isn't enough to keep you going. If you don't appear entirely broken, you're at least broken enough to be pretty much useless.

In 2019, the average cost of a critical server outage ranged between $301,000 and $400,000 USD, according to a 2020 Statista report. While this represents a broad spectrum of companies and sizes, the message is clear: online systems of all flavours represent an increasingly critical piece of economic infrastructure that deserves attention.

Bottom up, not Top down

The challenge with this model starts with what exactly is being centralized, and especially where, in the stack powering your CMS, website, commerce site, or company applications, its magic kicks in.

That's the crux of the issue: CDN and broader cloud infrastructure service providers have morphed into magical pixie dust providers - at least in popular culture. It's become commonplace to assume that because "we use Cloud-provider-X", you're "just protected" from issues ... somehow. Because it's magic pixie dust.

But they're good services, aren't they?!

They're great providers - yes! We use them in strategic places here and there, and we encourage businesses to use them (details to come on "how"). So what's the problem, you ask, weary reader? It's all your eggs in one basket, or the baby out with the bathwater, whichever expression you prefer. Increasingly it appears, in a totally not-scientific study, that organizations are treating cloud infrastructure providers as the de facto guardian of all things redundancy. In other words, they believe that if they sign on that one dotted line, the cloud infrastructure provider will do the rest regardless of whatever harmful events occur, with the possible exception of nuclear annihilation.

And that's simply not so. Clearly.

Take charge of your own stack redundancy. As mentioned earlier, this is not about severing ties with cloud infrastructure providers, rather it's about how you rely on them. At a simple, high level, most applications are going to need at least a base stack of technical services:

  1. website application: the part you see when you first visit or log in as a customer or administrator
  2. DNS services: this is the less known or obvious piece, but (obviously) your website or system services need to be reachable by the world. DNS, or Domain Name System, holds the addresses that determine how all of your services get routed when someone, or something, calls on your site or services. More about this later.
  3. databases: where content lives, including inventory data, customer data, logs, purchases, and all those important pieces of information that live under the hood; your website/application needs the databases to feed it information in real time
  4. assets: in the beginning, assets such as images and code most often live within the website application itself. As audience and demand grow and become more complex, these can be broken out (abstracted) away from the main application into their own space, freeing the web application to do one thing: respond to customers and staff. (A simple way to map this stack and spot provider concentration is sketched below.)
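As a trivial illustration (the provider names are placeholders, not a recommendation), laying that base stack out makes it immediately obvious where a single provider backs more than one layer:

```python
# Purely illustrative inventory of the base stack above; provider names are
# placeholders. The point is to spot layers that share one provider.
from collections import Counter

STACK = {
    "website_application": "provider-a",
    "dns": "provider-a",
    "database": "provider-a",
    "assets_cdn": "provider-b",
}

for provider, count in Counter(STACK.values()).items():
    if count > 1:
        print(f"{provider} backs {count} layers of the stack -- a shared point of failure")
```

Three of the four layers sharing one provider means one bad configuration push upstream can touch all three at once.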

So ... how do we bring together the magic of an already redundant internet with an online presence, application, or more, that uses that redundancy to its advantage rather than circumventing it? The answer is both easy and complicated (if you've read this far, you probably already understand this).

Breaking this down to its primary service points: there are the applications or services (as above), and there are the routing points *to* those applications or services. Then there is what happens when the primary provider of a given service stops functioning. This brings us to the first decision point: break down each point of service where a failure means an interruption to your site, application, or services, in whole or in part. For example:

  • If primary DNS service stops, is there failover outside of your primary provider? (hint: there can be)
  • If your web server becomes unavailable, is there a secondary presence outside of your primary provider? (hint: there can be; see the sketch after this list)
  • If your database server becomes unavailable, is there a synced clone outside of your primary provider ready to go? (hint: there can be)
  • If your systems call on services such as streaming, image and asset server libraries (CDN), or external data lookups, are there secondary clones outside of your primary provider? (hint: there can be)
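As one concrete example for the web-server question above, here's a minimal sketch of a health check that probes a primary endpoint and a standby hosted with a different provider, then serves from whichever answers first. The URLs are placeholders:

```python
# Minimal sketch: probe a primary endpoint and a standby hosted elsewhere,
# and serve from the first one that answers. URLs are placeholders.
import urllib.request

ENDPOINTS = [
    "https://www.example.com/health",      # primary provider
    "https://standby.example.net/health",  # secondary, different provider
]

def first_healthy(endpoints: list[str], timeout: float = 3.0) -> str:
    """Return the first endpoint that responds with HTTP 200."""
    for url in endpoints:
        try:
            with urllib.request.urlopen(url, timeout=timeout) as response:
                if response.status == 200:
                    return url
        except OSError:
            continue  # unreachable, timed out, or returned an error status
    raise RuntimeError("no healthy endpoint found")

if __name__ == "__main__":
    print("serving from:", first_healthy(ENDPOINTS))
```

Real failover usually lives in your DNS records or load balancer rather than in application code, but the decision it encodes -- more than one provider behind every critical answer -- is the same.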

If it's not already obvious, each piece of your site, application, or ecosystem that relies 100% on a single cloud provider is at risk of precisely the kind of cloud outage that has been occurring with greater frequency. By analyzing your key data interaction points, you can take greater control of your ecosystem's robustness and your stack's flexibility. And the best thing? It doesn't have to cost a penny more than you're already spending; just more focused awareness.