When Amazon’s S3 cloud storage service went down at the company’s US-EAST-1 data centers, things went bad. Very, very bad.
How bad? Popular sites such as Quora, Business Insider, Netflix, Reddit and Slack either crashed entirely or were broken. By SimilarTech’s count, over 124,000 sites were affected. A college student told me, “It’s knocked out my school’s technology back end. Students are freaking out because they can’t access assignments.” A cloud consultant told me his phone was ringing off the hook with calls from Amazon Web Services (AWS) customers who wanted to switch to Azure.
You get the idea. It was Bad with a capital B.
And, what caused this multimillion-dollar fiasco? A typo.
I quote from AWS’s explanation message: “The Amazon Simple Storage Service (S3) team was debugging an issue causing the S3 billing system to progress more slowly than expected. At 9:37AM PST, an authorized S3 team member using an established playbook executed a command which was intended to remove a small number of servers for one of the S3 subsystems that is used by the S3 billing process. Unfortunately, one of the inputs to the command was entered incorrectly and a larger set of servers was removed than intended. The servers that were inadvertently removed supported two other S3 subsystems. One of these subsystems, the index subsystem, manages the metadata and location information of all S3 objects in the region.”
Wow. That was one heck of a typo.
One of my favorite jokes is “To err is human. To really foul up takes a computer.” To that I can now add: “And to really, really foul up takes a cloud.”
People have been nervous about relying on public clouds for some time now. Well, as this just showed, they have reason.
Mind you, this wasn’t all AWS’s fault. Yes, there was a big foul-up at US-EAST-1, but the other AWS regions were just fine. Many customers, however, chose to keep all their IT assets in that one AWS region. If yours is a small business, that’s understandable. Running a full-scale business in multiple regions isn’t cheap. Major companies, such as the ones listed above, don’t have that excuse.
Even smaller businesses can afford a cloud disaster recovery plan such as those offered by Acronis. But, as this AWS foul-up showed, too many companies, both big and small, didn’t bother to protect their systems from failure.
Switching to Azure, Google Cloud, 1&1 or any other provider won’t make it better. Every cloud will eventually go down.
Oh, and that cloud service-level agreement (SLA) that you thought would protect you? Think again. It covers, at best, your downtime. The damage caused to your business by that downtime is all on you.
Behind the storm clouds there’s a deeper problem. Tristan Louis, internet expert and entrepreneur, wrote recently that while the internet was designed to be decentralized to protect its services, public clouds “have created a model that increases concentration on top of few key players: Amazon, Microsoft and Google now host a large number of sites across the web. Many of those companies customers have opted to host their infrastructure in a single set of data centers, potentially increasing the frailty of the web by recentralized large portion of the net.”
Therefore, with “every new largely centralized system that comes online, the internet becomes more brittle, as centralization creates an increased number of single points of failure.”
If you’re going to continue to invest more of your IT infrastructure in a big public cloud, I encourage you, first, to spread your resources across multiple regions, since this outage took down a whole region at once. Next, if you want to be safe, it’s time to reconsider relying so much on the public cloud and look at private and hybrid cloud models.
In either case, there will still be failures, but at least you won’t have all your IT eggs in one fragile cloud basket.
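For readers who want to see what “more than one basket” can look like in practice, here is a minimal Python sketch of region failover: try the primary region first and fall back to the next one if it fails. Everything in it is an assumption for illustration, not AWS’s API or anyone’s production design. The names `fetch_with_failover`, `AllRegionsFailed` and `fake_fetch` are hypothetical, and the sketch assumes your data has already been replicated to every region you list.

```python
from typing import Callable, Dict, Sequence, Tuple


class AllRegionsFailed(Exception):
    """Raised when no configured region could serve the request."""


def fetch_with_failover(
    key: str,
    regions: Sequence[str],
    fetch: Callable[[str, str], bytes],
) -> Tuple[str, bytes]:
    """Try each region in order; return (region, data) from the first success.

    `fetch` is a caller-supplied function (hypothetical here) that reads
    `key` from the given region and raises an exception on failure.
    """
    errors: Dict[str, Exception] = {}
    for region in regions:
        try:
            return region, fetch(region, key)
        except Exception as exc:  # remember the failure, move to the next region
            errors[region] = exc
    raise AllRegionsFailed(f"{key!r} unavailable in all regions: {errors}")


def fake_fetch(region: str, key: str) -> bytes:
    """Stand-in for a real storage client; simulates a us-east-1 outage."""
    if region == "us-east-1":
        raise ConnectionError("simulated outage")
    return f"{key} from {region}".encode()


if __name__ == "__main__":
    region, data = fetch_with_failover(
        "report.csv", ["us-east-1", "us-west-2"], fake_fetch
    )
    print(region, data)
```

The code is the easy part. The hard and expensive part is keeping a current copy of your data in that second region, which is exactly the bill many companies decided not to pay.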