To borrow some financial market metaphors, it’s hard to argue that cloud providers aren’t a “systemically important” part of the Internet. If one fails catastrophically, it’s more probable than you might think for others to quickly follow.
Compared to 10 years ago, Infrastructure-as-a-Service (IaaS) has greatly simplified web engineering. Gone are the days of assembling and racking servers – a chore that every start-up went through in the 90s and even well into the Web 2.0 era. But any simplification involves trade-offs, and the rise of IaaS is no different. One to consider is reliability.
With IaaS, you’re outsourcing a good chunk of your reliability to a third party. They give you an SLA that compensates you for outages, but the compensation is usually limited to what you’re paying them – not the true cost of the outage to your business. At best, the SLA aligns interests a bit; they suffer when they cause you to suffer, albeit not as much. In practice, this often works OK. Cloud providers are pretty reliable, and it’s very difficult for any young engineering team to credibly claim that their home-grown systems architecture is going to be more reliable than Amazon Web Services (AWS).
But what if AWS isn’t reliable enough for you? A straightforward approach is to avoid depending on AWS alone, which is appealing for cost and lock-in reasons in addition to reliability. If AWS fails, you’ll quickly fail over to Google App Engine (GAE), Azure, etc., right? But what happens when a lot more AWS users opt for the same approach? Then an AWS failure becomes a huge load increase on GAE, which could cause it to fail as well. The probabilities of AWS failing and GAE failing are not independent.
That is a subtle point: The probability that AWS fails is low. The probability that GAE fails is low. The probability that GAE fails, given that AWS fails, is not as low.
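To make the point concrete, here is a minimal sketch of the arithmetic. Every number in it is a made-up illustration, not a measured failure rate for AWS or GAE, and the function name is hypothetical. It compares the chance of losing both providers at once when failures are treated as independent against the chance when a failure of one makes a failure of the other more likely.

```python
def joint_failure_probability(p_primary: float, p_backup_given_primary: float) -> float:
    """P(primary fails AND backup fails) = P(primary fails) * P(backup fails | primary fails)."""
    return p_primary * p_backup_given_primary


# Hypothetical figures, chosen only to illustrate the reasoning.
P_AWS = 0.001            # assumed chance AWS fails in some time window
P_GAE = 0.001            # assumed chance GAE fails in the same window
P_GAE_GIVEN_AWS = 0.05   # assumed chance GAE fails once AWS has already failed

# If the failures were independent, P(GAE | AWS) would simply equal P(GAE).
independent = joint_failure_probability(P_AWS, P_GAE)
# With failover load landing on GAE, the conditional probability is higher.
correlated = joint_failure_probability(P_AWS, P_GAE_GIVEN_AWS)

ratio = correlated / independent
print(f"independent: {independent:.2e}, correlated: {correlated:.2e}, ratio: {ratio:.0f}x")
```

Under these assumed numbers, a simultaneous outage is roughly 50 times more likely than the naive "multiply the two small probabilities" estimate suggests. The real multiplier is unknown, which is exactly the problem.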
And it’s hard to predict. How many AWS users have disaster scenarios where they migrate to GAE? How severe an AWS failure would it take for them to invoke these failover procedures? How much spare capacity does GAE really have?
The logic above applies to all cloud platforms, not just GAE. The general reasoning is simply that there’s a relatively small set of major cloud providers, such that if a big one fails, so might others.
The more fragmented and commoditized the cloud infrastructure market is, the safer we all are. As long as AWS is the dominant player, you’re better off – from a reliability standpoint – picking someone smaller and relying on AWS as your failover. At the very least, should your primary provider fail, you’re maximizing the chance that the aggregate hit to your failover will be manageable.
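The asymmetry in that advice can be sketched with a toy capacity model. The `Provider` class, the `absorbs_failover` helper, and all capacity and load figures below are hypothetical, in arbitrary units; real providers do not publish these numbers. The sketch only asks whether a backup has enough headroom for the load that moves over when another provider fails.

```python
from dataclasses import dataclass


@dataclass
class Provider:
    name: str
    capacity: float  # total capacity, arbitrary units (assumed)
    load: float      # current steady-state load, same units (assumed)


def absorbs_failover(failed: Provider, backup: Provider, failover_fraction: float) -> bool:
    """Return True if `backup` can carry its own load plus the share of
    `failed`'s load that migrates to it."""
    shifted = failed.load * failover_fraction
    return backup.load + shifted <= backup.capacity


# Hypothetical market: one dominant provider, one much smaller one,
# each running at 70% utilisation.
big = Provider("dominant", capacity=1000, load=700)
small = Provider("smaller", capacity=100, load=70)

# The dominant provider fails and just 10% of its customers fail over
# to the smaller one: 70 + 70 units against a capacity of 100.
big_to_small = absorbs_failover(big, small, failover_fraction=0.10)

# The smaller provider fails and all of its customers fail over
# to the dominant one: 700 + 70 units against a capacity of 1000.
small_to_big = absorbs_failover(small, big, failover_fraction=1.0)

print(f"small absorbs big's failover: {big_to_small}")
print(f"big absorbs small's failover: {small_to_big}")
```

With these assumed figures, the smaller provider is overwhelmed by even a tenth of the dominant provider's customers, while the dominant provider comfortably absorbs every customer of the smaller one.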