TL;DR: The real problem is that there isn’t enough capacity. The us-east-1 region is too big, and during an outage there simply aren’t enough resources to allow users to recover their sites.
In part 1 I discussed how the bug rate on AWS doesn’t seem to be getting better. A few bugs aren’t necessarily a big deal. For one, they’re expected given the trailblazing that AWS is doing and the incredibly hard & complex problems they’re solving. This is a recipe for bad things to happen sometimes. To account for this, the true selling point of AWS has always been “if something goes wrong, just get more instances from somewhere else and keep on running”.
In the early days (pre-EBS), good AWS architecture dictated that you had to be prepared for any instance to disappear at any time for any reason. Had important data on it? Better have four instances with copies of that data in different AZs & regions along with offsite backups, because your data could disappear at any time. Done properly, you simply started a new instance to replace it, copied in your data, and went about your merry way.
What has unfortunately happened is that nearly all customers are centralized in us-east-1. This has many consequences for the architecture model described above.
A very common thread in all of the us-east-1 outages over the last two years is that any time there is trouble, the API & management console become overloaded. Every user is trying to move and/or restore their services, all at once. And the API/console has proven to be extremely dependent on us-east-1. PagerDuty went so far as to move to another region to de-correlate us-east-1 failures from their own failures.
Competition for Resources
Once again, by virtue of us-east-1 being the largest region, whenever there is an outage every customer will start trying to provision new capacity in other AZs. But there is seldom enough capacity. Inevitably in each outage there is an entry in the status updates that says “We’re adding more disks to expand EBS capacity”, or “We’re bringing more systems online to make more instances available”, and so forth. You can’t really blame Amazon for this one: they can’t keep the prices they have and always be running below 50% capacity. But when lots of instances fail, or lots of disks fill up, or lots of IP addresses get allocated, there just aren’t enough left.
This is a painful side effect of forcing everyone to be centralized into the us-east-1 region. us-west has us-west-1 & us-west-2 because the datacenters are too far apart to maintain the low-latency connections needed to put them under the same regional designation. us-east has a dozen or more datacenters, and thanks to them being so close together, Amazon has been able to call them all ‘us-east’ instead of ‘us-east-1’ and ‘us-east-2’.
But what happens when a bug affects multiple AZs in a region? Suddenly, having all the AZs in a single region becomes a liability. Too many people are affected at once and they have nowhere to go. And all those organizations that have architected under the assumption that they can “just launch more instances somewhere else” are left with few options.
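The “just launch more instances somewhere else” assumption boils down to a fallback loop. Here is a minimal Python sketch of that idea. Everything in it is hypothetical and for illustration only: `CapacityError`, `launch_with_fallback`, and `fake_launch` are made-up names, and `fake_launch` is a stand-in for whatever real provisioning call you use. It simply pretends that every AZ in us-east-1 is out of capacity.

```python
from typing import Callable, Iterable, Optional


class CapacityError(Exception):
    """Hypothetical error meaning a location has no spare capacity."""


def launch_with_fallback(
    locations: Iterable[str],
    launch: Callable[[str], str],
) -> Optional[str]:
    """Try each location in order; return the first instance ID, or None."""
    for location in locations:
        try:
            return launch(location)
        except CapacityError:
            # This location is full; move on to the next candidate.
            continue
    return None


def fake_launch(location: str) -> str:
    """Stand-in for a real provisioning call (an assumption, not a real API).

    Pretends every us-east-1 AZ is full, as during a region-wide event.
    """
    if location.startswith("us-east-1"):
        raise CapacityError(location)
    return f"i-demo-{location}"


if __name__ == "__main__":
    # Fallback list that stays inside one region: nothing can be launched.
    print(launch_with_fallback(["us-east-1a", "us-east-1b"], fake_launch))
    # Fallback list that includes another region: the last candidate succeeds.
    print(launch_with_fallback(["us-east-1a", "us-east-1b", "us-west-2a"], fake_launch))
```

If every candidate in the list sits in the same region, the loop runs out of options and returns `None`, which is exactly the situation described above. The fallback only helps if the list reaches outside the affected region, and even then only if the data and the capacity are actually there.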
P.S. I know things are sounding a little negative, but stay tuned. My goal here is first to identify the truly dangerous issues facing AWS, and then to describe the best ways to deal with them, as well as why I still think AWS is the absolute best cloud provider available.