There has been much brouhaha over the recent EBS outage. I say EBS specifically because, despite the sensationalist headlines, this was not an AWS or EC2 outage. RDS was also affected, as it is built on top of the EBS product. As has been reported just about everywhere, the outage affected many large websites that are built on top of AWS. Much of the banter was around how the “AWS outage” had “brought down” these sites, which couldn’t be further from the truth. What brought down these services was poor architectural choices combined with a lack of understanding of the unique characteristics of cloud infrastructure.
Our friends at Twilio and SmugMug, along with SimpleGeo, were able to happily weather the storm without significant issues. Why were we able to survive while others were not? It’s simple: anyone who’s spent a lot of time talking with the fine folks at AWS or researching the EBS product would know that it has a number of drawbacks:
- EBS volumes are not anywhere near as fault tolerant as S3. S3 has 11 9’s of durability. EBS volumes are no more or less durable than a simple RAID1.
- EBS volumes increase network IO to and from your instance to some degree and can degrade or interfere with other network services on a given instance.
- EBS volumes lack the IO characteristics that many IO-bound services require. I’ve heard insane stories of people doing RAID0 across dozens of EBS volumes to get the IO performance they’re looking for. At SimpleGeo, we RAID0 the ephemeral drives and split data across multiple ephemeral RAID0 volumes to get the IO we’re looking for (a rough sketch of that setup follows below).
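For the curious, that ephemeral RAID0 setup looks roughly like the sketch below. This is a hedged illustration rather than our actual tooling: the device names (/dev/xvdb, /dev/xvdc) are assumptions that vary by instance type, and in practice you’d bake this into your configuration management instead of running it by hand.

```python
"""Rough sketch: stripe an instance's ephemeral drives into a single RAID0
array. Assumes the ephemeral disks show up as /dev/xvdb and /dev/xvdc
(device names vary by instance type), mdadm is installed, and this runs as
root. Anything on this array vanishes with the instance -- which is exactly
why the data also gets replicated elsewhere."""
import subprocess

EPHEMERAL_DEVICES = ["/dev/xvdb", "/dev/xvdc"]  # placeholder device names
ARRAY_DEVICE = "/dev/md0"
MOUNT_POINT = "/mnt/data"


def run(cmd):
    """Run a command, echoing it first and raising if it fails."""
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)


def build_ephemeral_raid0():
    # Stripe the ephemeral drives together; --run skips the interactive prompt.
    run(["mdadm", "--create", ARRAY_DEVICE, "--run", "--level=0",
         "--raid-devices={}".format(len(EPHEMERAL_DEVICES))] + EPHEMERAL_DEVICES)
    # Drop a filesystem on the array and mount it where the service expects it.
    run(["mkfs.ext4", ARRAY_DEVICE])
    run(["mkdir", "-p", MOUNT_POINT])
    run(["mount", ARRAY_DEVICE, MOUNT_POINT])


if __name__ == "__main__":
    build_ephemeral_raid0()
```

The important part isn’t the specific commands; it’s that the ephemeral array is disposable by design and gets rebuilt automatically whenever an instance is replaced.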
This doesn’t mean that you should avoid EBS or that EBS is a bad product. Quite the contrary. I think EBS is a pretty fucking cool product when you consider its features and simplicity. What it does mean is that you need to know its weaknesses in order to properly build a solution on top of it.
I think the largest surprise of the EBS outage, and one that will no doubt be reviewed in depth by the engineers at AWS, was that an EBS problem in one AZ was able to degrade services in other AZs. My general understanding is that the issue arose when EBS in one AZ got confused about possible network outages and started replicating degraded EBS volumes around the downed AZ. In a very degenerate case, when EBS volumes couldn’t find any peers in their home AZ, they attempted to reach across to other AZs to replicate data to other peers. This led to EBS being totally unavailable in the originally problematic AZ and to degraded network-based services in other AZs as this degenerate case was triggered.
All this being said, what is so shocking about this banter is that startups around the globe were essentially blaming a hard drive manufacturer for taking down their sites. I don’t believe I’ve ever heard of a startup blaming NetApp or Seagate for an outage in their hosted environments. People building on the cloud shouldn’t get a pass for poor architectural decisions that put too much emphasis on, essentially, network attached RAID1 storage saving their asses in an outage.
Which brings me to my main point: the cloud is not a silver bullet. S3, EC2, EBS, ELB, et al. have their strengths and weaknesses, just like any piece of hardware you’d buy from a traditional enterprise vendor. Additionally, the cloud has wholly unique architectural characteristics. What I mean by this is that the tools AWS provides need to be assembled and leveraged in unique ways that differ, sometimes greatly, from how you’d build a system if you were building out your own datacenter.
The #1 characteristic that people fail, over and over, to recognize is that the cloud is entirely ephemeral. Everything from EC2 instances to EBS volumes could vaporize in an instant. If you are not building your system with this in mind, you’ve already lost. When you’re dealing with an environment like this, which, by the way, is what most large hosted services deal with, your architecture needs to exhibit a few key characteristics:
- Everything needs to be automated. Spinning up new instances, expanding your clusters, backups, restoring from backups, metrics, monitoring, configurations, deployments, etc. should all be automated.
- You must build share-nothing services that span AZs at a minimum. Preferably your services should span regions as well, which is technically more difficult to implement, but will increase your availability by an order of magnitude (see the sketch after this list).
- Avoid relying on ACID services. It’s not that you can’t run MySQL, PostgreSQL, etc. on the cloud, but the ephemeral and distributed nature of the cloud makes them much more difficult to run reliably.
- Data must be replicated across multiple types of storage. If you run MySQL on top of RDS, you should also be replicating to slaves on EBS, to RDS multi-AZ slaves, to ephemeral drives, etc. Additionally, snapshots and backups should span regions (a one-call sketch of cross-region snapshot copies follows below). This allows entire components to disappear while you either continue to operate or restore quickly, even if a major AWS service is down.
- Application-level replication strategies. To truly go multi-region, or to span across cloud services, you’ll very likely have to build replication strategies into your application rather than relying on those inherent in your storage systems.
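To make the automation and AZ-spanning points concrete, here’s a minimal sketch of launching one node of a share-nothing service into every AZ in a region, using boto3. The region, AMI ID, and instance type are placeholders, and a real deployment would drive this from configuration management or auto scaling rather than a one-off script.

```python
"""Minimal sketch: launch one node of a share-nothing service into each
available AZ in a region. The region, AMI ID, and instance type are
placeholders -- the point is that none of this is done by hand."""
import boto3

REGION = "us-east-1"                   # placeholder region
AMI_ID = "ami-0123456789abcdef0"       # placeholder AMI baked with your service
INSTANCE_TYPE = "m5.large"             # placeholder instance type

ec2 = boto3.client("ec2", region_name=REGION)

# Discover the AZs instead of hard-coding them.
zones = [
    z["ZoneName"]
    for z in ec2.describe_availability_zones()["AvailabilityZones"]
    if z["State"] == "available"
]

for zone in zones:
    # One node per AZ: any single AZ can vaporize and the service keeps running.
    resp = ec2.run_instances(
        ImageId=AMI_ID,
        InstanceType=INSTANCE_TYPE,
        MinCount=1,
        MaxCount=1,
        Placement={"AvailabilityZone": zone},
    )
    print(zone, resp["Instances"][0]["InstanceId"])
```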
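On the backups-spanning-regions point, the cross-region piece is a single API call. Another hedged sketch, with a placeholder snapshot ID and regions:

```python
"""Sketch: copy an EBS snapshot into a second region so at least one backup
survives a region-wide problem. Snapshot ID and regions are placeholders."""
import boto3

SOURCE_REGION = "us-east-1"              # where the snapshot lives today
BACKUP_REGION = "us-west-2"              # where a copy should survive
SNAPSHOT_ID = "snap-0123456789abcdef0"   # placeholder snapshot ID

# copy_snapshot is issued against the *destination* region's endpoint.
ec2_backup = boto3.client("ec2", region_name=BACKUP_REGION)
copy = ec2_backup.copy_snapshot(
    SourceRegion=SOURCE_REGION,
    SourceSnapshotId=SNAPSHOT_ID,
    Description="Cross-region copy of {} for disaster recovery".format(SNAPSHOT_ID),
)
print("Started cross-region copy:", copy["SnapshotId"])
```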
Many people point out that Amazon’s SLA for its AWS products is “only” 99.9%. What they fail to recognize is that those are per-AZ numbers. You can compound your 9’s in a positive manner by spanning multiple AZs, assuming failures in different AZs are largely independent. This means that the simple act of spanning two AZs takes you from 99.9% uptime on EC2 to 99.9999%. If you went to three AZs, with an appropriate share-nothing architecture, you’d be looking at a theoretical 99.9999999% uptime. That’s 0.03 seconds of downtime a year.
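The arithmetic behind those numbers is simple, under the admittedly generous assumption that AZ failures are independent and that you’re only down when every AZ you span is down at once:

```python
"""Back-of-the-envelope availability math for spanning AZs, assuming AZ
failures are independent and the service is down only when every AZ is
down simultaneously."""
SECONDS_PER_YEAR = 365 * 24 * 3600  # 31,536,000


def downtime_seconds_per_year(availability_per_az, num_azs):
    """Expected seconds of downtime per year across num_azs independent AZs."""
    probability_all_down = (1 - availability_per_az) ** num_azs
    return probability_all_down * SECONDS_PER_YEAR


for n in (1, 2, 3):
    print(n, "AZ(s):", round(downtime_seconds_per_year(0.999, n), 2), "seconds/year")

# 1 AZ(s): 31536.0 seconds/year  (~8.8 hours)
# 2 AZ(s): 31.54 seconds/year
# 3 AZ(s): 0.03 seconds/year
```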
This all being said, I think it is fair to give Amazon a hard time for a couple of things. Personally, I think the majority of the backlash this outage caused was driven by a few factors:
- Amazon and the AWS team need to more clearly communicate, and reinforce that communication, about the fallibility of each of their services. For too long, Amazon has let the myth run rampant that the cloud is a magic silver bullet, and it’s time to pull back on that messaging. I don’t think Amazon has done this with malice; rather, they’re a victim of their own success. The reality is that people are building fantastic systems with outrageously impressive redundancy on top of AWS, but it’s not being clearly articulated that these systems are the result of a combination of great tools and great architecture – not magic cloud pixie dust.
- Amazon’s community outreach and education needs to be much stronger than it is. They should have a small army of AWS specialists preaching proper cloud architecture at every hackathon, developer day, and conference.
- A lot more effort needs to go into documenting proper cloud architecture. The cloud has changed the game. There are new tools and, as a result, new ways of building systems in the cloud. Case studies, diagrams, approved tools, etc. should all be highlighted, documented, and preached about accordingly.
So what have we learned? The cloud isn’t a silver bullet. You still have to build proper redundancy into your systems and applications. And, most importantly, you, not Amazon, are ultimately responsible for your system’s uptime.