The Cloud is not a Silver Bullet

There has been much brouhaha over the recent EBS outage. I say EBS to be specific because, despite the sensationalistic headlines, this was not an AWS or EC2 outage. RDS was also affected, as it is built on top of the EBS product. As has been reported just about everywhere, the outage affected many large websites that are built on top of AWS. Much of the banter was around how the “AWS outage” had “brought down” these sites, which couldn’t be further from the truth. What brought down these services was poor architectural choices combined with a lack of understanding of the unique characteristics of cloud infrastructure.

Our friends at Twilio and SmugMug, along with SimpleGeo, were able to happily weather the storm without significant issues. Why were we able to survive while others were not? It’s simple: anyone who’s spent a lot of time talking with the fine folks at AWS, or researching the EBS product, would know that it has a number of drawbacks:

  • EBS volumes are not anywhere near as fault tolerant as S3. S3 has 11 9’s of durability. EBS volumes are no more or less durable than a simple RAID1.
  • EBS volumes increase network IO to and from your instance to some degree and can degrade or interfere with other network services on a given instance.
  • EBS volumes lack the IO characteristics that many IO-bound services require. I’ve heard insane stories of people doing RAID0 across dozens of EBS volumes to get the IO performance they’re looking for. At SimpleGeo, we RAID0 the ephemeral drives and split data across multiple ephemeral RAID0 volumes to get the IO we’re looking for (a rough sketch of that setup follows this list).
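
For the curious, the ephemeral RAID0 setup amounts to something like the following Python sketch. The device names, filesystem, and mount point are assumptions, and this is not SimpleGeo’s actual tooling; it’s just the shape of the automation.

    import subprocess

    # Stripe an instance's ephemeral drives into a single RAID0 array and
    # mount it. Device names, filesystem, and mount point are assumptions.
    EPHEMERAL_DEVICES = ["/dev/xvdb", "/dev/xvdc"]

    def build_ephemeral_raid0(devices, md="/dev/md0", mount_point="/mnt/data"):
        subprocess.check_call(
            ["mdadm", "--create", md, "--level=0",
             "--raid-devices=%d" % len(devices)] + devices)
        subprocess.check_call(["mkfs.xfs", md])        # or ext4, etc.
        subprocess.check_call(["mkdir", "-p", mount_point])
        subprocess.check_call(["mount", md, mount_point])

    if __name__ == "__main__":
        build_ephemeral_raid0(EPHEMERAL_DEVICES)

Remember that everything on that array evaporates with the instance, which is exactly why the data also has to live somewhere else.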

This doesn’t mean that you should avoid EBS or that EBS is a bad product. Quite the contrary. I think EBS is a pretty fucking cool product when you consider its features and simplicity. What it does mean is that you need to know its weaknesses in order to properly build a solution on top of it.

I think the largest surprise of the EBS outage, and one that no doubt will be reviewed in depth by the engineers at AWS, was that an EBS problem in one AZ was able to degrade services in other AZs. My general understanding is that the issue arose when EBS in one AZ got confused about possible network outages and started re-replicating degraded EBS volumes around the downed AZ. In a very degenerate case, when EBS volumes couldn’t find any peers in their home AZ, they attempted to reach across other AZs to replicate data to other peers. This led to EBS being totally unavailable in the originally problematic AZ and to degraded network-based services in other AZs as this degenerate case was triggered.

All this being said, what is so shocking about this banter is that startups around the globe were essentially blaming a hard drive manufacturer for taking down their sites. I don’t believe I’ve ever heard of a startup blaming NetApp or Seagate for an outage in their hosted environments. People building on the cloud shouldn’t get a pass for poor architectural decisions that put too much emphasis on, essentially, network attached RAID1 storage saving their asses in an outage.

Which brings me to my main point: the cloud is not a silver bullet. S3, EC2, EBS, ELB, et al. have their strengths and weaknesses, just like any piece of hardware you’d buy from a traditional enterprise vendor. Additionally, the cloud has wholly unique architectural characteristics. What I mean by this is that the tools AWS provides need to be assembled and leveraged in unique ways that differ, sometimes greatly, from how you’d build a system if you were building out your own datacenter.

The #1 characteristic that people fail, over and over, to recognize is that the cloud is entirely ephemeral. Everything from EC2 instances to EBS volumes could vaporize in an instant. If you are not building your system with this in mind, you’ve already lost. When you’re dealing with an environment like this, which, by the way, most large hosted services have to deal with, your architecture needs to exhibit a few key characteristics:

  • Everything needs to be automated. Spinning up new instances, expanding your clusters, backups, restoring from backups, metrics, monitoring, configurations, deployments, etc. should all be automated.
  • You must build shared-nothing services that span AZs at a minimum (see the sketch after this list). Preferably your services should span regions as well, which is technically more difficult to implement, but will increase your availability by an order of magnitude.
  • Avoid relying on ACID services. It’s not that you can’t run MySQL, PostgreSQL, etc. in the cloud, but the ephemeral and distributed nature of the cloud makes ACID guarantees much more difficult to sustain.
  • Data must be replicated across multiple types of storage. If you run MySQL on top of RDS, you should be replicating to slaves on EBS, RDS multi-AZ slaves, ephemeral drives, etc. Additionally, snapshots and backups should span regions. This allows entire components to disappear while you either continue to operate or restore quickly, even if a major AWS service is down.
  • Application-level replication strategies. To truly go multi-region, or to span across cloud services, you’ll very likely have to build replication strategies into your application rather than relying on those inherent in your storage systems.
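
To give a flavor of the automation and AZ-spanning points above, here is a minimal sketch using boto3. The AMI ID, instance type, and AZ list are placeholders, and a real deployment would drive this from configuration management rather than a hand-run script.

    import boto3

    # Spin up one instance per Availability Zone so that losing any single
    # AZ can't take out the whole tier. AMI ID, instance type, and AZ list
    # below are placeholders.
    ec2 = boto3.client("ec2", region_name="us-east-1")
    ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

    def launch_across_zones(ami_id="ami-12345678", instance_type="m3.large"):
        instance_ids = []
        for zone in ZONES:
            resp = ec2.run_instances(
                ImageId=ami_id,
                InstanceType=instance_type,
                MinCount=1,
                MaxCount=1,
                Placement={"AvailabilityZone": zone},
            )
            instance_ids.extend(i["InstanceId"] for i in resp["Instances"])
        return instance_ids

    if __name__ == "__main__":
        print(launch_across_zones())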

Many people point out that Amazon’s SLA for its AWS products is “only” 99.9%. What they fail to recognize is that those are per-AZ numbers. You can compound your 9’s in a positive manner by spanning multiple AZs. This means that the simple act of spanning two AZs takes you from 99.9% uptime on EC2 to 99.9999%. If you went to three AZs, with an appropriate shared-nothing architecture, you’d be looking at a theoretical 99.9999999% uptime. That’s about 0.03 seconds of downtime a year.
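
For the skeptical, the arithmetic is easy to check. The sketch below assumes AZ failures are completely independent, which this very outage showed is only the theoretical best case.

    # Compound availability across N independent AZs, each at 99.9% uptime.
    SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000
    per_az = 0.999

    for azs in (1, 2, 3):
        downtime_fraction = (1 - per_az) ** azs  # all AZs down at once
        print("%d AZ(s): %.7f%% uptime, %.4f seconds of downtime per year"
              % (azs, (1 - downtime_fraction) * 100,
                 downtime_fraction * SECONDS_PER_YEAR))
    # 1 AZ(s): 99.9000000% uptime, 31536.0000 seconds of downtime per year
    # 2 AZ(s): 99.9999000% uptime, 31.5360 seconds of downtime per year
    # 3 AZ(s): 99.9999999% uptime, 0.0315 seconds of downtime per year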

This all being said, I think it is fair to give Amazon a hard time for a couple of things. Personally, I think the majority of the backlash this outage caused was driven by a few factors:

  1. Amazon and the AWS team need to communicate the fallibility of each of their services more clearly, and keep reinforcing that communication. For too long, Amazon has let the myth run rampant that the cloud is a magic silver bullet, and it’s time to pull back on that messaging. I don’t think Amazon has done this with malice; rather, they’re a victim of their own success. The reality is that people are building fantastic systems with outrageously impressive redundancy on top of AWS, but it’s not being clearly articulated that these systems are due to a combination of great tools and great architecture – not magic cloud pixie dust.
  2. Amazon’s community outreach and education needs to be much stronger than it is. They should have a small army of AWS specialists preaching proper cloud architecture at every hackathon, developer day, and conference.
  3. A lot more effort needs to go into documenting proper cloud architecture. The cloud has changed the game. There are new tools and, as a result, new ways of building systems in the cloud. Case studies, diagrams, approved tools, etc. should all be highlighted, documented, and preached about accordingly.

So what have we learned? The cloud isn’t a silver bullet. You still have to build proper redundancy into your systems and applications. And, most importantly, you, not Amazon, are ultimately responsible for your system’s uptime.

Distributed vs. Fault Tolerant Systems

I’ve been researching implementations of distributed search technology for some things we want to do at SimpleGeo. It was about this time that I ran across a project called Katta, which is a “distributed” implementation of Lucene indexes. While perusing the documentation I came across a diagram detailing Katta’s architecture. What struck me as odd was that Katta was purporting to be a distributed system, yet had a master/slave setup managing it. This led me to tweet out:

Dear Engineers, It’s not a “distributed” system if you have a master/slave setup. That’s simply a “redundant” system.

Not long after this I got a few inquiries, along with some great input from the ever-wise Jan Lehnardt of the CouchDB project and from Christopher Brown:

RT @joestump: “…It’s not a ‘distributed’ system if you have a master/slave setup. That’s simply ‘redundant’” — Same: distributed != sharding

And it’s not “distributed” if it’s all hub-n-spokes. Might as well have a central server with some extra storage.

Basically, Jan and I are making the argument that redundancy or sharding/partitioning doesn’t really add up to a truly distributed system. Well, not in the sense of what I think of when I think of distributed. Going back to my old friend, the CAP Theorem, I think we can see why.

Redundancy could be argued to be the A in CAP: always available. I’d even argue that partitioning, in the sense that most people think of (e.g. partitioning 300m users across 100 MySQL servers), is the A in CAP. The P in CAP refers to a system being highly tolerant of network partitioning. Because of the master/slave setup of Katta, it really only implements the A in CAP.
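
To be concrete about the kind of partitioning most people mean, here is a minimal sketch; the shard count and host names are made up for illustration.

    # Classic "shard 300m users across N MySQL servers" partitioning:
    # route each user to a shard by their ID. Shard count and host names
    # are made up.
    NUM_SHARDS = 100
    SHARD_HOSTS = ["mysql-shard-%03d.example.com" % i for i in range(NUM_SHARDS)]

    def shard_for(user_id):
        """Return the MySQL host that owns this user's data."""
        return SHARD_HOSTS[user_id % NUM_SHARDS]

    print(shard_for(299999999))  # -> mysql-shard-099.example.com

This spreads load and keeps the overall service available, but a network partition between you and a shard’s master simply takes that slice of users offline; nothing about the scheme tolerates partitions.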

Of course, CAP says that you can only have two of the three in any distributed system. I’d make the argument that no system is truly distributed unless it fully implements two of the three properties in CAP. I’d further argue that even if you had a highly available (A), highly consistent (C) system, it still wouldn’t be distributed, since it would lack the ability to be highly tolerant of network partitioning.

The problem that I have with Katta’s definition of distributed is that I can’t truly distribute Katta in the raw sense of the word. For instance, I can’t spread Katta across three data centers and not worry about a data center going down. If I lose access to the master then Katta is worthless to me.

To have a truly distributed system I’d argue you need the following characteristics in your software (a rough sketch of what masterless placement can look like follows the list):

  • There can be no master (hub).
  • There must be multiple copies of data spread across multiple nodes in multiple data centers.
  • Software and systems must be tolerant of data center failures.
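
As an illustration only, and not how Katta or any particular system does it, masterless replica placement can be as simple as rendezvous hashing; the node and data center names below are made up.

    import hashlib

    # Each node knows the full node list; any node can compute placement,
    # so there is no master to lose. Names below are made up.
    NODES = [
        ("dc1-node1", "dc1"), ("dc1-node2", "dc1"),
        ("dc2-node1", "dc2"), ("dc2-node2", "dc2"),
        ("dc3-node1", "dc3"), ("dc3-node2", "dc3"),
    ]

    def _score(key, node_name):
        """Deterministic pseudo-random weight for a (key, node) pair."""
        return int(hashlib.md5(("%s:%s" % (key, node_name)).encode()).hexdigest(), 16)

    def replicas_for(key, n=3):
        """Pick n replicas for a key, preferring distinct data centers."""
        ranked = sorted(NODES, key=lambda nd: _score(key, nd[0]), reverse=True)
        chosen, seen_dcs = [], set()
        for name, dc in ranked:           # first pass: one replica per DC
            if dc not in seen_dcs:
                chosen.append(name)
                seen_dcs.add(dc)
            if len(chosen) == n:
                return chosen
        for name, _ in ranked:            # top up if there are fewer DCs than n
            if name not in chosen:
                chosen.append(name)
            if len(chosen) == n:
                break
        return chosen

    print(replicas_for("some-lucene-shard-007"))

Real systems like Dynamo and Cassandra layer failure detection, hinted handoff, and repair on top of this kind of idea, but the point stands: placement and lookup keep working even with an entire data center gone.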

My point is, I think the word “distributed” is applied too freely to systems that are, more appropriately, called “redundant” or “fault tolerant”.

UPDATE: Seems I confused some people about my master/slave argument. You can build distributed systems that have master/slave pieces in the puzzle. Large distributed systems are rarely one cohesive piece of software, rather they tend to be a series of systems glued together. What I am saying is that a simple master/slave system is not distributed. Through a series of bolts, glue, add-ons, and duct tape you can make a master/slave system sufficiently distributed, but don’t confuse the totality of that system with the base master/slave system.

It's not the language stupid

I’ve said it once, I’ve said it twice, I’ve screamed it from the top of mountains and yet nobody listens. I’m sitting in a session at the MySQL Conference and the person presenting just said, “You have to have well written code to avoid bottlenecks.” This is, put bluntly, stupid and patently false. Let me explain.

  • Your true bottlenecks when scaling are very rarely, if ever, because of your language. Sure, Ruby is slower than PHP or Perl or Python, but only incrementally so, and it’s only going to get faster. Even if your language is your problem, it’s the easiest part of your architecture to scale: add more hardware.
  • Just because your code is well written doesn’t mean it will perform well and, conversely, just because you write shitty code doesn’t mean your code will not perform well. I’ve seen some seriously shitty PHP code that’s blazing fast because it’s so simple.
  • Depending on your application, as you grow you’ll find that your scaling issues come down to one fundamental problem: I/O. DB I/O, file system / disk I/O, network traffic, etc., etc. Ask anyone who’s written a large scale application where their growing pains were and I’ll bet my last dollar it wasn’t “PHP/Python/Ruby/Perl/Java/COBOL is slow”. I’m betting they’ll say something along the lines of “MySQL took a crap on us after we hit 200,000,000 records and had to do date range scans.” Or they’ll say, “I was storing user generated content and NFS couldn’t scale to the amount of requests for that content.” (A tiny sketch of the date-range-scan flavor of this problem follows this list.)
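
To make the database I/O point concrete, here is a tiny illustration using Python’s built-in sqlite3; the table name and schema are made up, but the lesson carries over to MySQL and friends: the same query goes from a full table scan to an index search, and no amount of faster application code fixes the first case.

    import sqlite3

    # A date range scan with and without an index. Table name and schema
    # are made up for illustration.
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE events (id INTEGER PRIMARY KEY, created_at TEXT)")

    query = ("SELECT * FROM events "
             "WHERE created_at BETWEEN '2011-01-01' AND '2011-02-01'")

    # Without an index the planner has to walk every row in the table.
    print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
    # -> ... SCAN events (a full table scan)

    conn.execute("CREATE INDEX idx_events_created_at ON events (created_at)")

    # With the index it becomes a bounded index search.
    print(conn.execute("EXPLAIN QUERY PLAN " + query).fetchall())
    # -> ... SEARCH events USING INDEX idx_events_created_at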

I’m sick and tired of the language zealots who say PHP is slower than Perl, or Ruby is slower than PHP, or Java sucks, because which language you’re using has zero to do with that missing index on your table or the fact that you can’t store all of that user generated content.

It comes down to your architecture and, despite what the zealots would have you believe, the language you choose is only one component of your overall architecture. Choose what you know and run with it.