The Cloud is not a Silver Bullet

There has been much brouhaha over the recent EBS outage. I say EBS to be specific because, despite the sensationalistic headlines, this was not an AWS or EC2 outage. RDS was also affected, as it is built on top of the EBS product. As has been reported just about everywhere, the outage affected many large websites that are built on top of AWS. Much of the banter was around how the “AWS outage” had “brought down” these sites, which couldn’t be further from the truth. What brought down these services was poor architectural choices combined with a lack of understanding of the unique characteristics of cloud infrastructure.

Our friends at Twilio and SmugMug, along with SimpleGeo, were able to happily weather the storm without significant issues. Why were we able to survive the storm while others were not? It’s simple; anyone who’s spent a lot of time talking with the fine folks at AWS or researching the EBS product would know that it has a number of drawbacks:

  • EBS volumes are not anywhere near as fault tolerant as S3. S3 has 11 9’s of durability. EBS volumes are no more or less durable than a simple RAID1.
  • EBS volumes increase network IO to and from your instance to some degree and can degrade or interfere with other network services on a given instance.
  • EBS volumes lack the IO characteristics that many IO-bound services require. I’ve heard insane stories of people doing RAID0 across dozens of EBS volumes to get the IO performance they’re looking for. At SimpleGeo, we RAID0 the ephemeral drives and split data across multiple ephemeral RAID0 volumes to get the IO we’re looking for.
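
As a concrete example of that last point, here’s a minimal sketch of striping an m1.large’s two ephemeral drives into a single RAID0 array with XFS on top. The device names and mount point are examples that vary by instance type and AMI; this isn’t our exact production setup.

    # Rough sketch: stripe the two ephemeral drives on an m1.large into a
    # RAID0 array and put XFS on it. Device names and mount point are
    # examples and vary by instance type/AMI.
    import subprocess

    EPHEMERAL = ["/dev/sdb", "/dev/sdc"]  # m1.large ephemeral drives (example)
    ARRAY = "/dev/md0"
    MOUNT = "/mnt/data"

    def run(cmd):
        print("$ " + cmd)
        subprocess.check_call(cmd, shell=True)

    run("mdadm --create %s --level=0 --raid-devices=%d %s"
        % (ARRAY, len(EPHEMERAL), " ".join(EPHEMERAL)))
    run("mkfs.xfs -f %s" % ARRAY)
    run("mkdir -p %s" % MOUNT)
    run("mount %s %s" % (ARRAY, MOUNT))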

This doesn’t mean that you should avoid EBS or that EBS is a bad product. Quite the contrary. I think EBS is a pretty fucking cool product when you consider its features and simplicity. What it does mean is that you need to know its weaknesses in order to properly build a solution on top of it.

I think the largest surprise of the EBS outage, and one that will no doubt be reviewed in depth by the engineers at AWS, was that an EBS problem in one AZ was able to degrade services in other AZs. My general understanding is that the issue arose when EBS in one AZ got confused about possible network outages and started replicating degraded EBS volumes around the downed AZ. In a very degenerate case, when EBS volumes couldn’t find any peers in their home AZ, they attempted to reach across to other AZs to replicate data to other peers. This left EBS totally unavailable in the originally problematic AZ and degraded network-based services in other AZs as this degenerate case was triggered.

All this being said, what is so shocking about this banter is that startups around the globe were essentially blaming a hard drive manufacturer for taking down their sites. I don’t believe I’ve ever heard of a startup blaming NetApp or Seagate for an outage in their hosted environments. People building on the cloud shouldn’t get a pass for poor architectural decisions that put too much emphasis on, essentially, network attached RAID1 storage saving their asses in an outage.

Which brings me to my main point: the cloud is not a silver bullet. S3, EC2, EBS, ELB, et al have their strengths and weaknesses, just like any piece of hardware you’d buy from a traditionally enterprise vendor. Additionally, the cloud has wholly unique architectural characteristics. What I mean by this is that the tools AWS provides you need to be assembled and leveraged in unique ways that differ, sometimes greatly, to how you’d build a system if you were building out your own datacenter.

The #1 characteristic that people fail, over and over, to recognize is that the cloud is entirely ephemeral. Everything from EC2 instances to EBS volumes could vaporize in an instant. If you are not building your system with this in mind, you’ve already lost. When you’re dealing with an environment like this, which, by the way, most large hosted services deal with, your architecture needs to exhibit a few key characteristics:

  • Everything needs to be automated. Spinning up new instances, expanding your clusters, backups, restoring from backups, metrics, monitoring, configurations, deployments, etc. should all be automated.
  • You must build share-nothing services that span AZs at a minimum. Preferably your services should span regions as well, which is technically more difficult to implement, but will increase your availability by an order of magnitude. (See the sketch after this list.)
  • Avoid relying on ACID services. It’s not that you can’t run MySQL, PostgreSQL, etc. in the cloud, but the ephemeral and distributed nature of the cloud makes this a much more difficult feature to sustain.
  • Data must be replicated across multiple types of storage. If you run MySQL on top of RDS, you should be replicating to slaves on EBS, RDS multi-AZ slaves, ephemeral drives, etc. Additionally, snapshots and backups should span regions. This allows entire components to disappear and you to either continue to operate or restore quickly even if a major AWS service is down.
  • Application-level replication strategies. To truly go multi-region, or to span across cloud services, you’ll very likely have to build replication strategies into your application rather than relying on those inherent in your storage systems.
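
To make the multi-AZ point concrete, here’s a minimal sketch, using boto, of spreading identical share-nothing API nodes across three Availability Zones. The AMI ID, key pair, and security group are placeholders, and the boto-style EC2 calls are an assumption about your tooling, not a prescription.

    # Minimal sketch: launch one identical, share-nothing node per AZ so that
    # losing any single AZ leaves the others serving. AMI, key pair, and
    # security group are placeholders.
    import boto.ec2

    ZONES = ["us-east-1a", "us-east-1b", "us-east-1c"]

    conn = boto.ec2.connect_to_region("us-east-1")
    instances = []
    for zone in ZONES:
        reservation = conn.run_instances(
            "ami-12345678",            # placeholder AMI baked with your stack
            instance_type="m1.large",
            placement=zone,
            key_name="my-keypair",     # placeholder key pair
            security_groups=["api"],   # placeholder security group
        )
        instances.extend(reservation.instances)

    print("Launched %d nodes across %d AZs" % (len(instances), len(ZONES)))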

Many people point out that Amazon’s SLA for its AWS products is “only” 99.9%. What they fail to recognize is that those are per-AZ numbers. You can compound your 9’s in a positive manner by spanning multiple AZs. This means that the simple act of spanning two AZs takes you from 99.9% uptime on EC2 to 99.9999%. If you went to three AZs, with an appropriate share-nothing architecture, you’d be looking at a theoretical 99.9999999% uptime. That’s 0.03 seconds of downtime a year.
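
For the curious, here’s the back-of-the-envelope arithmetic behind those numbers. It’s a sketch that assumes AZ failures are independent, which is exactly the assumption this outage stressed.

    # If each AZ is up 99.9% of the time and failures are independent, the
    # chance that all of your AZs are down at once is the product of the
    # individual failure probabilities.
    SECONDS_PER_YEAR = 365 * 24 * 60 * 60  # 31,536,000

    per_az_downtime = 1 - 0.999  # 0.1% per AZ

    for azs in (1, 2, 3):
        all_down = per_az_downtime ** azs
        print("%d AZ(s): %.7f%% uptime, ~%.2f seconds of downtime a year"
              % (azs, (1 - all_down) * 100, all_down * SECONDS_PER_YEAR))

    # 1 AZ:  99.9%        ~31,536 seconds/year (roughly 8.8 hours)
    # 2 AZs: 99.9999%     ~31.5 seconds/year
    # 3 AZs: 99.9999999%  ~0.03 seconds/year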

This all being said, I think it is fair to give Amazon a hard time about a few things. Personally, I think the majority of the backlash from this outage was driven by the following factors:

  1. Amazon and the AWS team need to more clearly communicate, and reinforce that communication, about the fallibility of each of their services. For too long, Amazon has let the myth run rampant that the cloud is a magic silver bullet and it’s time to pull back on that messaging. I don’t think Amazon has done this with malice, rather they’re a victim of their own success. The reality is that people are building fantastic systems with outrageously impressive redundancy on top of AWS, but it’s not being clearly articulated that these systems are due to a combination of great tools and great architecture – not magic cloud pixie dust.
  2. Amazon’s community outreach and education needs to be much stronger than it is. They should have a small army of AWS specialists preaching proper cloud architecture at every hackathon, developer day, and conference.
  3. A lot more effort needs to go into documenting proper cloud architecture. The cloud has changed the game. There are new tools and, as a result, new ways of building systems in the cloud. Case studies, diagrams, approved tools, etc. should all be highlighted, documented, and preached about accordingly.

So what have we learned? The cloud isn’t a silver bullet. You still have to build proper redundancy into your systems and applications. And, most importantly, you, not Amazon, are ultimately responsible for your system’s uptime.

Why I’ll never own another server

When Matt and I started SimpleGeo, I made a decision early on to use Amazon’s AWS services to run our infrastructure. A lot of people think I’m nuts for this, for a variety of reasons, but I generally hear two major objections when I mention that we run on AWS/EC2.

  1. AWS is slow!
  2. AWS is expensive!

I’ve covered IO performance on EC2 in depth before and have compared the IO benchmarks, favorably, against numbers from Digg and from Media Temple’s systems engineers. The notion that AWS is too slow for your application is, largely, not supported by the numbers and comparisons. The second point I often make with regard to performance is that Amazon runs large portions of its own infrastructure on AWS. Trust me: if it’s good enough for the largest online retailer in the world, it’s good enough for you.

The second point is a bit harder to defend sometimes. AWS can be cheaper than running your own hardware, and vice versa. If you run a huge number of servers, AWS can come out a few hundred thousand dollars more expensive on a raw comparison of hardware costs. The problem with this vanilla comparison is that it forgets one extremely important cost for startups: opportunity cost.

I have a few rhetorical questions for the people who aren’t using AWS.

  • How many people does it take to maintain your own DC? People have to wrangle hardware, travel to various DCs, RMA hardware, etc. If you didn’t need those people, or they weren’t busy wiring your DC, what could you be doing with those resources?
  • How much time, money, effort, and overhead is it going to take to create multiple data centers? Have you negotiated bandwidth contracts before? Do your DCs have power from multiple providers? Do they have power and bandwidth failover? Amazon has amazing economies of scale and has spent thousands of man hours (years?) preparing for power/bandwidth failover, floods, natural disasters, etc.
  • Managing multiple data centers requires a small army of highly trained network operations people. Have you built DC failover before? Have you implemented load balancing across multiple DCs? It took me about 30 minutes to set up an Elastic Load Balancer that spread traffic across three Availability Zones (Amazon’s term for DCs).
  • Have you thought about building your own automation and self-service APIs for the DC you want to build? Fabric/Chef/Puppet/Capistrano combined with AWS’s automation API is an extremely potent combination for automating large clusters. For instance, we use Fabric and Boto to automate the creation of all nodes in our cluster. I can run a command in Fabric that creates an API server out of thin air, bootstraps it, and puts it into our ELB. This takes about five or so minutes. (See the sketch after this list.)
  • Have you ever set up a DC in Europe? What about Asia? Would you even know where to start? I can spin up a server in Europe in a matter of seconds. How much might you spend on flying your network operations folks to and fro all of these DCs you plan on building?
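
As a rough illustration of that Fabric/Boto workflow, here’s a simplified sketch of the “API server out of thin air” flow: launch an instance with boto, bootstrap it over SSH with Fabric, then register it with the ELB. The AMI, key pair, ELB name, and bootstrap command are all placeholders, and the boto/Fabric calls are assumptions about your tooling rather than our exact fabfile.

    # Simplified sketch: create an API server out of thin air, bootstrap it,
    # and put it into the ELB. AMI, key pair, ELB name, and the bootstrap
    # command are placeholders.
    import time

    import boto.ec2
    import boto.ec2.elb
    from fabric.api import env, sudo

    def build_api_server(zone="us-east-1a"):
        ec2 = boto.ec2.connect_to_region("us-east-1")
        elb = boto.ec2.elb.connect_to_region("us-east-1")

        # 1. Launch the instance.
        instance = ec2.run_instances(
            "ami-12345678",           # placeholder AMI
            instance_type="m1.large",
            placement=zone,
            key_name="my-keypair",    # placeholder key pair
            security_groups=["api"],
        ).instances[0]

        # 2. Wait for it to come up.
        while instance.update() != "running":
            time.sleep(5)

        # 3. Bootstrap it with Fabric (Chef/Puppet/Capistrano work just as well).
        env.host_string = instance.public_dns_name
        env.user = "ubuntu"                         # placeholder SSH user
        env.key_filename = "~/.ssh/my-keypair.pem"  # placeholder key file
        sudo("/usr/local/bin/bootstrap-api-node")   # placeholder bootstrap step

        # 4. Put it into the load balancer.
        lb = elb.get_all_load_balancers(load_balancer_names=["api-lb"])[0]
        lb.register_instances([instance.id])
        return instance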

These are just a few of the nooks and crannies that people often forget when comparing running their own data centers to running on AWS, and I think they’re extremely important. The two biggest costs, in my opinion, that people forget are opportunity cost and the cost of building automation systems.

To expound a bit on opportunity cost, I’d like to quote the ever-thoughtful Joi Ito.

“If you want to increase the pace of innovation, you need to lower the cost of failure.” — Joi Ito

I can fire up an entire DC for SimpleGeo, a 20-30 node cluster, with a few commands, totally automated; run load/consumption/system tests against it; find flaws in my system; and iterate in a matter of hours at a cost of a few hundred dollars.

The simple fact is, SimpleGeo wouldn’t be anywhere near as robust as it is, indeed it might not even exist, without leveraging the cloud.

NoSQL vs. RDBMS: Let the flames begin!

I’ve been getting solidly flamed recently, as have my former coworkers at Digg, my friends at Twitter, etc., about our adoption and promotion of various NoSQL storage systems. It seems that some DBAs are very, very upset that we internet kids are considering abandoning SQL’s ship. I’m not here to throw out a bunch of insane numbers or benchmarks, or to flame back, but I did want to point out why SimpleGeo and others are jumping onto the NoSQL bandwagon.

First, and foremost, I haven’t heard of anyone saying MySQL or PostgreSQL on comparable hardware is faster than the NoSQL options. The best I’ve heard is that MS SQL setups on SSD drives with lots of RAM could do 6,100 result sets a second. I guess, based on these posts, I’d like to ask a few questions of the people who honestly think RDBMSs can compete with NoSQL solutions at large scale.

  • Do you honestly think that the PhDs at Google, Amazon, Twitter, Digg, and Facebook would have created Cassandra, BigTable, Dynamo, etc. if they could have just used an RDBMS instead?
  • Has anyone run RDBMS benchmarks with highly heterogeneous datasets with lots of varying indexes on them? At Digg we had probably a hundred or so tables, and each table had varying indexes (a char here, an integer there, a date+time over there). Disk IO becomes a serious problem when indexes for different tables are stored on different parts of the disk and you have concurrent reads/writes. I know that people have found ways around this, such as 37Signals’ systems guy putting 15 x 15k RPM drives in his DB server. Assuming $500 a disk (15k disks range from $300 to $800 on Newegg), that’s $7,500 just for disks.
  • Is anyone out there running an EC2 large instance with an RDBMS on it that’s doing 1,800 reads/second? I’ve got a Cassandra node that was getting hammered with a load of 6 while serving that much traffic without falling over, which I think is pretty decent when you consider that each node could easily do that and that adding more nodes to handle more load is trivial.
  • How much are you spending on those MS SQL servers with SSD drives that serve up 6,100 results a second? MS SQL is $5,999 per processor. Windows Server 2008 is another $1,029. Decent 128GB SSDs appear to cost around $450 each. You see where I’m going with this. Nobody is arguing you can’t get RDBMSs to scale up to a few thousand reads/writes a second if you can afford to spend $50,000 or $100,000 per server. The problem is that very few startups can spend that much money on a single server.
  • How much time are your DBAs spending administering your RDBMSs? How much time are they in the data centers? How much do those data centers cost? How much do DBAs cost a year? Let’s say you have 10 monster DB servers and 1 DBA; you’re looking at about $500,000 in database costs.
  • How easy is it to add a new server to your cluster? If we identify a hot spot in our Cassandra cluster, we can have a new node bootstrapped into our cluster in about five minutes. And I mean it’s in production taking writes and serving reads.
  • Does your RDBMS automatically rebalance the entire cluster when a new node is bootstrapped into it?
  • I’m running a 50 node cluster, which spans three data centers, on Amazon’s EC2 service for about $10,000 a month. Furthermore, this is an operational expense as opposed to a capital expense, which is a bit nicer on the books. In order to scale an RDBMS to 6,000 reads/second I’d need to spend on the order of five months of operation of my 50 node cluster.
  • Has anyone run benchmarks with MySQL or PostgreSQL in an environment that sees 35,000 requests a second? IO contention becomes a huge issue when your stack needs to serve that many requests simultaneously. I know of one company that’s managing to scale portions of their PostgreSQL setup by purchasing $250,000 servers. That would cover my 50 node EC2 cluster for two years.
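
To make the cost comparisons in those last two bullets concrete, here’s the back-of-the-envelope arithmetic using the round numbers quoted above.

    # Rough cost math using the round numbers quoted above.
    cluster_monthly_cost = 10000    # 50-node EC2 cluster, dollars per month
    big_rdbms_server = 250000       # one of those PostgreSQL boxes, dollars
    rdbms_6k_reads = 50000          # low end of the "$50,000 or $100,000 per server" figure

    months = big_rdbms_server / float(cluster_monthly_cost)
    print("A $250k server buys %.0f months (~%.1f years) of the 50 node cluster"
          % (months, months / 12))

    print("A $50k server buys %.0f months of the 50 node cluster"
          % (rdbms_6k_reads / float(cluster_monthly_cost)))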

I guess what I’m saying is that my decision to use NoSQL, and I’m guessing others’ decisions to do so, has less to do with the fact that we can’t squeeze a few thousand writes a second out of MySQL and more to do with management and cost overhead. NoSQL solutions allow us to serve absurd amounts of data for a really, really low price. I’m happy to put my $/write, $/read, and $/GB numbers for my NoSQL setup against anyone’s RDBMS numbers.

We’re not nearly as dumb as everyone thinks we are; I promise.

Disk IO and throughput benchmarks on Amazon’s EC2

When I told people that we were going to run our infrastructure on Amazon’s EC2, most people recoiled in disgust. I heard lots and lots of horror stories about how you simply couldn’t run production environments on EC2: disk IO was horrible, throughput was bad, etc. Someone should seriously tell Amazon, since a lot of their own infrastructure runs on EC2 and AWS. For the most part, Amazon’s AWS services were internal tools that they released publicly.

We’ve been ironing out kinks in our production environment for the last few weeks, and one of the things that worried me was whether these assertions were true. So, I set out to run a fairly comprehensive test of disk IO and throughput: I ran hdparm -t, bonnie++, and iozone against ephemeral drives and EBS volumes in various configurations.

For all of my tests I tested the regular ephemeral drives as they were installed (“Normal”), LVM in a JBOD setup (“LVM”), the two ephemeral drives in a software RAID0 setup (“RAID0”), a single 100GB EBS volume (“EBS”), and two 100GB EBS volumes in a software RAID0 setup (“EBS RAID0”). All of the tests were run on large instances (m1.large) using the XFS file system.
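
For reference, here’s a minimal sketch of the kind of wrapper used to run the three benchmarks against a single configuration. The device and mount point are examples, and the flags shown are the basic invocations for each tool rather than necessarily the exact ones behind the numbers below.

    # Minimal sketch: run hdparm -t, bonnie++, and iozone against one
    # configuration. Device and mount point are examples; flags are the
    # basic invocations for each tool.
    import subprocess

    DEVICE = "/dev/md0"   # e.g. the software RAID0 array
    MOUNT = "/mnt/test"   # the XFS file system under test

    def run(cmd):
        print("$ " + cmd)
        subprocess.check_call(cmd, shell=True)

    # Raw sequential read throughput from the device.
    run("hdparm -t %s" % DEVICE)

    # File-system-level block/char IO and seeks; -u sets the user to run as.
    run("bonnie++ -d %s -u nobody" % MOUNT)

    # Full automatic iozone sweep against a temp file on the mount.
    run("iozone -a -f %s/iozone.tmp" % MOUNT)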

hdparm -t

[Chart: hdparm -t throughput results for each configuration.]

While hdparm -t isn’t the most comprehensive test in the world, I think it’s a decent gut check for simple throughput. To give a little context, I remember Digg’s production servers, depending on setup, ranging from 180MB/sec. to 340MB/sec. I’m guessing that if you upgraded to the XL instance type and did a RAID0 across the four ephemeral drives it has, you’d see even better numbers.

What I also found pretty interesting about these numbers is that the EBS volumes stacked up “okay” against the ephemeral drives and that the EBS volumes in a RAID0 didn’t gain us a ton of throughput. Considering that the EBS volumes run over the network, which I assume is gigabit ethernet, 94MB/sec. is pretty much saturating that network connection and, to say the least, impressive given the circumstances.

For most applications, I’d guess that EBS throughput is just fine. The raw throughput only becomes a serious requirement when you’re moving around lots of large files. Most applications move around lots of small files, which I’d likely use S3 for anyways. If your application needs to move around lots of large files I’d consider using a RAID0 ephemeral drive setup with redundancy (e.g. MogileFS spreading files across many nodes in various data centers).

bonnie++

[Chart: bonnie++ block input/output results for each configuration.]

Disregarding the Input/Output Block performance of the ephemeral RAID0 setup here, it’s extremely interesting to note that EBS IO performance is better than that of the ephemeral drives and that EBS in a RAID0 was better in almost every metric than the ephemeral drives.

That all being said, RAID0 ephemeral drives are the clear winner here. I do wonder, however, whether a RAID0 EBS array of, say, four, six, or eight volumes would be faster than the ephemeral RAID0 setup.

If your application is IO bound then I’d probably recommend using EBS volumes if you can afford it. Otherwise, I’d use RAID0 ephemeral drives. Again, the trick with the ephemeral drives is to ensure your data is replicated across multiple nodes. Of course, this is cloud computing we’re talking about, so you should be doing that anyways.

[Chart: bonnie++ CPU usage results for each configuration.]

Here are the CPU numbers for the various configurations. One thing to note here is that EBS, LVM, and software RAID all come with CPU costs. Somewhat interesting to note is that EBS has substantially less CPU usage in all areas except Input/Output Per Char.

If your application is both CPU and IO bound then I’d probably recommend upgrading your instance to an XL.

[Chart: bonnie++ random seeks per second for each configuration.]

The last bonnie++ results are the random seeks per second and, wow, was I surprised. A single EBS volume runs pretty much dead even with the LVM JBOD, and the EBS RAID0 is on par with the RAID0 ephemeral drives.

To say I was surprised by these numbers would be an understatement. The lesson here is that, if your application does lots of random seeking, you’ll want to use either EBS RAID0 volumes or RAID0 ephemeral drives.

iozone

Before running these tests I’d never even heard of this application, but it seemed to be used by quite a few folks so I thought I’d give it a shot.

[Chart: iozone results for each configuration.]

Again, some interesting numbers from EBS volumes. What I found pretty interesting here is that the EBS RAID0 setups actually ended up being slower in a few metrics than a single EBS volume. No idea why that may be.

The other thing to note is that the single EBS volume outperformed the ephemeral RAID0 setup in a few different metrics, most notably being random writes.

Conclusions

I think the overall conclusion here is that disk IO and throughput on EC2 is pretty darn good. I have a few other conclusions as well.

  • If you can replicate your data across multiple nodes then the RAID0 ephemeral drives are the clear winners.
  • If you are just looking to store lots of small files and serve them up, then definitely use S3 with CloudFront.
  • You’d very likely get even more impressive numbers using the XL instances with RAID0 striped across four ephemeral drives.
  • Another potential disk setup would be to put different datasets on different ephemeral drives. For instance, put one MySQL database on one ephemeral drive and another database on the other.
  • If your setups are IO bound and you’re looking for lots of redundancy, then EBS volumes are likely the way to go. If you don’t need the super redundancy on a single box then use RAID0 on ephemeral drives.