When Matt and I started SimpleGeo, I made a decision early on to run our infrastructure on Amazon Web Services (AWS). A lot of people think I'm nuts for this, for a lot of reasons, but I generally hear two major concerns when I mention that we run on AWS/EC2.
- AWS is slow!
- AWS is expensive!
I’ve covered IO performance on EC2 in depth before and have compared the IO benchmarks, favorably, against numbers from Digg and Media Temple’s systems engineers. The notion that AWS is too slow for your application is, largely, not supported by the numbers and comparisons. The second point I often make with regard to performance on AWS is that Amazon uses AWS to run large portions of its own infrastructure. Trust me, if it’s good enough for the largest online retailer in the world, it’s good enough for you.
The second concern is a bit harder to defend sometimes. AWS can be cheaper than running your own hardware, and vice versa. If you run a huge number of servers, AWS can cost a few hundred thousand dollars more on raw numbers that compare the cost of your own hardware to the cost of AWS. The problem with this vanilla comparison is that it forgets one extremely important cost for startups: opportunity cost.
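To make that comparison concrete, here is a rough sketch of how the math shifts once you count the people tied up in running hardware. Every number below is a hypothetical placeholder rather than a SimpleGeo figure, and `Option`, `raw_cost`, and `cost_with_people` are names made up for this illustration.

```python
from dataclasses import dataclass


@dataclass
class Option:
    name: str
    infrastructure_per_year: float  # hardware + colo + bandwidth, or the AWS bill
    ops_people: int                 # people whose time goes to keeping it running


def raw_cost(option: Option) -> float:
    """The 'vanilla' comparison: infrastructure spend only."""
    return option.infrastructure_per_year


def cost_with_people(option: Option, cost_per_person: float) -> float:
    """Infrastructure spend plus the people tied up in operations.

    cost_per_person stands in for opportunity cost: either salary you
    would not need to pay, or product work those people could be doing.
    """
    return option.infrastructure_per_year + option.ops_people * cost_per_person


# Hypothetical inputs, chosen only to illustrate the shape of the argument.
own_dc = Option("own hardware", infrastructure_per_year=1_000_000, ops_people=4)
aws = Option("AWS", infrastructure_per_year=1_300_000, ops_people=1)
COST_PER_PERSON = 150_000  # assumed fully loaded yearly cost of one engineer

for option in (own_dc, aws):
    print(option.name, raw_cost(option), cost_with_people(option, COST_PER_PERSON))
```

With these made-up inputs AWS loses on raw numbers but wins once people are counted. Plug in your own figures; the result can easily go the other way.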
I have a few rhetorical questions for people who aren’t using AWS.
- How many people does it take to maintain your own DC? People have to wrangle hardware, travel around to various DCs, RMA hardware, etc. If they weren’t doing those things, or you didn’t need those people at all, what else could you be doing with those resources?
- How much time, money, effort, and overhead is it going to take to create multiple data centers? Have you negotiated bandwidth contracts before? Do your data centers have power from multiple providers? Do they have power and bandwidth failover? Amazon has amazing economies of scale and has spent thousands of man hours (years?) preparing for power/bandwidth failover, floods/natural disasters, etc.
- Managing multiple data centers requires a small army of highly trained network operations people. Have you built DC failover before? Have you implemented load balancing across multiple DCs? It took me about 30 minutes to set up an Elastic Load Balancer that spread traffic across three Availability Zones (Amazon’s term for DCs).
- Have you thought about building your own automation and self-service APIs for the DC you want to build? Fabric/Chef/Puppet/Capistrano combined with AWS’s automation API is an extremely potent combination for automating large clusters. For instance, we use Fabric and Boto to automate the creation of all nodes in our cluster. I can run a command in Fabric that creates an API server out of thin air, bootstraps it, and puts it into our ELB. This takes about five minutes.
- Have you ever set up a DC in Europe? What about Asia? Would you even know where to start? I can spin up a server in Europe in a matter of seconds. How much might you spend on flying your network operations folks to and from all of these DCs you plan on building?
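As an illustration of the create-and-register flow mentioned in the automation question above, here is a minimal sketch. It assumes boto 2-style connection objects (`run_instances` on the EC2 connection, `register_instances` on the ELB connection). The function name, AMI ID, and load balancer name are hypothetical, and the bootstrap step is left out entirely.

```python
import time


def create_api_server(ec2, elb, ami_id, lb_name,
                      instance_type="m1.small",
                      poll_seconds=5, max_polls=60):
    """Launch one instance, wait until it is running, add it to a load balancer.

    `ec2` and `elb` are assumed to behave like boto 2's EC2Connection and
    ELBConnection. Bootstrapping the box is not shown here.
    """
    reservation = ec2.run_instances(ami_id, instance_type=instance_type)
    instance = reservation.instances[0]

    # Poll until the instance reports "running", or give up.
    for _ in range(max_polls):
        instance.update()  # refreshes instance.state from the API
        if instance.state == "running":
            break
        time.sleep(poll_seconds)
    else:
        raise RuntimeError("instance %s never reached 'running'" % instance.id)

    elb.register_instances(lb_name, [instance.id])
    return instance.id


# Hypothetical usage with real connections (needs boto and AWS credentials):
#   import boto.ec2, boto.ec2.elb
#   ec2 = boto.ec2.connect_to_region("us-east-1")
#   elb = boto.ec2.elb.connect_to_region("us-east-1")
#   create_api_server(ec2, elb, "ami-00000000", "api-lb")
```

Passing the connections in as arguments keeps the function easy to wrap in a Fabric task and easy to exercise against stand-in objects without touching AWS.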
These are just a few of the nooks and crannies that people often forget when comparing running their own data centers against AWS, and I think they’re extremely important. The two biggest costs that people forget, in my opinion, are opportunity cost and the cost of creating automation systems.
To expound a bit on opportunity cost, I’d like to quote the ever-thoughtful Joi Ito.
“If you want to increase the pace of innovation, you need to lower the cost of failure.” — Joi Ito
I can fire up an entire DC for SimpleGeo with a 20-30 node cluster with a few commands, totally automated, run load/consumption/system tests against it, find flaws in my system, and iterate in a matter of hours at a cost of a few hundred dollars.
The simple fact is that SimpleGeo wouldn’t be anywhere near as robust as it is, and indeed might not even exist, without leveraging the cloud.