Why I’ll never own another server

When Matt and I started SimpleGeo I made a decision early on to use Amazon’s AWS services to run our infrastructure. A lot of people think I’m nuts for doing this, for a lot of reasons, but I generally get two major questions/concerns when I mention that we run on AWS/EC2.

  1. AWS is slow!
  2. AWS is expensive!

I’ve covered IO performance on EC2 in depth before and have compared the IO benchmarks, favorably, against numbers from Digg’s and Media Temple’s systems engineers. The notion that AWS is too slow for your application is, largely, not supported by the numbers and comparisons. The second point I often make with regard to performance on AWS is that Amazon uses it to run large portions of their own infrastructure. Trust me, if it’s good enough for the largest online retailer in the world, it’s good enough for you.

The second point is a bit harder to defend sometimes. AWS can be cheaper than running your own hardware and vice versa. If you run a huge number of servers, a raw comparison of the cost of your own hardware against the cost of AWS can show AWS coming out a few hundred thousand dollars more expensive. The problem with this vanilla comparison is that it forgets one extremely important cost for startups – opportunity cost.

I have a few rhetorical questions for the people who aren’t using AWS.

  • How many people does it take to maintain your own DC? People have to wrangle hardware, travel around to various DCs, RMA hardware, etc. If you didn’t need those people, or they weren’t busy wiring your DC, what could you be doing with those resources?
  • How much time, money, effort, and overhead is it going to take to create multiple data centers? Have you negotiated bandwidth contracts before? Do they have power from multiple providers? Do they have power and bandwidth failover? Amazon has amazing economies of scale and has spent thousands of man hours (years?) preparing for power/bandwidth failover, floods/natural disasters, etc.
  • Managing multiple data centers requires a small army of highly trained network operations people. Have you built DC failover before? Have you implemented load balancing across multiple DCs? It took me about 30 minutes to set up an Elastic Load Balancer that spread traffic across three Availability Zones (Amazon’s term for DCs).
  • Have you thought about building your own automation and self-service APIs for the DC you want to build? Fabric/Chef/Puppet/Capistrano combined with AWS’s automation API is an extremely potent combination for automating large clusters. For instance, we use Fabric and Boto to automate the creation of all nodes in our cluster (there’s a rough sketch of this just after the list). I can run a command in Fabric that creates an API server out of thin air, bootstraps it, and puts it into our ELB. This takes about five or so minutes.
  • Have you ever set up a DC in Europe? What about Asia? Would you even know where to start? I can spin up a server in Europe in a matter of seconds. How much might you spend on flying your network operations folks to and from all of these DCs you plan on building?

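Here’s a minimal sketch, in the spirit of that Fabric/Boto automation. The AMI ID, key pair, security group, package name, and load balancer name are all placeholders I made up for the example, and the bootstrap step is a stand-in for a real configuration management run; this is not our actual deploy code.

```python
# Hypothetical sketch: spin up an EC2 node and register it with an ELB.
# Every name below (AMI, key, security group, package, ELB) is a placeholder.
import time

import boto.ec2
import boto.ec2.elb
from fabric.api import env, sudo, task


@task
def create_api_server(region='us-east-1', zone='us-east-1a'):
    """Create an API server out of thin air and put it behind the ELB."""
    ec2 = boto.ec2.connect_to_region(region)
    reservation = ec2.run_instances(
        'ami-00000000',               # placeholder AMI with a pre-baked base image
        instance_type='m1.large',
        key_name='example-keypair',   # placeholder key pair
        security_groups=['api'],
        placement=zone)
    instance = reservation.instances[0]

    # Wait for the instance to come up before bootstrapping it.
    while instance.update() != 'running':
        time.sleep(5)

    # Bootstrap the node; the real thing would kick off Chef/Puppet/Fabric tasks here.
    env.host_string = 'ubuntu@%s' % instance.public_dns_name
    sudo('apt-get update && apt-get -y install example-api-package')

    # Register the new node with the load balancer spanning our Availability Zones.
    elb = boto.ec2.elb.connect_to_region(region)
    elb.register_instances('api-lb', [instance.id])
    return instance.id
```

Run something like `fab create_api_server` and a few minutes later there’s a new node taking traffic; loop over a task like that and you have the 20-30 node test cluster I mention below.
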
These are just a few of the nooks and crannies, and I think extremely important ones, that people often forget when they compare the cloud against running their own data centers. The two biggest costs, in my opinion, that people forget are opportunity cost and the cost of creating automation systems.

To expound a bit on opportunity cost, I’d like to quote the ever-thoughtful Joi Ito.

“If you want to increase the pace of innovation, you need to lower the cost of failure.” — Joi Ito

I can fire up an entire DC’s worth of SimpleGeo, a 20-30 node cluster, with a few commands, totally automated; run load/consumption/system tests against it; find flaws in my system; and iterate in a matter of hours at a cost of a few hundred dollars.

The simple fact is that SimpleGeo wouldn’t be anywhere near as robust as it is, indeed it might not even exist, without leveraging the cloud.

NoSQL vs. RDBMS: Let the flames begin!

I’ve been getting solidly flamed recently, as have my former coworkers at Digg, my friends at Twitter, etc., about our adoption and promotion of various NoSQL storage systems. It seems that some DBAs are very, very upset that we internet kids are considering abandoning SQL’s ship. I’m not here to throw out a bunch of insane numbers or benchmarks, or to flame back, but I did want to point out why SimpleGeo and others are jumping onto the NoSQL bandwagon.

First and foremost, I haven’t heard of anyone saying MySQL or PostgreSQL on comparable hardware is faster than the NoSQL options. The best I’ve heard is that MS SQL setups on SSD drives with lots of RAM could do 6,100 result sets a second. I guess, based on these posts, I’d like to ask a few questions of the people who honestly think RDBMSs can compete with NoSQL solutions at large scale.

  • Do you honestly think that the PhDs at Google, Amazon, Twitter, Digg, and Facebook created Cassandra, BigTable, Dynamo, etc. when they could have just used an RDBMS instead?
  • Has anyone run RDBMS benchmarks with highly heterogeneous datasets with lots of varying indexes on them? At Digg we had probably a hundred or so tables, each with varying indexes (a char here, an integer there, a date+time here). Disk IO becomes a serious problem when indexes for different tables are stored on different parts of the disks and you have concurrent reads/writes. I know that people have found ways around this, such as 37signals’ systems guy putting 15 x 15k RPM drives in his DB server. Assuming $500 a disk (15k disks range from $300 to $800 on Newegg), that’s $7,500 just for disks.
  • Anyone out there running an EC2 large instance with an RDBMS on it that’s doing 1,800 reads/second? I’ve got a Cassandra node that was getting hammered, with a load of 6, serving that much traffic without falling over, which I think is pretty decent when you consider each node could easily do that and adding more nodes to handle more load is trivial.
  • How much are you spending on those MS SQL servers with SSD drives that serve up 6,100 results a second? MS SQL is $5,999 per processor. Windows Server 2008 is another $1,029. Decent 128GB SSDs appear to cost around $450 each. You see where I’m going with this. Nobody is arguing you can’t get RDBMSs to scale up to a few thousand reads/writes a second if you can afford to spend $50,000 or $100,000 per server. The problem is that very few startups can spend that much money on a single server.
  • How much time are your DBAs spending administering your RDBMSs? How much time are they in the data centers? How much do those data centers cost? How much do DBAs cost a year? Let’s say you have 10 monster DB servers and 1 DBA; you’re looking at about $500,000 in database costs.
  • How easy is it to add a new server to your cluster? If we identify a hot spot in our Cassandra cluster, we can have a new node bootstrapped into our cluster in about five minutes. And I mean it’s in production taking writes and serving reads.
  • Does your RDBMS automatically rebalance the entire cluster when a new node is bootstrapped into it?
  • I’m running a 50 node cluster, which spans three data centers, on Amazon’s EC2 service for about $10,000 a month. Furthermore, this is an operational expense as opposed to a capital expense, which is a bit nicer on the books. In order to scale an RDBMS to 6,000 reads/second I’d need to spend on the order of five months’ worth of operating my 50 node cluster.
  • Has anyone run benchmarks with MySQL or PostgreSQL in an environment that sees 35,000 requests a second? IO contention becomes a huge issue when your stack needs to serve that many requests simultaneously. I know of one company that’s managing to scale portions of their PostgreSQL setup by purchasing $250,000 servers. A single one of those servers costs as much as two years of my 50 node EC2 cluster (see the quick arithmetic after this list).

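To spell out the arithmetic from those last two points, here’s the back-of-envelope version using the numbers above; the server prices are the rough figures cited in this post, not quotes.

```python
# Back-of-envelope: how many months of the EC2 cluster does one big RDBMS box buy?
cluster_monthly_cost = 10000       # 50 node, three-DC EC2 cluster, dollars/month

big_rdbms_server = 50000           # the "few thousand reads/sec" class of box
monster_postgres_server = 250000   # the $250,000 PostgreSQL box mentioned above

print(big_rdbms_server // cluster_monthly_cost)         # 5  -> five months of cluster
print(monster_postgres_server // cluster_monthly_cost)  # 25 -> about two years
```
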
I guess what I’m saying is that my decision to use NoSQL, and I’m guessing others’ decisions to do so, has less to do with whether or not we can squeeze a few thousand writes a second out of MySQL and more to do with management and cost overhead. NoSQL solutions allow us to serve absurd amounts of data for a really, really low price. I’m happy to put my $/write, $/read, and $/GB numbers for my NoSQL setup against anyone’s RDBMS numbers.

We’re not nearly as dumb as everyone thinks we are; I promise.

Distributed vs. Fault Tolerant Systems

I’ve been researching implementations of distributed search technology for some things we want to do at SimpleGeo. It was about this time that I ran across a project called Katta, which is a “distributed” implementation of Lucene indexes. While perusing the documentation I ran across a diagram detailing the architecture of Katta. What struck me as odd was that Katta was purporting to be a distributed system, yet had a master/slave setup managing it. This led me to tweet out:

Dear Engineers, It’s not a “distributed” system if you have a master/slave setup. That’s simply a “redundant” system.

Not long after this I got a few inquiries, along with some great input from the ever-wise Jan Lehnardt of the CouchDB project and from Christopher Brown:

RT @joestump: “…It’s not a ‘distributed’ system if you have a master/slave setup. That’s simply ‘redundant’” — Same: distributed != sharding

And it’s not “distributed” if it’s all hub-n-spokes. Might as well have a central server with some extra storage.

Basically, Jan and I are making the argument that redundancy or sharding/partitioning doesn’t really add up to a truly distributed system. Well, not in the sense of what I think of when I think of distributed. Going back to my old friend, the CAP Theorem, I think we can see why.

Redundancy could be argued to be the A in CAP: always available. I’d even argue that partitioning, in the sense most people think of it (e.g. partitioning 300m users across 100 MySQL servers), is also the A in CAP. The P in CAP, though, means that a system is highly tolerant to network partitioning. Because of the master/slave setup of Katta, it really only implements the A in CAP.

Of course, CAP says that you can only have two of the three in any distributed system. I’d make the argument that no system is truly distributed unless it fully implements two of the three properties in CAP. I’d further argue that even a highly available (A), highly consistent (C) system wouldn’t be distributed, since it lacks the ability to be highly tolerant to network partitioning.

The problem that I have with Katta’s definition of distributed is that I can’t truly distribute Katta in the raw sense of the word. For instance, I can’t spread Katta across three data centers and not worry about a data center going down. If I lose access to the master then Katta is worthless to me.

To have a truly distributed system I’d argue you need the following characteristics in your software (a toy sketch of what this looks like follows the list):

  • There can be no master (hub).
  • There must be multiple copies of data spread across multiple nodes in multiple data centers.
  • Software and systems must be tolerant of data center failures.

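As a toy sketch, and nothing more, of those three properties, here’s roughly what I mean. The node names, data centers, and placement scheme below are entirely made up for illustration; this isn’t code from Cassandra, Dynamo, or anything else.

```python
# Toy illustration of: no master, multiple copies across data centers,
# and tolerance of an entire data center going away.
import hashlib

NODES = [
    ('node-1', 'dc-east'), ('node-2', 'dc-east'),
    ('node-3', 'dc-west'), ('node-4', 'dc-west'),
    ('node-5', 'dc-eu'),   ('node-6', 'dc-eu'),
]
REPLICAS = 3  # one copy of every key in each data center


def replicas_for(key):
    """Pick REPLICAS nodes for a key, each in a different data center."""
    start = int(hashlib.md5(key.encode()).hexdigest(), 16) % len(NODES)
    chosen, seen_dcs = [], set()
    for i in range(len(NODES)):
        node, dc = NODES[(start + i) % len(NODES)]
        if dc not in seen_dcs:
            chosen.append(node)
            seen_dcs.add(dc)
        if len(chosen) == REPLICAS:
            break
    return chosen


def read(key, down_dcs=()):
    """Any node holding a copy can answer; there is no master to lose."""
    live = [n for n in replicas_for(key) if dict(NODES)[n] not in down_dcs]
    if not live:
        raise RuntimeError('all replicas unreachable')
    return live[0]


print(replicas_for('user:42'))                 # three nodes in three different DCs
print(read('user:42', down_dcs=['dc-east']))   # still readable with a whole DC down
```
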
My point is, I think the word “distributed” is applied too freely to systems that are more appropriately called “redundant” or “fault tolerant”.

UPDATE: It seems I confused some people about my master/slave argument. You can build distributed systems that have master/slave pieces in the puzzle. Large distributed systems are rarely one cohesive piece of software; rather, they tend to be a series of systems glued together. What I am saying is that a simple master/slave system is not distributed. Through a series of bolts, glue, add-ons, and duct tape you can make a master/slave system sufficiently distributed, but don’t confuse the totality of that system with the base master/slave system.

MySQL Cluster: New Features and Enhancements

This is specifically about new features in the 5.1 release. As much as I like Australia, I think it’s ridiculous to be talking about that at a MySQL conference, though it is funny I suppose.

  1. They now support variable sized rows, which reduces memory usage. This equates to more rows per gigabyte.
  2. Add/Update Index has been optimized over 5.0. Before, it copied the entire table, added the new index, and then moved it back over the old table. This all happened over the wire, so you can imagine how long that took. This optimization speeds things up about four times.
  3. 5.1 now allows you to replicate across clusters. Used for geographical redundancy, splitting the processing load (why not add more nodes to the cluster?), etc. You have to use row-based replication to enable this feature.
  4. Failover of replication channels is manual.
  5. They’re adding support for data on disk in 5.1 and indexes on disk in the future.

This absolutely amazes me. The moral of the story at MySQL UC has been that clustering is a nightmare to set up, maintain and use. However, storing data on disk is a step in the right direction. The solution from Continuent seems to be infinitely more elegant than both clustering and replication.

MySQL Replication: New Features and Enhancements

  1. MySQL 5.0 supports auto-increment variables for bi-directional replication to avoid auto-increment collisions.
  2. MySQL 5.0 replicates character sets and time zones.
  3. 5.0 now replicates stored procedures, triggers and views.
  4. MySQL 5.1 introduces row-based logging and replication (RBR).
  5. Dynamic switching of binary log format between ROW, STATEMENT and MIXED.
  6. 5.1 allows for cluster replication.
  7. The replication method cannot be configured per table, but since it’s dynamic you can change it from the client before a transaction, etc.
  8. With auto_increment_offset and auto_increment_increment you can change the starting point and the step between generated values. Works with most table types, including InnoDB and MyISAM. This is specifically for multi-master setups.
  9. RBR allows clusters to replicate and also allows the server to replicate non-deterministic statements such as LOAD_FILE(). That being said, you can’t have different table definitions on the slave as you can with SBR.

Wow, nothing super impressive here that makes me excited about new features in MySQL’s 5.0/5.1 replication. Their answer to the possible auto_increment collisions seems a bit simplistic and short-sighted, but then again I’m not a maintainer for MySQL’s replication code.
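
For what it’s worth, here’s roughly how that offset/increment scheme (item 8 above) plays out with two masters; the numbers are just an illustration of the arithmetic, not a recommendation.

```python
# Each master generates IDs of the form offset + increment * n, so with
# increment=2 one master takes the odd IDs and the other takes the even ones.
def generated_ids(offset, increment, count):
    return [offset + increment * n for n in range(count)]

master_a = generated_ids(offset=1, increment=2, count=5)  # [1, 3, 5, 7, 9]
master_b = generated_ids(offset=2, increment=2, count=5)  # [2, 4, 6, 8, 10]

assert not set(master_a) & set(master_b)  # the two ID streams never collide
```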

What annoys me greatly is MySQL’s refusal to add features to existing storage engines in favor of simply adding new storage engines. MyISAM doesn’t support transactions or foreign key constraints? Use InnoDB. InnoDB doesn’t support FULLTEXT? Use MyISAM. You need synchronous replication of data? Use NDB, but you need to denormalize and reduce your use of JOINs.

It’s enough to make me switch to PostgreSQL.

Advanced MySQL Replication with Load Balancing

Today I thought I’d check out a session by the CTO of Continuent about their clustering solution. Before heading into the session we checked out Continuent’s booth in the exhibit hall. It certainly sounds like a great product; however, the $5,000 per CPU licensing seems a bit Oracle-ish to me.

  1. A shared-nothing architecture. Split up into two layers: the controllers and the actual database nodes.
  2. Low latency.
  3. Single-copy equivalence for reads.
  4. Supports load balancing and heterogeneous databases (i.e. MySQL and SQL Server sharing the same cluster).
  5. 100% Java and based on Sequoia.
  6. They might have an open source version. Will need to check this out.
  7. Fully transparent (including failover) to the application.
  8. The controllers act as a proxy to the database. To the applications they appear as the actual database. Below these controllers is where the databases actually sit.
  9. You need to compile/load their specific driver, which I think would require a new PEAR DB driver.
  10. Requests come in and the controller determines what type of request each is (read vs. write). It then broadcasts writes to all controllers. All requests are executed in identical order. From there the request is sent to the scheduler, which makes sure the underlying databases remain in identical states. These requests can be sent in parallel to each database server.
  11. The request controller then aggregates the responses from the databases. If there was an error across all of the systems then it’s a bad query of some sort; if only a single node responds with an error then that system is dropped. This can be configured to return a success message once one of the underlying nodes responds with a success.
  12. Reads are simply load balanced across the underlying nodes to the node with the least number of requests (sounds like LVS’s weighted least connection algorithm).
  13. Works with MyISAM, InnoDB and heap table types.
  14. The commits are synchronous.
  15. Once you have a dump with a starting date you can apply the dump to the new node and then the cluster controllers apply the logs until it’s up to date.
  16. The open source solution only supports ANSI SQL, doesn’t come with their own group talk protocol (so the clusters can communicate with each other) and doesn’t come with the database-specific dump and load.
  17. The cluster controllers keep track of the position in the overall sequence of each underlying node and sends reads only to those nodes that are up-to-date with the current position.
  18. The largest cluster they have is four nodes. It has been tested with as many as 64 nodes and supports tiering.
  19. Failover between cluster controllers happens at the driver level.
  20. Stored procedures were a “challenge”. Their approach allows you to tell their controllers what is inside of the stored procedure (what tables it changes, etc.) so the controller knows how to handle each procedure.

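Here’s a conceptual sketch of the controller behavior described in items 10-12, as I understood the talk. It’s my reading of the architecture, not Continuent’s actual code; the `execute` method on the node objects is an assumed stand-in for a real database connection.

```python
# Conceptual sketch: writes are broadcast to every node in the same order,
# reads go to the node with the fewest in-flight requests.
class Controller(object):
    def __init__(self, nodes):
        self.nodes = list(nodes)                   # assumed node stubs with .execute()
        self.active = dict((n, 0) for n in nodes)  # in-flight request counts

    def write(self, statement):
        """Broadcast a write; drop a node that errors while the others succeed."""
        results, errors = {}, []
        for node in list(self.nodes):
            try:
                results[node] = node.execute(statement)
            except Exception:
                errors.append(node)
        if len(errors) == len(self.nodes):
            raise RuntimeError('bad query: every node rejected it')
        for node in errors:                        # a lone failure means a bad node
            self.nodes.remove(node)
        return next(iter(results.values()))

    def read(self, statement):
        """Least-connections read balancing, a la LVS."""
        node = min(self.nodes, key=self.active.get)
        self.active[node] += 1
        try:
            return node.execute(statement)
        finally:
            self.active[node] -= 1
```
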
After sitting in on the MySQL Clustering tutorial, I can honestly say this approach is infinitely better. It supports writing data to disk, InnoDB, MyISAM and, on top of that, allows you to tier the clusters. The only downside I see is that it’s written in Java, but these days I’m not sure if that’s really a downside.

MySQL Cluster Configuration, Tuning, and Maintenance

Today is the 1st day of MySQL UC and I’m sitting in on the clustering tutorial, which promises to be three hours of information on MySQL’s clustering feature. I’m sure you’re thrilled that I’m, ahem, live blogging this. In reality, this is just so I have a central place for my notes.

  1. Broken up into three parts. The MySQL servers sit separate from the NDB storage engine nodes (the storage nodes, or NDB nodes). The third part is called a management server. The management server, oddly enough, isn’t required once the cluster is up and running unless you want to add another storage node.
  2. Memory based storage engine. If you’re not using 5.1+ then you must have enough RAM in each storage node to store the data set. This means that if you have four machines with 4GB of RAM each you can store 8GB of data (16GB of total storage divided by two for two copies of the data set).
  3. Storage nodes are static and pre-allocate resources on startup.
  4. Supports transactions and row level locking.
  5. Should be noted this is a storage engine so you can’t create MyISAM or InnoDB tables inside of a cluster.
  6. Uses fixed-size records. This means if you have a varchar(255) and put a single byte into it, that field still uses 255 bytes.
  7. No foreign key constraint support.
  8. Replication across nodes is synchronous. I assume this means that an INSERT completes once all of the nodes have completed the INSERT. This is different from regular replication, which is asynchronous and introduces race conditions.
  9. Tables are divided into fragments (one fragment for each storage node). Each storage node is responsible for each fragment. Each fragment also has a secondary fragment, which is a copy of another node’s primary fragment. This data distribution happens automatically.
  10. NDB takes your primary key, creates a hash and then converts that to a fragment, so you’ll have various rows on each different storage node (see the sketch after this list).
  11. A node group is a set of nodes that share the same fragment information. If you lose an entire node group you’ve lost half of the table and the cluster will not continue to operate. However, if one node in a node group fails there will still be enough data to keep the cluster up and running.
  12. If a node fails and its secondary counterpart takes over, it will, essentially, have to perform the job of two nodes. Until a node has fully recovered it will not rejoin the cluster.
  13. Backups are hot and non-locking. Each node writes its own set of backup files. No support for incremental backups.
  14. Because it’s memory based you could lose data on a system crash (as you might have transactions sitting in RAM when a crash occurs). The COMMIT command does not write changes to disk, meaning that you could have data sitting in memory that’s not on disk when a node crashes. The odd truth is that MySQL Cluster supports synchronous replication, but commits are not durable.
  15. NDB nodes will checkpoint data to disk (data + logs), which are used for system recovery. They write two logs, the UNDO and REDO logs.
  16. They recommend using TRUNCATE to delete all rows from a table.
  17. Modification operations are distributed to both the primary and secondary fragments (obviously).
  18. NDB will run on 64-bit machines. They recommend Dual CPU 64-bit machines. NDB is threaded. Application nodes (MySQL servers) can be whatever.
  19. SCI offers 30-100% better performance over gigabit.
  20. They actually recommend avoiding joins and denormalizing your schemas. Are you kidding me? He actually said “Performance for joins sucks.”

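Here’s a toy model of how the rows end up spread around (items 9-11), matching my reading of the tutorial; the hashing and layout below are made up for illustration and are not NDB’s actual implementation.

```python
# Toy model: hash the primary key to a fragment, the fragment lives on one
# node, and its backup copy lives on the other node in the same node group.
import hashlib

NODE_GROUPS = [['ndb-1', 'ndb-2'], ['ndb-3', 'ndb-4']]  # four storage nodes
NODES = [node for group in NODE_GROUPS for node in group]

# Memory math from item 2: four nodes x 4GB each, divided by two copies = 8GB.


def fragment_for(primary_key):
    digest = hashlib.md5(str(primary_key).encode()).hexdigest()
    return int(digest, 16) % len(NODES)        # one fragment per storage node


def replicas_for(primary_key):
    frag = fragment_for(primary_key)
    group = NODE_GROUPS[frag // 2]             # the node group that owns the fragment
    primary = group[frag % 2]
    secondary = group[(frag + 1) % 2]          # the other node in the same group
    return primary, secondary


# Losing one node per group is survivable; losing a whole node group is not.
print(replicas_for(42))
print(replicas_for('some-key'))
```
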
Overall, I’m underwhelmed by MySQL Cluster. You’re limited in storage by RAM and you can’t optimize your schemas due to the fixed field sizes. And any RDBMS “solution” that recommends you denormalize puts me off.

That being said, the actual technology is pretty interesting and I suspect that in a few years we’ll see the clustering features in MySQL come into their own. As of now, I suspect few people would be able to justify the sacrifices for the gains clustering allows.