I’ve been getting solidly flamed recently, as have my former coworkers at Digg, my friends at Twitter, etc. about our adoption and promotion of various NoSQL storage systems. It seems that some DBAs are very, very upset that us internet kids are considering abandoning SQL’s ship. I’m not here to throw out a bunch of insane numbers, benchmarks, or flame back, but I did want to point out why SimpleGeo and others are jumping onto the NoSQL bandwagon.
First, and foremost, I haven’t heard of anyone saying MySQL or PostgreSQL on comparable hardware is faster than NoSQL options. The best I’ve heard is that MS SQL setups on SSD drives with lots of RAM could do 6,100 result sets a second. I guess, based on these posts, I’d like to ask a few questions to the people who honestly think RDBMSs can compete with NoSQL solutions at large scale.
- Do you honestly think that the PhDs at Google, Amazon, Twitter, Digg, and Facebook created Cassandra, BigTable, Dynamo, etc. when they could have just used a RDBMS instead?
- Has anyone ran RDBMS benchmarks with highly heterogeneous datasets with lots of varying indexes on them? At Digg we had probably a hundred or so tables, each table had varying indexes (a char here, an integer there, a date+time here). Disk IO becomes a serious problem when indexes for different tables are stored on different parts of disks and you have concurrent reads/writes. I know that people have found ways around this, such as 37Signals systems guy putting 15 x 15k RPM drives on his DB server. Assuming $500 a disk (15k disks range from $300 to $800 on Newegg) that’s $7,500 just for disks.
- Anyone out there running an EC2 large instance with a RDBMS on it that’s doing 1,800 reads/second? I’ve got a Cassandra node that was getting hammered with a load of 6 serving that much traffic without falling over, which I think is pretty decent when you consider each node could easily do that and adding more nodes to handle more load is trivial.
- How much are you spending on those MS SQL servers with SSD drives that serve up 6,100 results a second? MS SQL is $5,999 per processor. Windows Server 2008 is another $1029. Decent 128GB SSDs appear to cost around $450 each. You see where I’m going with this. Nobody is arguing you can’t get RDBMSs to scale up to a few thousand reads/writes a second if you can afford to spend $50,000 or $100,000 per server. The problem is that very few startups can spend that much money on a single server.
- How much time are your DBAs spending administering your RDBMSs? How much time are they in the data centers? How much do those data centers cost? How much do DBAs cost a year? Let’s say you have 10 monster DB servers and 1 DBA; you’re looking at about $500,000 in database costs.
- How easy is it to add a new server to your cluster? If we identify a hot spot in our Cassandra cluster, we can have a new node bootstrapped into our cluster in about five minutes. And I mean it’s in production taking writes and serving reads.
- Does your RDBMS automatically rebalance the entire cluster when a new node is bootstrapped into it?
- I’m running a 50 node cluster, which spans three data centers, on Amazon’s EC2 service for about $10,000 a month. Furthermore, this is an operational expense as opposed to a capital expense, which is a bit nicer on the books. In order to scale a RDBMS to 6,000 reads/second I’d need to spend on the order of five months of operation of my 50 node cluster.
- Has anyone ran benchmarks with MySQL or PostgreSQL in an environment that sees 35,000 requests a second? IO contention becomes a huge issue when your stack needs to serve that many requests simultaneously. I know of one company that’s managing to scale portions of their PostgreSQL servers by purchasing $250,000 servers. This would cover my 50 node EC2 cluster for two years.
I guess what I’m saying is that my decision to use NoSQL, and I’m guessing others’ decisions to do so, has less to do with the fact that we can’t squeeze a few thousand writes a second out of MySQL and more to do with management and cost overhead. NoSQL solutions allow us to serve absurd amounts of data for a really, really low price. I’m happy to put my $/write, $/read, and $/GB numbers for my NoSQL setup against anyone’s RDBMS numbers.
We’re not nearly as dumb as everyone thinks we are; I promise.