Distributed vs. Fault Tolerant Systems

I’ve been researching implementations of distributed search technology for some things we want to do at SimpleGeo. It was during this research that I ran across a project called Katta, which is a “distributed” implementation of Lucene indexes. While perusing the documentation I came across a diagram detailing Katta’s architecture. What struck me as odd was that Katta purported to be a distributed system, yet had a master/slave setup managing it. This led me to tweet out:

Dear Engineers, It’s not a “distributed” system if you have a master/slave setup. That’s simply a “redundant” system.

Not long after this I got a few inquiries, along with some great input from the ever-wise Jan Lehnardt of the CouchDB project and from Christopher Brown:

RT @joestump: “…It’s not a ‘distributed’ system if you have a master/slave setup. That’s simply ‘redundant’” — Same: distributed != sharding

And it’s not “distributed” if it’s all hub-n-spokes. Might as well have a central server with some extra storage.

Basically, Jan and I are making the argument that redundancy or sharding/partitioning doesn’t really add up to a truly distributed system. Well, not in the sense of what I mean when I say “distributed.” Going back to my old friend, the CAP Theorem, I think we can see why.

Redundancy could be argued to be the A in CAP: always available. I’d even argue that partitioning, in the sense that most people think of it (e.g. partitioning 300m users across 100 MySQL servers), is also the A in CAP. The P in CAP stands for partition tolerance: the system keeps working even when the network partitions. Because of its master/slave setup, Katta really only implements the A in CAP.

Of course, CAP says that you can only have two of the three in any distributed system. I’d make the argument that no system is truly distributed unless it fully implements two of the three properties in CAP. I’d further argue that even a highly available (A), highly consistent (C) system wouldn’t be distributed, since it lacks tolerance to network partitioning.

The problem that I have with Katta’s definition of distributed is that I can’t truly distribute Katta in the raw sense of the word. For instance, I can’t spread Katta across three data centers and not worry about a data center going down. If I lose access to the master then Katta is worthless to me.

To have a truly distributed system I’d argue you need the following characteristics in your software:

  • There can be no master (hub).
  • There must be multiple copies of data spread across multiple nodes in multiple data centers.
  • Software and systems must be tolerant of data center failures.
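To make that a little more concrete, here’s a minimal Python sketch of the “no master” idea: a consistent-hashing ring where any node can work out which replicas own a key, with copies pinned to different data centers. The node names, data centers, and replication factor are all made up for illustration; this is not how Katta (or any particular system) does it.

```python
import hashlib
from bisect import bisect

# Hypothetical nodes and data centers, purely for illustration.
NODES = [
    ("node-1", "dc-east"), ("node-2", "dc-east"),
    ("node-3", "dc-west"), ("node-4", "dc-west"),
    ("node-5", "dc-eu"),   ("node-6", "dc-eu"),
]
REPLICAS = 3  # with three data centers, this puts one copy in each

def _hash(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

# A consistent-hashing ring: every node gets several points on the ring so
# keys spread out evenly and no single node acts as a master or hub.
_ring = sorted((_hash(f"{name}:{i}"), name, dc)
               for name, dc in NODES for i in range(64))

def replicas_for(key: str) -> list[tuple[str, str]]:
    """Walk the ring clockwise from the key's position, taking the first
    node seen in each distinct data center until REPLICAS copies are chosen."""
    start = bisect(_ring, (_hash(key),))
    chosen, seen_dcs = [], set()
    for offset in range(len(_ring)):
        _, name, dc = _ring[(start + offset) % len(_ring)]
        if dc not in seen_dcs:
            chosen.append((name, dc))
            seen_dcs.add(dc)
        if len(chosen) == REPLICAS:
            break
    return chosen

print(replicas_for("user:42"))  # one replica per data center
```

Because every node can compute a key’s replica set on its own, there’s no hub to lose: clients can talk to whichever replicas are reachable, and losing an entire data center still leaves copies elsewhere.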

My point is, I think the word “distributed” is applied too freely to systems that are, more appropriately, called “redundant” or “fault tolerant”.

UPDATE: Seems I confused some people about my master/slave argument. You can build distributed systems that have master/slave pieces in the puzzle. Large distributed systems are rarely one cohesive piece of software, rather they tend to be a series of systems glued together. What I am saying is that a simple master/slave system is not distributed. Through a series of bolts, glue, add-ons, and duct tape you can make a master/slave system sufficiently distributed, but don’t confuse the totality of that system with the base master/slave system.

Year in Review

  1. The year began in Koh Phangan, Thailand with my friend Chris Lea. We spent a month laying on beaches, swinging in hammocks, and drinking booze out of buckets.
  2. While in Thailand I got some more bamboo work done on my left arm.
  3. In February I went down to Miami for Future of Web Apps to talk about scaling your tech teams.
  4. Around my birthday I was able to score a copy of Netscape Navigator 2.0, still in the box, signed by Marc Andreessen.
  5. March brought the usual trip down to Austin, TX for SXSW. I spoke on a panel titled, “Designers and Developers: Why can’t we all just get along?”
  6. In April I attended the Social Foo Camp, which is an invite-only nerdfest put on by O’Reilly.
  7. May was an insane month of travel in a year of insane travel. I spent a week in Michigan, a week in Prague, a day in Phoenix, and a few days in Boulder, CO.
  8. While I was in Michigan, Jonathan and I got our pictures taken by my high school sweetheart, Erica, for our mom for Mother’s Day.
  9. When I returned from Prague I’d made the big decision to leave Digg and build a startup with Matt Galligan. Matt and I created a company called Crash Corp. that was going to build augmented reality and location-based games.
  10. In June I got a new face.
  11. Matt and I agreed to each take a month off to clear our heads before jumping into startup mode. For unknown reasons he decided to spend his month in the Midwest. I, on the other hand, chose to go to Amsterdam, Denmark, Norway, Ireland, and London. This marked my second month off for the year, which was awesome.
  12. I spent about ten days in Norway with my buddy Arne Fismen (Side note: His last name means “fart man” in Norwegian, which is definitely worse than my last name) and was able to fulfill a childhood dream of mine by visiting the world famous fjords of Norway. I can’t express my appreciation enough for what Arne and his family did for me. It was truly a magical experience.
  13. When I returned from Europe I spent a few days in San Francisco before heading down to San Diego for my buddy Dana’s bachelor party.
  14. After Dana’s bachelor party I moved to Boulder, CO to get to work on the company Matt and I were starting.
  15. Soon after getting on the ground and starting to work through things Matt and I realized we needed to change direction. As a result SimpleGeo was born, which provides location services to developers.
  16. While building SimpleGeo I decided to, after 11 years, switch from PHP to Python as my language of choice.
  17. The change of direction was a watershed moment for the company. Things crystallized for both us and the investors we were pitching. It wasn’t long after this that First Round Capital agreed to be our lead investor.
  18. October was mostly spent flying around to New York City and San Francisco pitching investors, VCs, etc.
  19. In November we closed a $1.5m round of financing from some of tech’s most well-known investors. I consider this to be the greatest achievement of my career so far.
  20. Over Thanksgiving I spent a few days down in Tulum, Mexico.
  21. In December I flew up to Seattle, WA for a quick visit. It’s still home to me and I can’t wait to move back.

According to TripIt and Dopplr I spent 142 days traveling this year. I don’t have complete numbers, but I’m guessing I logged over 80,000 miles on various airlines. In keeping with last year’s tradition, I think it’s only appropriate that I create a list of my year in cities.

  • Koh Samui, Thailand
  • Koh Phangan, Thailand
  • Bangkok, Thailand
  • Seattle, Washington
  • Miami, Florida
  • Austin, Texas
  • Sebastopol, California
  • Ypsilanti, Michigan
  • Ann Arbor, Michigan
  • East Jordan, Michigan
  • San Francisco, California
  • Prague, Czech Republic
  • Phoenix, Arizona
  • Boulder, Colorado
  • Amsterdam, The Netherlands
  • Roskilde, Denmark
  • Oslo, Norway
  • Bergen, Norway
  • Askvoll, Norway
  • Dublin, Ireland
  • Cork, Ireland
  • London, United Kingdom
  • New York City, New York
  • San Diego, California
  • Minneapolis, Minnesota
  • Ashland, Wisconsin

Disk IO and throughput benchmarks on Amazon’s EC2

When I told people that we were going to run our infrastructure on Amazon’s EC2, most people recoiled in disgust. I heard lots and lots of horror stories about how you simply couldn’t run production environments on EC2: disk IO was horrible, throughput was bad, etc. Someone should seriously tell Amazon, since a lot of their own infrastructure runs on EC2 and AWS. For the most part, Amazon’s AWS offerings started out as internal tools that were later released publicly.

We’ve been ironing out kinks in our production environment for the last few weeks, and one of the things that worried me was whether these assertions were true. So, I set out to run a fairly comprehensive test of disk IO and throughput. I ran hdparm -t, bonnie++, and iozone against ephemeral drives in various configurations along with EBS volumes in various configurations.

For all of my tests I tested the regular ephemeral drives as they were installed (“Normal”), LVM in a JBOD setup (“LVM”), the two ephemeral drives in a software RAID0 setup (“RAID0”), a single 100GB EBS volume (“EBS”), and two 100GB EBS volumes in a software RAID0 setup (“EBS RAID0”). All of the tests were run on large instances (m1.large) using the XFS file system.
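For anyone who wants to reproduce the RAID0 ephemeral setup, it was built with the usual mdadm and mkfs.xfs tooling; here’s a rough Python sketch of the commands. The device names and mount point are assumptions for a typical m1.large, not gospel.

```python
import subprocess

# Assumed device names for the two ephemeral drives on an m1.large and an
# assumed mount point; adjust both for your instance.
EPHEMERAL = ["/dev/sdb", "/dev/sdc"]
MD_DEVICE = "/dev/md0"
MOUNT_POINT = "/mnt/raid0"

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)

# Stripe the two ephemeral drives together (RAID0: speed, no redundancy).
run(["mdadm", "--create", MD_DEVICE, "--level=0",
     "--raid-devices=2", *EPHEMERAL])

# Format with XFS, as in the benchmarks above, and mount it.
run(["mkfs.xfs", "-f", MD_DEVICE])
run(["mkdir", "-p", MOUNT_POINT])
run(["mount", MD_DEVICE, MOUNT_POINT])
```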

hdparm -t

While hdparm -t isn’t the most comprehensive test in the world, I think it’s a decent gut check for simple throughput. To give a little context, I remember Digg’s production servers, depending on setup, ranging from 180MB/sec. to 340MB/sec. I’m guessing that if you upgraded to the XL instance type and did a RAID0 across its four ephemeral drives, you’d see even better numbers than the RAID0 ephemeral setup here.

What I also found pretty interesting about these numbers is that the EBS volumes stacked up “okay” against the ephemeral drives and that the EBS volumes in a RAID0 didn’t gain us a ton of throughput. Considering that the EBS volumes run over the network, which I assume is gigabit ethernet, 94MB/sec. is pretty much saturating that network connection and, to say the least, impressive given the circumstances.

For most applications, I’d guess that EBS throughput is just fine. The raw throughput only becomes a serious requirement when you’re moving around lots of large files. Most applications move around lots of small files, which I’d likely use S3 for anyways. If your application needs to move around lots of large files I’d consider using a RAID0 ephemeral drive setup with redundancy (e.g. MogileFS spreading files across many nodes in various data centers).
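If you want to run the same gut check, something like this will run hdparm -t a few times against a device and average the MB/sec it reports; the device path is just a placeholder for whatever you’re testing.

```python
import re
import subprocess

def hdparm_throughput(device: str, runs: int = 3) -> float:
    """Average the MB/sec reported by `hdparm -t` over a few runs."""
    results = []
    for _ in range(runs):
        out = subprocess.check_output(["hdparm", "-t", device], text=True)
        # hdparm -t prints a line ending in "... = 99.90 MB/sec"
        match = re.search(r"=\s*([\d.]+)\s*MB/sec", out)
        if match:
            results.append(float(match.group(1)))
    return sum(results) / len(results) if results else 0.0

# Placeholder device: point this at /dev/md0, an EBS volume, etc.
print(hdparm_throughput("/dev/md0"))
```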

bonnie++

Disregarding the Input/Output Block performance of the ephemeral RAID0 setup here, it’s extremely interesting to note that EBS IO performance is better than that of the ephemeral drives, and that EBS in a RAID0 was better than the ephemeral drives in almost every metric.

That all being said, the RAID0 ephemeral drives are the clear winner here. I do wonder, however, whether a RAID0 EBS array with, say, four, six, or eight volumes would be faster than the ephemeral RAID0 setup.

If your application is IO bound then I’d probably recommend using EBS volumes if you can afford it. Otherwise, I’d use RAID0 ephemeral drives. Again, the trick with the ephemeral drives is to ensure your data is replicated across multiple nodes. Of course, this is cloud computing we’re talking about, so you should be doing that anyways.

Here are the CPU numbers for the various configurations. One thing to note here is that EBS, LVM, and software RAID all come with CPU costs. Somewhat interesting to note is that EBS has substantially lower CPU usage in all areas except Input/Output Per Char.

If your application is both CPU and IO bound then I’d probably recommend upgrading your instance to an XL.

The last bonnie++ results are the random seeks per second and, wow, was I surprised. A single EBS volume runs pretty much dead even with the LVM JBOD, and the EBS RAID0 is on par with the RAID0 ephemeral drives.

To say I was surprised by these numbers would be an understatement. The lesson here is that, if your application does lots of random seeking, you’ll want to use either EBS RAID0 volumes or RAID0 ephemeral drives.
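For reference, the bonnie++ runs were along these lines. Treat the mount points, the user, and the working-set size as assumptions; bonnie++ wants a working set well beyond RAM, so roughly twice the 7.5GB of an m1.large.

```python
import subprocess

# Hypothetical mount points for the configurations benchmarked above.
CONFIGS = {
    "ephemeral-normal": "/mnt",
    "ephemeral-raid0": "/mnt/raid0",
    "ebs-single": "/vol/ebs",
    "ebs-raid0": "/vol/ebs-raid0",
}

for name, path in CONFIGS.items():
    # -d: directory to test, -s: working-set size in MB (about twice the RAM
    # of an m1.large so the page cache can't hide disk speed), -u: user to
    # run as when invoked as root, -m: label for the results.
    cmd = ["bonnie++", "-d", path, "-s", "15000", "-u", "nobody", "-m", name]
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)
```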

iozone

Before running these tests I’d never even heard of this application, but it seemed to be used by quite a few folks so I thought I’d give it a shot.

Again, some interesting numbers from the EBS volumes. What I found pretty interesting here is that the EBS RAID0 setup actually ended up being slower on a few metrics than a single EBS volume. No idea why that might be.

The other thing to note is that the single EBS volume outperformed the ephemeral RAID0 setup on a few different metrics, most notably random writes.
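The iozone runs were roughly of this shape; again, the mount points, the file-size cap, and the output paths are placeholders rather than the exact invocations.

```python
import subprocess

# Hypothetical mount points matching the configurations above.
CONFIGS = {
    "ephemeral-raid0": "/mnt/raid0",
    "ebs-single": "/vol/ebs",
    "ebs-raid0": "/vol/ebs-raid0",
}

for name, path in CONFIGS.items():
    # -a: automatic mode (sweeps record and file sizes),
    # -g: cap the maximum file size so a run finishes in reasonable time,
    # -f: scratch file on the volume under test,
    # -b: write results to a spreadsheet-friendly file.
    cmd = ["iozone", "-a", "-g", "4g", "-f", f"{path}/iozone.tmp",
           "-b", f"/tmp/iozone-{name}.xls"]
    print("+", " ".join(cmd))
    subprocess.check_call(cmd)
```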

Conclusions

I think the overall conclusion here is that disk IO and throughput on EC2 are pretty darn good. I have a few other conclusions as well.

  • If you can replicate your data across multiple nodes then the RAID0 ephemeral drives are the clear winners.
  • If you are just looking to store lots of small files and serve them up, then definitely use S3 with CloudFront.
  • You’d very likely get even more impressive numbers using the XL instances with RAID0 striped across four ephemeral drives.
  • Another potential disk setup would be to put different datasets on different ephemeral drives. For instance, put one MySQL database on one ephemeral drive and another database on the other.
  • If your setups are IO bound and you’re looking for lots of redundancy, then EBS volumes are likely the way to go. If you don’t need the super redundancy on a single box then use RAID0 on ephemeral drives.

Welcome to stu.mp

Whoa! What happened to the old site?! Well, after many years of rolling my own code, design, HTML, CSS, etc., I finally gave up and had some professionals take care of the hard parts. I gave up rolling my own code in favor of WordPress a while ago, and I’m now using the Lifestream plugin for this site, which I hacked to bits.

The more exciting part, though, is the design. It’s been in my head for years, but I could never figure out a way to convert it into bits and bytes. Thankfully, the highly skilled Jake Mix was able to take care of the gorgeous illustrations. Once I had the illustrations in hand, I had the young protégé Julian Targowski put the various bits together into the amazing design you see now.

There are still a few bits and bytes that are not aligned properly and I’m 99% sure that the site is totally broken in Internet Explorer (a fact I care little about). If you like what you see I highly recommend contacting Jake and/or Julian.