Distributed vs. Fault Tolerant Systems

Posted on Sunday, December 27th, 2009 at 10:27 am in architecture, clustering, coding

I’ve been researching implementations of distributed search technology for some things we want to do at SimpleGeo. It was about this time that I ran across a project called Katta, which is a “distributed” implementation of Lucene indexes. While perusing the documentation I ran across a diagram detailing the architecture of Katta. What struck me as odd was that Katta was purporting to be a distributed system, yet had a master/slave setup managing it. This led me to tweet out:

Dear Engineers, It’s not a “distributed” system if you have a master/slave setup. That’s simply a “redundant” system. #

Not long after this I got a few inquiries along with some great input from the ever-wise Jan Lehnardt from the CouchDB project along with Christopher Brown:

RT @joestump: “…It’s not a ‘distributed’ system if you have a master/slave setup. That’s simply ‘redundant’” — Same: distributed != sharding #

And it’s not “distributed” if it’s all hub-n-spokes.  Might as well have a central server with some extra storage. #

Basically, Jan and I are making the argument that redundancy or sharding/partitioning doesn’t really add up to a truly distributed system. Well, not in the sense of what I think of when I think of distributed. Going back to my old friend, the CAP Theorem, I think we can see why.

Redundancy could be argued to be the A in CAP; always available. I’d even argue that partitioning, in the sense that most people think of (e.g. partitioning 300m users across 100 MySQL servers), is the A in CAP. The P for partitioning in CAP is that a system is highly tolerant to network partitioning. Because of the master/slave setup of Katta, it really only implements the A in CAP.

Of course, CAP says that you can only have two in any distributed system. I’d make the argument that no system is truly distributed unless it fully implements two of the three properties in CAP. I’d further argue that if you had a highly available (A), highly consistent system (C) even it wouldn’t be distributed (lacking the ability to be highly tolerant to network partitioning).

The problem that I have with Katta’s definition of distributed is that I can’t truly distribute Katta in the raw sense of the word. For instance, I can’t spread Katta across three data centers and not worry about a data center going down. If I lose access to the master then Katta is worthless to me.

To have a truly distributed system I’d argue you need the following characteristics in your software:

  • There can be no master (hub).
  • There must be multiple copies of data spread across multiple nodes in multiple data centers.
  • Software and systems must be tolerant of data center failures.

My point is, I think the word “distributed” is applied too freely to systems that are, more appropriately, called “redundant” or “fault tolerant”.

UPDATE: Seems I confused some people about my master/slave argument. You can build distributed systems that have master/slave pieces in the puzzle. Large distributed systems are rarely one cohesive piece of software, rather they tend to be a series of systems glued together. What I am saying is that a simple master/slave system is not distributed. Through a series of bolts, glue, add-ons, and duct tape you can make a master/slave system sufficiently distributed, but don’t confuse the totality of that system with the base master/slave system.

It's not the language stupid

Posted on Tuesday, April 15th, 2008 at 7:49 pm in architecture, code, database, db, scaling

I’ve said it once, I’ve said it twice, I’ve screamed it from the top of mountains and yet nobody listens. I’m sitting in a session at the MySQL Conference and the person presenting just said, “You have to have well written code to avoid bottlenecks.” This is, put bluntly, stupid and patently false. Let me explain.

  • Your true bottlenecks when scaling are very rarely, if ever, because of your language. Sure Ruby is slower than PHP or Perl or Python, but only incrementally so and it’s only going to get faster. Even if your language is your problem it’s the easiest part of your architecture to scale; add more hardware.
  • Just because your code is well written doesn’t mean it will perform well and, conversely, just because you write shitty code doesn’t mean your code will not perform well. I’ve seen some seriously shitty PHP code that’s blazing fast because it’s so simple.
  • Depending on your application, as you grow you’ll find that your scaling issues come down to one fundamental problem: I/O. DB I/O, file system / disk I/O, network traffic, etc, etc. Ask anyone who’s written a large scale application where their growing pains were and I’ll bet my last dollar it wasn’t “PHP/Python/Ruby/Perl/Java/COBOL is slow”. I’m betting they’ll say something along the lines of “MySQL took a crap on us after we hit 200,000,000 records and had to do date range scans.” Or they’ll say, “I was storing user generated content and NFS couldn’t scale to the amount of requests for that content.”

I’m sick and tired of the language zealots who say PHP is slower than Perl or Ruby is slower than PHP or Java  sucks because which language you’re using has zero to do with that missing index on your table or the fact that you can’t store all of that user generated content.

It comes down to your architecture and, despite what the zealots would have you believe, the language you choose is only one component of your overall architecture. Choose what you know and run with it.

Projects

  • Ready-to-use location infrastructure for developers.

    SimpleGeo

    Ready-to-use location infrastructure for developers.

  • Consistent, predictable Twitter avatars backed by an enterpise CDN.

    tweetimag.es

    Consistent, predictable Twitter avatars backed by an enterpise CDN.

Open Source Projects

Categories