Distributed vs. Fault Tolerant Systems

I’ve been researching implementations of distributed search technology for some things we want to do at SimpleGeo. It was during this research that I ran across a project called Katta, which is a “distributed” implementation of Lucene indexes. While perusing the documentation I came across a diagram detailing the architecture of Katta. What struck me as odd was that Katta purported to be a distributed system, yet had a master/slave setup managing it. This led me to tweet:

Dear Engineers, It’s not a “distributed” system if you have a master/slave setup. That’s simply a “redundant” system.

Not long after this I got a few inquiries, along with some great input from the ever-wise Jan Lehnardt of the CouchDB project and from Christopher Brown:

RT @joestump: “…It’s not a ‘distributed’ system if you have a master/slave setup. That’s simply ‘redundant’” — Same: distributed != sharding

And it’s not “distributed” if it’s all hub-n-spokes. Might as well have a central server with some extra storage.

Basically, Jan and I are making the argument that redundancy or sharding/partitioning doesn’t really add up to a truly distributed system; at least not in the sense I have in mind when I say “distributed”. Going back to my old friend, the CAP Theorem, I think we can see why.

Redundancy could be argued to be the A in CAP: always available. I’d even argue that partitioning, in the sense most people mean (e.g. partitioning 300 million users across 100 MySQL servers), is the A in CAP. The P in CAP stands for partition tolerance: the system keeps working even when the network splits it into pieces. Because of its master/slave setup, Katta really only implements the A in CAP.
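
To make the distinction concrete, here’s a minimal Python sketch of the kind of partitioning most people mean; the shard count, hostnames, and user ID below are made up purely for illustration. Spreading users across shards helps with load and availability, but it does nothing for you when the network actually splits.

```python
# A minimal sketch of "partitioning" in the everyday sense: spreading users
# across database shards. Shard count, hostnames, and user IDs are
# hypothetical, for illustration only.

NUM_SHARDS = 100  # e.g. 100 MySQL servers

def shard_for_user(user_id: int) -> str:
    """Map a user to exactly one MySQL shard."""
    shard = user_id % NUM_SHARDS
    return f"mysql://db{shard:03d}.example.com/users"

# Every read and write for a given user goes to a single shard. If the
# network cuts that shard off from the application servers, those users are
# unreachable until the split heals. Nothing here tolerates a network
# partition, which is what the P in CAP actually refers to.
print(shard_for_user(123456789))  # -> mysql://db089.example.com/users
```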

Of course, CAP says that you can only have two of the three in any distributed system. I’d make the argument that no system is truly distributed unless it fully implements two of the three properties in CAP. I’d further argue that even a highly available (A), highly consistent (C) system wouldn’t be distributed, since it lacks the ability to be highly tolerant to network partitioning.

The problem I have with Katta’s definition of distributed is that I can’t truly distribute Katta in the raw sense of the word. For instance, I can’t spread Katta across three data centers and stop worrying about a data center going down. If I lose access to the master, Katta is worthless to me.

To have a truly distributed system, I’d argue you need the following characteristics in your software (a rough sketch follows the list):

  • There can be no master (hub).
  • There must be multiple copies of data spread across multiple nodes in multiple data centers.
  • Software and systems must be tolerant of data center failures.
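
As a rough sketch of what those three characteristics might look like together, here’s a hypothetical Python example that places replicas using rendezvous hashing: every node can compute placement on its own (no master), every key gets multiple copies, and the copies are forced into distinct data centers. The node names, replica count, and hashing scheme are assumptions for illustration only; this is not how Katta or any particular system works.

```python
# Hypothetical sketch of masterless, multi-data-center replica placement.
# Node names, data centers, and the replica count are made up.
import hashlib

NODES = {  # node -> data center
    "node-a": "dc1", "node-b": "dc1",
    "node-c": "dc2", "node-d": "dc2",
    "node-e": "dc3", "node-f": "dc3",
}
REPLICAS = 3  # aim for one copy in each data center

def _score(key, node):
    # Rendezvous (highest-random-weight) hashing: any node can compute the
    # same placement independently, so no master hands out assignments.
    return int(hashlib.md5(f"{key}:{node}".encode()).hexdigest(), 16)

def replicas_for(key):
    """Pick REPLICAS nodes for a key, at most one per data center."""
    chosen, used_dcs = [], set()
    for node in sorted(NODES, key=lambda n: _score(key, n), reverse=True):
        if NODES[node] not in used_dcs:
            chosen.append(node)
            used_dcs.add(NODES[node])
        if len(chosen) == REPLICAS:
            break
    return chosen

# Any surviving node can recompute placement and find a copy of the data,
# even if an entire data center drops off the network.
print(replicas_for("lucene-index-shard-7"))
```

Lose an entire data center and every key still has two reachable copies, and no surviving node is any more special than the others.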

My point is, I think the word “distributed” is applied too freely to systems that are, more appropriately, called “redundant” or “fault tolerant”.

UPDATE: Seems I confused some people about my master/slave argument. You can build distributed systems that have master/slave pieces in the puzzle. Large distributed systems are rarely one cohesive piece of software; rather, they tend to be a series of systems glued together. What I am saying is that a simple master/slave system, on its own, is not distributed. Through a series of bolts, glue, add-ons, and duct tape you can make a master/slave system sufficiently distributed, but don’t confuse the totality of that system with the base master/slave system.