In mid 2000, Eric A. Brewer, a former founder of Inktomi and chief scientist at Yahoo! and now currently a professor of Computer Science at U.C. Berkeley, presented a keynote speech at the ACM Symposium on the Principles of Distributed Computing. In his seminal speech, Brewer described a theorem based on research and observations he made called the CAP theorem.
The CAP theorem is based on the observation that a distributed system is governed by three fundamental characteristics:
- Partition tolerance
CAP is a is a useful tool in understanding the behavior of a distributed system. It states that given the three fundamental characteristics of a distributed computing system, you may have any two but never all three. It's usefulness in designing and building distributed systems cannot be overstated. So, how can we use the knowledge of this theorem to our advantage?
As the designer of an enterprise scale system, CAP provides us with a framework to make decisions regarding which tradeoffs must be made in our own implementations. But CAP allows us not only to understand the systems we are building more precisely, but also provides a framework by which we can classify all systems. It is thus an invaluable tool when evaluating the systems that we rely on day in and day out in our enterprise systems.
As an example, let's analyze the traditional Relational Database Management System (RDBMS). The RDBMS, arguably one of the most successful enterprise technologies in history, has been around in its current form for nearly 40 years! The primary reason for the staying power of the RDBMS lies with its ability to provide consistency. A consistent system is most easily understood and reasoned about, and therefore most readily adopted (thus explaining the popularity of the RDBMS). But what of the other properties? An RDBMS provides availability, but only when there is connectivity between the client accessing the RDBMS and the RDBMS itself. Thus it can be said that the RDBMS does not provide partition tolerance - if a partition arises between the client and the RDBMS, the system will not be able to function properly. In summary, we can thus characterize the RDBMS as a CA system due to the fact that it provides Consistency and Availability but not Partition tolerance.
As useful as this mechanism is, we can go one step further. Given that a system will always lack one of C, A, or P, it is common that mature systems have evolved a means of partially recovering the lost CAP characteristic. In the case of our RDBMS example, there are several well-known approaches that can be employed to compensate for the lack of Partition tolerance. One of these approaches is commonly referred to as master/slave replication. In this scheme, database writes are directed to a specially designated system, or master. Data from the master is then replicated to one or more additional, or slave, systems. If the master is offline then reads may be failed over to any one of the surviving read replica slaves.
Thus, in addition to characterizing systems by their CAP traits, we can further characterize them by identifying the recovery mechanism(s) they provide for the lacking CAP trait. In the remainder of this article I classify a number of popular systems in use today in enterprise, and non-enterprise, distributed systems. These systems are:
- Amazon Dynamo
- Oracle Coherence
- Google BigTable
Recovery Mechanisms: Master/Slave replication, Sharding
RDBMS systems are fundamentally about providing availability and consistency of data. The gold standard of RDMBS updates, referred to as ACID, governs the way in which consistent updates are recorded and persisted.
Various means of improving RDBMS performance are available in commercial systems. Due to the maturity of the RDBMS, these mechanisms are well understood. For example, the consistency conflicting reads and writes during the course of a transaction is referred to as isolation levels. The commonly accepted set of isolation levels, in decreasing order of consistency (and increasing order of performance), are:
- REPEATABLE READ
- READ COMMITTED
- READ UNCOMMITTED
- Master/Slave replication: A single master accepts writes, data is replicated to slaves. Data read from slaves may be slightly out of date, trading off some amount of Consistency to provide Partition tolerance.
- Sharding: While not strictly limited to database systems, sharding is commonly used in conjunction with a database system. Sharding refers to the practice of separating the entire application into vertical slices which are 100% independent of one another. Once completed, sharding isolates failures of any one system into "swimlanes" and is one example of "fault isolative architectures", thus limiting the impact of any single failure or related sets of failure to only one portion of an application. Sharding provides some measure of resistance to Partition tolerance by assuming that failures occur on a small enough scale to be isolated to a single shard, leaving the remaining shards operational.
Recovery: Read-repair, application hooks
Amazon's Dynamo is a private system designed and used solely by Amazon. Dynamo was intentionally designed to provide Availability and Partitioning tolerance, but not Consistency. This appearance of Amazon's Dynamo was very nearly as seminal as the introduction of the CAP theorem itself. Due to the dominance of the database, until Amazon introduced Dynamo to the world, it was very nearly a mainstay that enterprise systems must provide Consistency and therefore the tradeoffs available lie in the remaining two CAP characteristics of Availability or Partition tolerance.
Examining the requirements for Amazon's Dynamo, it's clear why the designers chose to buck the trend: Amazon's business model depends heavily on availability. Even the simplest of estimates pegs the losses Amazon could suffer from an outage at a minimum of $30,000 per minute. Given that Amazon's growth has nearly quadrupled since these estimates were made (in 2008), we can estimate that in 2010 Amazon may lose as much as $100,000 per minute. Put simply, availability matters a lot at Amazon. Furthermore, the fallacies of distributed computing already know tells us that the network is unreliable, and so therefore we must expect partitions to occur on a regular and frequent basis. So it's a simple matter then to see that the only remaining CAP characteristic left to sacrifice is Consistency.
Dynamo provides an eventual consistency model, where all nodes will eventually get all updates during their lifetime.
Given a system composed of N nodes, the eventual consistency model is tuned as follows:
- Setting the number of writes needed for a successful write operation (W).
- Setting the number of reads needed for a successful write operation (R).
Given that different nodes may have different versions of the same value (i.e., a value may have been written during a node downtime), Dynamo needs to:
- Track versions and resolve conflicts.
- Propagate new values.
New values are propagated by using hinted handoff and merkle trees.
Recovery: Quorum vote, majority partition survival
Terracotta is a Java-based distributed computing platform that provides high level features such as Caching via EHCache and highly available scheduling via Quartz. Additional support for Hibernate second level caching allows architects to easily adopt Terracotta in a standard JEE architecture that relies on Spring, Hibernate and an RDBMS.
Terracotta's architecture is similar to that of a database. Clients connect to one or more servers arranged into a "Server Array" layer. Updates are always Consistent in a Terracotta cluster, and availability is guaranteed so long as no partitions exist in the topology.
- Quorum: Upon failure of a single server, a backup server may take over once it has received enough votes from cluster members to elect itself the new master.
- Majority partition survival: In the event of a catastrophic partition involving many members of a Terracotta cluster that divides the cluster into one or more non-communicative partitions, the partition with a majority of remaining nodes is allowed to continue after a pre-configured period of time elapses.
Recovery: Partitioning, Read-replicas
Oracle Coherence is an in-memory Java data-grid and caching framework. It's main architectural component is its ability to provide consistency (hence it's name). All data in Oracle Coherence has at most one home. Data may be replicated to a configurable number of additional members in the cluster. When a system fails, replica systems vote on who becomes the new home for data that was homed in the failed system.
Coherence provides data-grid features that facilitate processing data using map-reduce like techniques (execute the work on the data, instead of moving data to the processing) and a host of distributed computing patterns are available in the incubator patterns.
- Data Partitioning: At a granular level, any one piece of data exhibits CA properties, that is to say that reads and writes of data in Coherence are always Consistent. As long as no partitions exist, data is available, meaning that for a particular piece of data, Coherence is not Partition tolerant. However, similar to the database sharding mechanism data may be partitioned across the cluster nodes, meaning that a partition will only affect a sub-set of all data.
- Read-replication: Coherence caches may be configured in varying topologies. When a Coherence cache is configured in read-replicated mode it exhibits CA. Data is consistent but writes block in the face of partitions.
CAP: CA or AP, depending on the replication scheme chosen
Recovery: Per-key data partitioning
GigaSpaces is a Java based application server that is fundamentally built around the notion of Space-based computing, an idea derived from Tuple Spaces which was the core foundation of the Linda programming system.
GigaSpaces provides high availability of data placed in the space by means of synchronous and an asynchronous replication scheme.
In a synchronous replication mode, GigaSpaces provides Consistency and Availability. The system is consistent and available, but can not tolerate partitions. In an asynchronous replication mode, GigaSpaces provides Availability and Partition tolerance. The system is available for reads and writes, but is only eventually consistent (after the asynchronous replication completes).
- Per-key data partitioning - GigaSpaces supports a mode called Partitioned-Sync2Backup. This allows for data to be partitioned based on a key to lower the risk of shared fate and to provide a synchronous copy for recovery.
Recovery: Partitioning, Read-repair
Apache Cassandra was developed by Facebook, using the same principles as Amazon's Dynamo, thus it is no surprise that Cassandra's CAP traits are the same as Dynamo's.
For read-recovery, Cassandra uses simple timestamps instead of the more difficult vector clocks implementation used by Amazon's Dynamo.
Apache CouchDB is a document oriented database that is written in Erlang.
Recovery: Configurable read-repair
Project Voldemort is an open-source distributed key value store developed by LinkedIn and released as open source in February of 2009. Voldemort exhibits similar characteristics as Amazon's Dynamo. It uses vector clocks for version detection and read-repair.
- Read-repair with versioning using vector clocks.
Google's BigTable is, according to Wikipedia, "a sparse, distributed multi-dimensional sorted map", sharing characteristics of both row-oriented and column-oriented databases." It relies on Google File System (GFS) for data replication.