Sunday, January 03, 2010

Characterizing Enterprise Systems using the CAP theorem

In mid-2000, Eric A. Brewer, a co-founder of Inktomi, chief scientist at Yahoo!, and currently a professor of Computer Science at U.C. Berkeley, delivered the keynote at the ACM Symposium on Principles of Distributed Computing (PODC). In that seminal talk, Brewer presented a theorem, based on his research and observations, that has come to be known as the CAP theorem.


The CAP theorem is based on the observation that a distributed system is governed by three fundamental characteristics:

  1. Consistency
  2. Availability
  3. Partition tolerance


CAP is a useful tool for understanding the behavior of a distributed system. It states that of the three fundamental characteristics of a distributed computing system, you may have any two but never all three. Its usefulness in designing and building distributed systems cannot be overstated. So, how can we use the knowledge of this theorem to our advantage?


For the designer of an enterprise-scale system, CAP provides a framework for deciding which tradeoffs must be made in our own implementations. But CAP not only lets us understand the systems we are building more precisely; it also provides a framework by which we can classify all systems. It is thus an invaluable tool when evaluating the systems that we rely on day in and day out in our enterprise systems.


As an example, let's analyze the traditional Relational Database Management System (RDBMS). The RDBMS, arguably one of the most successful enterprise technologies in history, has been around in its current form for nearly 40 years! The primary reason for its staying power lies in its ability to provide consistency. A consistent system is the easiest to understand and reason about, and therefore the most readily adopted (which explains the popularity of the RDBMS). But what of the other properties? An RDBMS provides availability, but only when there is connectivity between the client accessing the RDBMS and the RDBMS itself. Thus the RDBMS does not provide partition tolerance: if a partition arises between the client and the RDBMS, the system will not be able to function properly. In summary, we can characterize the RDBMS as a CA system, because it provides Consistency and Availability but not Partition tolerance.


As useful as this mechanism is, we can go one step further. Given that a system will always lack one of C, A, or P, mature systems commonly evolve a means of partially compensating for the missing CAP characteristic. In the case of our RDBMS example, there are several well-known approaches that can be employed to compensate for the lack of Partition tolerance. One of these is commonly referred to as master/slave replication. In this scheme, database writes are directed to a specially designated system, the master. Data from the master is then replicated to one or more additional, or slave, systems. If the master is offline, reads may be failed over to any of the surviving read-replica slaves.
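
The failover behavior described above can be sketched in a few lines of Python. Everything here is illustrative: class and method names are invented for the example, and real replication is asynchronous, lag-prone, and far more involved.

```python
class MasterSlaveStore:
    """Toy sketch of master/slave replication with read failover."""

    def __init__(self, num_slaves=2):
        self.master = {}                       # authoritative copy
        self.slaves = [{} for _ in range(num_slaves)]
        self.master_up = True

    def write(self, key, value):
        if not self.master_up:
            raise RuntimeError("master offline: writes unavailable")
        self.master[key] = value
        # Replication: a real system does this asynchronously, so slaves may lag.
        for slave in self.slaves:
            slave[key] = value

    def read(self, key):
        if self.master_up:
            return self.master[key]
        # Master offline: fail reads over to any surviving slave.
        for slave in self.slaves:
            if key in slave:
                return slave[key]
        raise KeyError(key)
```

Note the tradeoff baked into the sketch: while the master is down the system still answers reads (from possibly stale slaves) but refuses writes.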


Thus, in addition to characterizing systems by their CAP traits, we can further characterize them by identifying the recovery mechanism(s) they provide for the missing CAP trait. In the remainder of this article I classify a number of popular systems used today in enterprise and non-enterprise distributed systems. These systems are:

  • RDBMS
  • Amazon Dynamo
  • Terracotta
  • Oracle Coherence
  • GigaSpaces
  • Cassandra
  • CouchDB
  • Voldemort
  • Google BigTable

RDBMS

CAP: CA

Recovery Mechanisms: Master/Slave replication, Sharding


RDBMS systems are fundamentally about providing availability and consistency of data. The gold standard for RDBMS updates, known as ACID (Atomicity, Consistency, Isolation, Durability), governs the way in which consistent updates are recorded and persisted.


Various means of improving RDBMS performance are available in commercial systems. Due to the maturity of the RDBMS, these mechanisms are well understood. For example, the handling of conflicting reads and writes during the course of a transaction is governed by isolation levels. The commonly accepted set of isolation levels, in decreasing order of consistency (and increasing order of performance), is:

  • SERIALIZABLE
  • REPEATABLE READ
  • READ COMMITTED
  • READ UNCOMMITTED
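
To make the weakest levels concrete, here is a toy Python model of the dirty read that READ UNCOMMITTED permits and READ COMMITTED prevents. A real RDBMS enforces isolation with locks or MVCC; this sketch captures only the visibility rule, and all names are invented for the illustration.

```python
# Committed state, visible to everyone.
committed = {"balance": 100}

class Txn:
    def __init__(self):
        self.buffer = {}           # uncommitted writes, private to this txn

    def write(self, key, value):
        self.buffer[key] = value

    def commit(self):
        committed.update(self.buffer)
        self.buffer.clear()

def read(key, other_txn, isolation="READ COMMITTED"):
    # Under READ UNCOMMITTED we may see another transaction's dirty write.
    if isolation == "READ UNCOMMITTED" and key in other_txn.buffer:
        return other_txn.buffer[key]
    return committed[key]

t = Txn()
t.write("balance", 50)                                 # not yet committed
assert read("balance", t, "READ UNCOMMITTED") == 50    # dirty read
assert read("balance", t, "READ COMMITTED") == 100     # only committed data
```

If `t` later aborts instead of committing, the READ UNCOMMITTED reader has acted on data that never existed, which is exactly the hazard the stricter levels pay performance to avoid.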


Recovery mechanisms:

  • Master/Slave replication: A single master accepts writes, data is replicated to slaves. Data read from slaves may be slightly out of date, trading off some amount of Consistency to provide Partition tolerance.
  • Sharding: While not strictly limited to database systems, sharding is commonly used in conjunction with one. Sharding refers to the practice of separating the entire application into vertical slices that are 100% independent of one another. Once completed, sharding isolates the failure of any one system into a "swimlane"; it is one example of a "fault-isolative architecture", limiting the impact of any single failure, or related set of failures, to only one portion of an application. Sharding provides some measure of Partition tolerance by assuming that failures occur on a small enough scale to be isolated to a single shard, leaving the remaining shards operational.
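
The routing half of sharding can be sketched with a simple hash: each key deterministically maps to one shard, so a failure takes out only the keys that live there. The modulo scheme below is purely illustrative; real systems often use consistent hashing so that changing the shard count does not reshuffle every key.

```python
import hashlib

NUM_SHARDS = 4
shards = [dict() for _ in range(NUM_SHARDS)]

def shard_for(key):
    # Deterministic key -> shard mapping via a stable hash.
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

def put(key, value):
    shards[shard_for(key)][key] = value

def get(key):
    return shards[shard_for(key)][key]

put("user:42", {"name": "alice"})
assert get("user:42") == {"name": "alice"}
# Keys mapped to other shards remain readable even if this key's shard fails.
```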

Amazon Dynamo

CAP: AP

Recovery: Read-repair, application hooks


Amazon's Dynamo is a proprietary system designed and used solely by Amazon. Dynamo was intentionally designed to provide Availability and Partition tolerance, but not Consistency. The appearance of Dynamo was very nearly as seminal as the introduction of the CAP theorem itself. Owing to the dominance of the database, until Amazon introduced Dynamo to the world it was nearly taken for granted that enterprise systems must provide Consistency, and that the available tradeoff therefore lay between the remaining two CAP characteristics, Availability and Partition tolerance.


Examining the requirements for Amazon's Dynamo, it's clear why its designers chose to buck the trend: Amazon's business model depends heavily on availability. Even the simplest of estimates pegs the losses Amazon could suffer from an outage at a minimum of $30,000 per minute. Given that Amazon has nearly quadrupled in size since those estimates were made (in 2008), we can estimate that in 2010 Amazon might lose as much as $100,000 per minute. Put simply, availability matters a lot at Amazon. Furthermore, the fallacies of distributed computing tell us that the network is unreliable, so we must expect partitions to occur on a regular and frequent basis. It's a simple matter, then, to see that the only remaining CAP characteristic left to sacrifice is Consistency.


Dynamo provides an eventual consistency model, in which all nodes will eventually receive all updates during their lifetime.

Given a system composed of N nodes, the eventual consistency model is tuned as follows:

  • Setting the number of replicas needed for a successful write operation (W).
  • Setting the number of replicas needed for a successful read operation (R).

Setting W = N or R = N gives you a quorum-like system with strict consistency and no partition tolerance. Setting W and R such that W + R ≤ N relaxes that guarantee, trading strict consistency for Availability and Partition tolerance, since a read is no longer guaranteed to reach a node that has seen the latest write.
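
A small brute-force sketch in Python shows why R + W > N matters: only then is every set of R read replicas guaranteed to intersect every set of W write replicas, so a read always finds the latest value.

```python
import itertools

def stale_read_possible(n, r, w):
    """Check whether some write-set (size w) and read-set (size r)
    over n replicas can miss each other entirely."""
    nodes = range(n)
    for write_set in itertools.combinations(nodes, w):
        for read_set in itertools.combinations(nodes, r):
            if not set(write_set) & set(read_set):
                return True    # disjoint sets exist: a read can be stale
    return False

assert not stale_read_possible(3, r=2, w=2)  # R + W > N: sets always overlap
assert stale_read_possible(3, r=1, w=1)      # R + W <= N: stale reads possible
```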


Given that different nodes may have different versions of the same value (i.e., a value may have been written during a node downtime), Dynamo needs to:

  • Track versions and resolve conflicts.
  • Propagate new values.

Versioning is implemented using vector clocks: each value is associated with a list of (node, counter) pairs, updated every time a specific node writes that value; these clocks can be used to determine causal ordering and branching. Conflict resolution is done during reads (read repair), eventually merging values with diverging vector clocks and writing back.
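
The versioning scheme can be sketched as follows. This is a minimal illustration of the idea, not Dynamo's actual implementation; function names are invented for the example.

```python
def increment(clock, node):
    # A write by `node` bumps that node's counter in the clock.
    clock = dict(clock)
    clock[node] = clock.get(node, 0) + 1
    return clock

def descends(a, b):
    """True if the version with clock `a` causally follows (or equals) `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())

def compare(a, b):
    if descends(a, b):
        return "a >= b"
    if descends(b, a):
        return "b >= a"
    return "concurrent"          # diverged: a conflict read repair must merge

v1 = increment({}, "node1")      # {"node1": 1}
v2 = increment(v1, "node2")      # causally after v1
v3 = increment(v1, "node3")      # a sibling of v2, written independently

assert compare(v2, v1) == "a >= b"
assert compare(v2, v3) == "concurrent"
```

When `compare` reports "concurrent", the system cannot pick a winner automatically and must merge the values (or hand both back to the application), which is exactly the read-repair case described above.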

New values are propagated by using hinted handoff and Merkle trees.
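
A toy Merkle comparison illustrates why hash trees make replica synchronization cheap: replicas compare a single root hash first and descend only into buckets whose hashes differ. This two-level sketch is illustrative only; real implementations use deeper trees over key ranges.

```python
import hashlib

def leaf_hash(items):
    data = repr(sorted(items)).encode()
    return hashlib.sha256(data).hexdigest()

def merkle_root(buckets):
    # Hash each bucket, then hash the concatenation of bucket hashes.
    leaves = [leaf_hash(b.items()) for b in buckets]
    root = hashlib.sha256("".join(leaves).encode()).hexdigest()
    return root, leaves

replica_a = [{"k1": 1}, {"k2": 2}]
replica_b = [{"k1": 1}, {"k2": 99}]          # second bucket has diverged

root_a, leaves_a = merkle_root(replica_a)
root_b, leaves_b = merkle_root(replica_b)

assert root_a != root_b                      # one comparison detects drift
diverged = [i for i, (x, y) in enumerate(zip(leaves_a, leaves_b)) if x != y]
assert diverged == [1]                       # only bucket 1 needs syncing
```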


Terracotta

CAP: CA

Recovery: Quorum vote, majority partition survival


Terracotta is a Java-based distributed computing platform that provides high-level features such as caching via Ehcache and highly available scheduling via Quartz. Additional support for Hibernate second-level caching allows architects to easily adopt Terracotta in a standard Java EE architecture that relies on Spring, Hibernate, and an RDBMS.


Terracotta's architecture is similar to that of a database. Clients connect to one or more servers arranged into a "Server Array" layer. Updates are always Consistent in a Terracotta cluster, and availability is guaranteed so long as no partitions exist in the topology.


Recovery Mechanisms:

  • Quorum: Upon failure of a single server, a backup server may take over once it has received enough votes from cluster members to elect itself the new master.
  • Majority partition survival: In the event of a catastrophic partition involving many members of a Terracotta cluster, one that divides the cluster into two or more non-communicating partitions, the partition retaining a majority of the nodes is allowed to continue after a pre-configured period of time elapses.
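
The rule behind majority partition survival fits in one line: a side of the split continues only if it can still see a strict majority of the configured cluster, which guarantees that at most one side keeps running (avoiding split-brain). A sketch, with invented names:

```python
def may_continue(visible_nodes, cluster_size):
    # Strict majority: more than half of the originally configured cluster.
    return visible_nodes > cluster_size // 2

CLUSTER = 5
assert may_continue(3, CLUSTER)       # majority side stays up
assert not may_continue(2, CLUSTER)   # minority side halts
```

Note that with an even cluster size a clean 50/50 split halts both sides, which is why such systems are usually deployed with an odd number of voters.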

Oracle Coherence

CAP: CA

Recovery: Partitioning, Read-replicas


Oracle Coherence is an in-memory Java data grid and caching framework. Its main architectural emphasis is its ability to provide consistency (hence its name). All data in Oracle Coherence has at most one home. Data may be replicated to a configurable number of additional members in the cluster. When a system fails, replica systems vote on who becomes the new home for data that was homed on the failed system.


Coherence provides data-grid features that facilitate processing data using map-reduce-like techniques (executing the work where the data lives, instead of moving the data to the processing), and a host of distributed computing patterns are available in the Coherence Incubator.


Recovery Mechanism(s):

  • Data Partitioning: At a granular level, any one piece of data exhibits CA properties; that is to say, reads and writes of data in Coherence are always Consistent. As long as no partitions exist, data is available, meaning that for a particular piece of data, Coherence is not Partition tolerant. However, similar to the database sharding mechanism, data may be partitioned across the cluster nodes, meaning that a partition will only affect a subset of all data.
  • Read-replication: Coherence caches may be configured in varying topologies. When a Coherence cache is configured in read-replicated mode it exhibits CA. Data is consistent but writes block in the face of partitions.


GigaSpaces

CAP: CA or AP, depending on the replication scheme chosen

Recovery: Per-key data partitioning


GigaSpaces is a Java-based application server fundamentally built around the notion of space-based computing, an idea derived from tuple spaces, the core foundation of the Linda coordination language.


GigaSpaces provides high availability of data placed in the space by means of both synchronous and asynchronous replication schemes.


In synchronous replication mode, GigaSpaces provides Consistency and Availability: the system is consistent and available, but cannot tolerate partitions. In asynchronous replication mode, GigaSpaces provides Availability and Partition tolerance: the system is available for reads and writes, but is only eventually consistent (after the asynchronous replication completes).


Recovery Mechanism(s):

  • Per-key data partitioning - GigaSpaces supports a mode called Partitioned-Sync2Backup. This allows for data to be partitioned based on a key to lower the risk of shared fate and to provide a synchronous copy for recovery.


Apache Cassandra

CAP: AP

Recovery: Partitioning, Read-repair


Apache Cassandra was developed at Facebook using the same principles as Amazon's Dynamo, so it is no surprise that Cassandra's CAP traits match Dynamo's.


For read repair, Cassandra uses simple timestamps instead of the more complex vector clock implementation used by Amazon's Dynamo.
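
Timestamp-based resolution amounts to last-write-wins: each write carries a timestamp, and read repair keeps the value with the highest one. The sketch below is illustrative; note that, unlike vector clocks, this scheme silently discards concurrent writes with older timestamps rather than surfacing them as conflicts.

```python
def resolve(versions):
    """versions: list of (timestamp, value) pairs from different replicas.
    Last-write-wins: keep the value with the highest timestamp."""
    return max(versions, key=lambda tv: tv[0])[1]

replica_versions = [(1001, "draft"), (1005, "final"), (1003, "edited")]
assert resolve(replica_versions) == "final"
```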


Recovery Mechanism(s):

  • Partitioning
  • Read-repair


Apache CouchDB

CAP: AP

Recovery:


Apache CouchDB is a document-oriented database written in Erlang.


Voldemort

Link: http://project-voldemort.com

CAP: AP

Recovery: Configurable read-repair


Project Voldemort is an open-source distributed key-value store developed at LinkedIn and released as open source in February 2009. Voldemort exhibits characteristics similar to Amazon's Dynamo. It uses vector clocks for version detection and read repair.


Recovery Mechanism(s):

  • Read-repair with versioning using vector clocks.


Google BigTable

CAP: CA

Recovery:


Google's BigTable is, according to Wikipedia, "a sparse, distributed multi-dimensional sorted map", sharing characteristics of both row-oriented and column-oriented databases. It relies on the Google File System (GFS) for data replication.



This blog post is still a work in progress. There are many other systems worth evaluating, among them Terrastore, Erlang-based frameworks like Mnesia, and message-based systems such as Scala Actors and Akka. If you would like to see something else, please mention it in the comments.

Thanks also go to Sergio Bossa for assistance in writing this blog post.

30 comments:

Alex Miller said...

So the naive question from looking at the players in the field is when is Availability not important? All of these systems make the choice between consistency and partition tolerance and sometimes in the degree of availability.

Kudos on the recovery mechanisms here - seems like that is the shades of grey that makes this insightful.

Heath said...

Maybe batch computing systems? If there is a partition, they could just start over.

Gene Tani said...

You might look at Mongodb, in C++, and the 2 other erlang KV stores, scalaris, and riak, which i think lets you set N, R and W explicitly (riak only)

http://riak.basho.com/
http://code.google.com/p/scalaris/

Randall said...

You might also look at Infinispan, an open-source data grid implemented in Java that is AP (and eventually consistent).

Taylor said...

@Alex - I don't think your question is naive at all. When I started this post, I thought I had it all figured out. Then, as I was researching it, I realized that I couldn't identify any CP systems - what exactly does it mean to lose availability but retain consistency and partition tolerance? And if that is the case, is this theorem really useful? Because surely someone would have sacrificed A if it was useful to do so?

I suppose you could assume that a 404 or MySQL connection error amounts to CP (or some other more graceful loss of service), but obviously those aren't intentional (or desirable). So that really just leaves us with CA and AP systems, and the wide range in between that as you point out is the truly interesting bit - the recovery mechanisms or how the system deals with the failures.

@Gene - thanks for the suggestions I will look into these

@Randall - yes you are right Infinispan belongs in this list.

Sergio Bossa said...

@Alex

AFAIU, the point is that every system providing C and P, but not A, is pretty meaningless: it just means that it can tolerate lost messages between nodes (by definition of partition tolerance), but is allowed not to answer in bounded time (because it sacrifices availability).

Sergio Bossa said...

@Taylor

I think indeed that whatever response, even an error one (such as internal server error / connection error), implies availability.

Sergio Bossa said...

Moreover, I think that the discussion should include PNUTS: http://research.yahoo.com/node/2304

In particular because it provides an important "solution" for CA systems in order to recover from partitions: replication via messaging services and migration of requests.

Gary Berger said...

Great article, I would just comment that Gigaspaces has a mode called Partitioned-Sync2Backup. This allows for data to be partitioned based on a key to lower the risk of shared fate and to provide a synchronous copy for recovery.

Nati Shalom said...

Taylor

First of all congratulation on your new job - when did you move from Terracotta?

Overall a fairly balanced overview.

Having said that I did found the GigaSpaces part fairly inaccurate WRT to the recovery and consistency model, let me explain:

1. Each partition maintains a backup partition.

2. When a partition fails it recovers its state from the available partition. A primary election takes place in parallel to the recovery process in which the backup takes ownership if the primary fails.

3. Unlike most other alternatives when one of the nodes fails GigaSpaces will proactively provision a new instance on a new machine to ensure that the number of primary and backups meet the required SLA. In a cloud environment it will start a new machine.

4. On the client side failure will be implicitly routed to the "hot backup".

5. We provide something that is referred to as "reliable asynchronous replication. In that mode replication is synchronous to one node (the hot backup) and synchronous to the rest. That means that even in asynchronous replication data will not be lost in case of a failure.

6. In asynchronous mode we provide replication filter will enable you to track version conflicts.

7. GigaSpaces support map/reduce query implicitly or explicitly. Implicitly through our SQL Query support and explicitly through our Executor API. You can also use java.util.Concurrent API on top of our executor framework.

8. We provide transaction support and data affinity that are important mechanism to ensure consistency while minimizing the performance cost associated with it.


There are plenty more to add on that regard. Rather then continuing on that line I'd simply recommend reading our docs for those interested in more details.

Taylor said...

@Gary and @Nati,

Thanks for your comments I will incorporate them.

Stan Klimoff said...

Taylor,

From what I recall, partition tolerance stands for yielding correct behavior in face of network failures.

* If a system is strongly consistent (C), different partitions must yield consistent data even in face of partitioning.

* If a system is not strongly consistent (~C, regardless how 'eventual' is the degree of consistency is), different partitions can yield somewhat inconsistent data — but this is considered to be correct behavior.

Now, let's contemplate a system that is strongly consistent and partition-tolerant (i. e. consistency does not break in face of network errors -> no split-brain behavior). One possible way to achieve consistency in face of network partition is to block the transaction until the quorum is restored. I would consider blocking for indefinite period of time to be an availability sacrifice, so here you have a CP system.

My guess is that all systems based on two-phase commit protocol would exhibit this behavior — e. g. usual replicated RDBMS.

What do you think?

Nati Shalom said...

Correction on item 5 on my earlier response:

"5. We provide something that is referred to as "reliable asynchronous replication". In that mode replication is synchronous to one node (the hot backup) and asynchronous to the rest. That means that even in asynchronous replication data will not be lost in case of a failure."

Nati Shalom said...

@Taylor

"Thanks for your comments I will incorporate them."

I still see Recovery:None under GigaSpaces.

Artur said...

First - cap theorem doesn't state that you always get 2 of those possibilities. I can easily created distributed system which doesn't exhibit any of those properties.

Second - it is about _distributed_ systems. RDBMS by is not a distributed system on it's own. You can think about application + RDBMS-with-mirroring + hot swap to mirror in case of master failure as a distributed system. If your mirroring is not blocking main DB till it receives commit ack from slave, such system will not be consistent. It will be partition tolerant (if application is cut from master it can switch to slave, losing few updates; if master is cut from slave, application doesn't even notice). It will be also available, because none of the systems has to wait in case of failure of other. So, such setup will be AP, not CA if you look at RDBMS part only. In truth, it will be not even P, as application being cut from both databases cannot do much - unless you implement near cache for transactions, but then we are in yet another complication...

I think that best article taking about differences between A and P is here
http://pl.atyp.us/wordpress/?p=2521

Taylor said...

@Stan,

Yes, I like your thinking here, however 2PC isn't traditionally put into play for replication, but for transactions that span multiple resources - your point is otherwise valid.

@Artur - thanks for the good link. I don't agree that an RDBMS requires a replica to be considered a distributed system - for all intents and purposes, all RDBMS' are accessed via a network connection and hence even though you have only 2 actors they are still separated by a network that is susceptible to failure and should be considered a distributed system. This is even more evident when you add multiple clients connecting to a single RDBMS in which case your topology is a star or hub pattern. And as you state, if we introduce a local write-back cache into the client, then the client can survive a partition and we have now changed the overall system from CA to AP (note that a write-through would still be CA).

Nati Shalom said...

@Taylor

Thanks updating the GigaSpaces part.

Can you elaborate what do you mean by:

"The system is consistent and available, but can not tolerate partitions"

Chaker said...

Thanks for this article Taylor.

How would you categorize Redis?


Jonas Bandi said...

Some of the info here is quite contradictory to the following blog post:

Visual Guide to NoSQL Systems

... and I tend to believe the classification in that post more than the classification found here.
