Wednesday, June 20, 2012

Understanding Cassandra's consistency and conflicts

After a long time, here comes another technical entry into my blog. I have been playing around with Cassandra trying to understand it as a system and one of the things that had often come up in many forums is the difficulty in understanding Cassandra's consistency. In this post, I hope to consolidate what I inferred from various forums, solutions and problems extended by many others who have been working with Cassandra. I intend this post to be useful to someone who has already had a flavor of what Cassandra is about and is familiar with the fundamental concepts of Distributed Systems and data stores. If not, I strongly recommend skimming through some of the basics here before proceeding further with this post.

A different ACID

Consistency in Cassandra doesn't directly relate to the 'C' in ACID. Consistency in traditional database systems refers to transactional consistency which ensures the correctness of the state for a given DB transaction. Whereas, in Cassandra, consistency of data across its replicas. In fact, it would be simpler to view this as the consistency of data that can be observed by an external client or client side consistency.  The consistency that is observed within Cassandra cluster (which might be different from that observed by an external client) can be defined as server side consistency. Cassandra does not provide transactional consistency (across multiple reads/writes) which is traded off for the higher speed and scalability. If essential, transactions and transactional consistency has to be implemented by the client. 

No rollbacks! 

An important implication of the above fact is that a Cassandra cluster (or simply cluster) could have partial writes (or writes in progress) but would not provide a roll-back mechanism for any (potentially) failed operations. For example, consider a Cassandra cluster with 3 nodes (N1, N2, N3), a replication factor (RF) of 3 and Read-Write Consistency Level (CL) of 2. Consider a write to X  is initiated on nodes N1,N2 and node N1 fails while the write is in progress. The write to N2 would succeed, a timeout is reported to client, but the write on N2 is not rolled-back as would have been the case with traditional databases. Fixing this inconsistency by retrying the failed operation or any corrective mechanism is the responsibility of the client. I believe, the only case of a true failure reported by Cassandra is when not enough nodes in the cluster are live for a given operation.  

Eventually consistent

"So what you are saying is, Cassandra cannot provide any consistency guarantees what-so-ever." No - this is a common misconception of many people that I have observed in many a user forum. Cassandra is eventually consistent. Huh!? Okay. Let me put it this way. Cassandra can be as consistent as you want it to be. The condition for strong consistency is 

R + W > N, where 
N - Number of replicas
W - Number of nodes that need to agree for a successful write
R - Number of nodes that need to agree for a successful read

And if R + W <= N, we say that the cluster is configured to have weak consistency or is eventually consistent. For example, consider the Figure 1. The system has a Read consistency (R CL) of quorum and a Write consistency (W CL) of ANY ( at least 1) and is therefore said to be eventually consistent. Since writes can succeed with just one node,  (W3) write 3 to N1 at time T0 and  (W5) write 5 to N2 at time T1 can happen independently to the same variable X. However, we can see from right hand side of the figure that a read  (R CL = quorum) at time T2 can result in different values depending on the set of nodes (N1, N2) or (N1,N3) which are chosen to serve the read request. 

Okay, I see that there can be inconsistency. But will it always remain so? And the answer is No - thanks to the read repairs that happens on the background. In both the cases illustrated here the read repairs (shown in purple) will ensure that subsequent reads will have converged on the same value for X. And this is why we say that Cassandra is eventually consistent.

Strong Consistency

Alright, now what would happen if I were to have R + W > N. Let us consider the extreme case where W CL = ANY and R CL = ALL shown in Figure 2. In this case, for the read to succeed, all replicas need to be in agreement and therefore have to be consistent before we respond back to the client.

Conflict resolution

Hold on, how did you decide that 5 and not 3 is the correct value? I didn't, Cassandra did. To resolve conflicts, all columns in Cassandra has a time stamp associated with it. Since T1 > T0 in our example, 5 becomes a later write and is therefore assumed to be correct. It is therefore evident that the nodes in the Cassandra cluster need to be synchronized in their measure of time to be semantically correct.

I thought Cassandra used vector clocks, no? After going through a number of threads and forums, I realized that this is not true. Vector clocks and version vectors are popular methods used for conflict identification. However, Cassandra already employs a per column time stamp for resolving conflicts thereby obviating the need for a causal ordering that is provided by the vector clocks.

Okay, I have synchronized my clocks. But what if I have a truly concurrent write with the same time stamp? In the unlikely case that you precisely end up with two time stamps that match in its microsecond, you might end up with a bad version but Cassandra ensures that ties are consistently broken by comparing the byte values.

Partial writes

Done. I did my math and made my cluster strongly consistent. Am I safe? The answer is both yes and no. This subtle but interesting scenario comes up when we have failures, which brings in the notion of partial writes. Consider Figure 3 which is the same example as shown before, but at time T1, the node N2 is disconnected from the cluster momentarily due to which the W5 gets timed out. This would mean that the a value of 5 is written to N2 but the write operation is not successful yet as it could not meet the required consistency level. Thereby N2 has the updated value while N1 and N3 have the older values. At a later time T2 (when N2 is back in cluster), the read can give different results based on the nodes which serve the reads.

Wait! Are you saying that we do not have strong consistency? No. We are still running strongly consistent, but there is a little non-determinism in the system. Consider Case 1 where the read goes to (N1,N3) and the value of 3 is returned after nodes N1 and N3 arrive at a consensus. It should be noted that the W5 is still in flight or in a timeout which is being handled by the client. In other words, the write is still in progress. Therefore it is semantically correct that the previous value 3 be returned.  In case 2, the read goes to (N1, N2). Here N2 has a more recent value (time stamp, remember?) and a consensus is reached with N1 before the value 5 is  returned. Now the subtlety - in this process, the W5 which was still in progress gets completed. Since W5 is now complete, the value returned (5),  is semantically the correct one. The read repairs shown by the purple lines happen asynchronously in both cases. So, despite the non-determinism we see in such cases, the semantics of consistency is still maintained and the conditions for strong consistency described earlier hold! 

To summarize, consistency in Cassandra is (a) different from that of transactional consistency (b) can be eventually or strongly consistent. A cliched conclusion to the post, I know, but I hope to have discussed some of the finer aspects that help understand what consistency means in Cassandra. Though some of these design decisions made in Cassandra incur additional effort for the developer, it keeps Cassandra simple and focused on its primary purpose - store and deliver data at blinding speeds. And trust me, it does that!