Wednesday, June 20, 2012

Understanding Cassandra's consistency and conflicts

After a long time, here comes another technical entry into my blog. I have been playing around with Cassandra trying to understand it as a system and one of the things that had often come up in many forums is the difficulty in understanding Cassandra's consistency. In this post, I hope to consolidate what I inferred from various forums, solutions and problems extended by many others who have been working with Cassandra. I intend this post to be useful to someone who has already had a flavor of what Cassandra is about and is familiar with the fundamental concepts of Distributed Systems and data stores. If not, I strongly recommend skimming through some of the basics here before proceeding further with this post.

A different ACID

Consistency in Cassandra doesn't directly relate to the 'C' in ACID. Consistency in traditional database systems refers to transactional consistency which ensures the correctness of the state for a given DB transaction. Whereas, in Cassandra, consistency of data across its replicas. In fact, it would be simpler to view this as the consistency of data that can be observed by an external client or client side consistency.  The consistency that is observed within Cassandra cluster (which might be different from that observed by an external client) can be defined as server side consistency. Cassandra does not provide transactional consistency (across multiple reads/writes) which is traded off for the higher speed and scalability. If essential, transactions and transactional consistency has to be implemented by the client. 

No rollbacks! 

An important implication of the above fact is that a Cassandra cluster (or simply cluster) could have partial writes (or writes in progress) but would not provide a roll-back mechanism for any (potentially) failed operations. For example, consider a Cassandra cluster with 3 nodes (N1, N2, N3), a replication factor (RF) of 3 and Read-Write Consistency Level (CL) of 2. Consider a write to X  is initiated on nodes N1,N2 and node N1 fails while the write is in progress. The write to N2 would succeed, a timeout is reported to client, but the write on N2 is not rolled-back as would have been the case with traditional databases. Fixing this inconsistency by retrying the failed operation or any corrective mechanism is the responsibility of the client. I believe, the only case of a true failure reported by Cassandra is when not enough nodes in the cluster are live for a given operation.  

Eventually consistent

"So what you are saying is, Cassandra cannot provide any consistency guarantees what-so-ever." No - this is a common misconception of many people that I have observed in many a user forum. Cassandra is eventually consistent. Huh!? Okay. Let me put it this way. Cassandra can be as consistent as you want it to be. The condition for strong consistency is 

R + W > N, where 
N - Number of replicas
W - Number of nodes that need to agree for a successful write
R - Number of nodes that need to agree for a successful read

And if R + W <= N, we say that the cluster is configured to have weak consistency or is eventually consistent. For example, consider the Figure 1. The system has a Read consistency (R CL) of quorum and a Write consistency (W CL) of ANY ( at least 1) and is therefore said to be eventually consistent. Since writes can succeed with just one node,  (W3) write 3 to N1 at time T0 and  (W5) write 5 to N2 at time T1 can happen independently to the same variable X. However, we can see from right hand side of the figure that a read  (R CL = quorum) at time T2 can result in different values depending on the set of nodes (N1, N2) or (N1,N3) which are chosen to serve the read request. 

Okay, I see that there can be inconsistency. But will it always remain so? And the answer is No - thanks to the read repairs that happens on the background. In both the cases illustrated here the read repairs (shown in purple) will ensure that subsequent reads will have converged on the same value for X. And this is why we say that Cassandra is eventually consistent.

Strong Consistency

Alright, now what would happen if I were to have R + W > N. Let us consider the extreme case where W CL = ANY and R CL = ALL shown in Figure 2. In this case, for the read to succeed, all replicas need to be in agreement and therefore have to be consistent before we respond back to the client.

Conflict resolution

Hold on, how did you decide that 5 and not 3 is the correct value? I didn't, Cassandra did. To resolve conflicts, all columns in Cassandra has a time stamp associated with it. Since T1 > T0 in our example, 5 becomes a later write and is therefore assumed to be correct. It is therefore evident that the nodes in the Cassandra cluster need to be synchronized in their measure of time to be semantically correct.

I thought Cassandra used vector clocks, no? After going through a number of threads and forums, I realized that this is not true. Vector clocks and version vectors are popular methods used for conflict identification. However, Cassandra already employs a per column time stamp for resolving conflicts thereby obviating the need for a causal ordering that is provided by the vector clocks.

Okay, I have synchronized my clocks. But what if I have a truly concurrent write with the same time stamp? In the unlikely case that you precisely end up with two time stamps that match in its microsecond, you might end up with a bad version but Cassandra ensures that ties are consistently broken by comparing the byte values.

Partial writes

Done. I did my math and made my cluster strongly consistent. Am I safe? The answer is both yes and no. This subtle but interesting scenario comes up when we have failures, which brings in the notion of partial writes. Consider Figure 3 which is the same example as shown before, but at time T1, the node N2 is disconnected from the cluster momentarily due to which the W5 gets timed out. This would mean that the a value of 5 is written to N2 but the write operation is not successful yet as it could not meet the required consistency level. Thereby N2 has the updated value while N1 and N3 have the older values. At a later time T2 (when N2 is back in cluster), the read can give different results based on the nodes which serve the reads.

Wait! Are you saying that we do not have strong consistency? No. We are still running strongly consistent, but there is a little non-determinism in the system. Consider Case 1 where the read goes to (N1,N3) and the value of 3 is returned after nodes N1 and N3 arrive at a consensus. It should be noted that the W5 is still in flight or in a timeout which is being handled by the client. In other words, the write is still in progress. Therefore it is semantically correct that the previous value 3 be returned.  In case 2, the read goes to (N1, N2). Here N2 has a more recent value (time stamp, remember?) and a consensus is reached with N1 before the value 5 is  returned. Now the subtlety - in this process, the W5 which was still in progress gets completed. Since W5 is now complete, the value returned (5),  is semantically the correct one. The read repairs shown by the purple lines happen asynchronously in both cases. So, despite the non-determinism we see in such cases, the semantics of consistency is still maintained and the conditions for strong consistency described earlier hold! 

To summarize, consistency in Cassandra is (a) different from that of transactional consistency (b) can be eventually or strongly consistent. A cliched conclusion to the post, I know, but I hope to have discussed some of the finer aspects that help understand what consistency means in Cassandra. Though some of these design decisions made in Cassandra incur additional effort for the developer, it keeps Cassandra simple and focused on its primary purpose - store and deliver data at blinding speeds. And trust me, it does that! 


Phil said...

I disagree that you can classify Cassandra as strongly consistent if it can't tolerate failures, especially as a distributed environment.

The label of providing strongly consistent data should only be used on systems which use a proven consensus algorithm such as Google's Spanner using Paxos.

That being said, good summary.

Mighty Titan said...

A very valid point. The fundamental point at which this classification breaks is when Cassandra leaks a failed write.

The definition of consistency is an absolute mess with many systems and implementations coming up with their own nomenclature and definitions. The closest fit for cassandra's model that I could find is per-key sequential consistency (only when r + w > n). That the authors of cassandra claim eventual consistency is based on this definition - A system is eventually consistent if at point of time the writes were stopped and ALL the reads after a certain interval of time (read eventually) observe the value written by the last write.

Phil said...

I think even per-key sequential consistency is too strong of a classification, although it might behave that way under good conditions.

If you W=1 and R=all, but the node that hold's the new write fails, you will have a permutation of operations that isn't sequential anymore.

I agree that the system is eventually consistent, but I also believe with it's current architecture that's the strongest guarantee that the system can provide.

Well now that I think about it some more, perhaps if you use W=R=quorum then that might tolerate a few failures, but I still think a real consensus algorithm needs to be used to provide more than eventually consistency.

Mighty Titan said...

Absolutely. As one of my example shows, a failed write that leaks a value can result in non-deterministic read value for until a subsequent read correction corrects it.

I'm curious why you'd think the per-key consistency breaks when W=1 and R=N, since ANY read will ensure that the values are consistent (and latest too) before responding back.

While I agree that a consensus algorithm needs to provide better guarantees, I think Cassandra was designed in a different spirit, trading off stronger guarantees for better speed and simplicity. It is interesting that you compared this with Spanner (I'm a big fan of this particular work of Google!). While Spanner's consistency model is quite robust, instinctively, I do not think it can perform as fast or scale as seamlessly as a cassandra cluster (despite the neat trick of co-location using their tablets).

Phil said...

I agree that the motivation behind the systems are very different, I'm just currently thinking about the true consistency guarantees of the system and potentially modifying it for a research project. Maybe a way to combine the speed of Cassandra while occasionally offering a stronger guarantee.

I'm still learning the architecture and behavior of Cassandra, so excuse my misunderstanding if this situation is not possible...

I was picturing a scenario where a few of the nodes die, so the other nodes need to replicate the particular object back to N.

However the system won't know exactly when the failed replicas resume, which might introduce N+1 or so replicas of the object in the system. Then my previously stated situation would occur when the client is only aware of N nodes, writes to 1, fails, and then hears back from the other N.

How does Cassandra handle replication when nodes go down?

Mighty Titan said...

A very nice write up on the various definitions of consistency - from the horse's mouth!

Thanks Dr.Vogels !

Divit said...

Excellent blogs!!!!you have for sharing them effect information..we developer very learning to easy

Cassandra Training Courses