Designing for Eventual Consistency
Designing software is like solving a puzzle - you have initial clues (i.e., business requirements and key quality goals), and you need to find the correct solution out of a multitude of possible choices.
Each choice comes with a downside, and picking the right one early is critical to project success - changing the architecture becomes far more costly once the first line of code is written.
More often than not, the choice is not cut and dried, and the final system requires a careful balance of several conflicting options.
One of the better-known choices when designing distributed systems is the trade-off between consistency and availability of data in the event of a network partition, while also accounting for latency.
Revisiting CAP theorem
This choice was first summarized in the CAP theorem, presented by Dr. Eric Brewer in 2000: when choosing between consistency, availability, and tolerance to network partitions, you can have at most two of the three properties for any shared-data system.
Here, consistency means that every request returns the most up-to-date and accurate data. If that data is not available, the system returns an error, resulting in low availability.
Availability means that every request receives a response. However, there is no guarantee that the response contains the most current data.
Because very few systems consist of a single node (where the database and the application code coexist), tolerance to network partitions is a given, except in rare special cases. It would be undesirable to have a nonfunctioning system because one node went offline due to network issues, hardware failures, or software updates.
However, while network partitions are a given, they are rare, with SLAs boasting 99.9% uptime guarantees. Does that mean you can design for both consistency and availability 99.9% of the time?
PACELC
The PACELC theorem, authored by Daniel Abadi in 2012, expands on CAP by introducing the concept of latency caused by data replication between distributed nodes. It states: if a partition occurs (P), there is a choice between availability (A) and consistency (C); else (E), the choice is between latency (L) and consistency (C).
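To make the two branches concrete, here is a minimal sketch of PACELC as a decision function. The names (`Preference`, `pacelc_choice`) are invented for illustration and do not come from any library:

```python
from enum import Enum

class Preference(Enum):
    AVAILABILITY = "A"  # keep answering requests, even if the answer may be stale
    CONSISTENCY = "C"   # wait for (or refuse without) the most up-to-date data
    LATENCY = "L"       # answer fast from the nearest replica, possibly stale

def pacelc_choice(partition_detected: bool, favor_consistency: bool) -> Preference:
    """Illustrative only: P -> choose A or C; Else -> choose L or C."""
    if partition_detected:  # the "PAC" branch
        return Preference.CONSISTENCY if favor_consistency else Preference.AVAILABILITY
    # the "ELC" branch: no partition, so the trade-off is latency vs consistency
    return Preference.CONSISTENCY if favor_consistency else Preference.LATENCY
```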
Data Availability and Replication
High availability assumes some form of data replication. Here are a few reasons:
- Geographical proximity. If your system serves users all over the world, it makes sense to reduce network latency by using the closest data storage.
- Disaster recovery. If one node experiences a failure (which is a probable event), high availability can only be guaranteed if another node contains a copy of the data.
- Real-time reporting and analytics, which require replicating and aggregating data into dedicated reporting or analytics databases.
Regardless of the replication method, moving data across the network does not happen instantaneously, and depending on the amount of data involved, the delay may be significant. If availability and low latency are more critical, the system cannot wait for replication to complete, which in turn lowers consistency.
It's important to note that replication is also essential for consistency: replicas can only agree if updates reach all of them.
Eventual Consistency
When multiple copies of data are present, they may be identical or different.
Highly available systems follow an optimistic replication strategy that allows data sets to diverge. When divergence happens, the data will reach eventual consistency - meaning that at some point in time, all pending updates will be applied and the replicas will converge. Eventual consistency does not guarantee when convergence will occur.
In contrast, pessimistic replication attempts to enforce an exact match between all replicas at all times.
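To illustrate the difference, here is a rough sketch (using an invented in-memory `Replica` class, not any particular database) of how the two strategies handle a write: the pessimistic path acknowledges only after every replica has applied the change, while the optimistic path returns after the local write and propagates in the background.

```python
import threading
from typing import Dict, List

class Replica:
    """Stand-in for a node holding a copy of the data."""
    def __init__(self) -> None:
        self.data: Dict[str, str] = {}

    def apply(self, key: str, value: str) -> None:
        self.data[key] = value  # in a real system this is a network call that can fail or lag

def pessimistic_write(replicas: List[Replica], key: str, value: str) -> None:
    """Synchronous replication: return only after all replicas match.
    Consistent, but latency is dictated by the slowest replica."""
    for replica in replicas:
        replica.apply(key, value)

def optimistic_write(replicas: List[Replica], key: str, value: str) -> None:
    """Asynchronous replication: acknowledge after the local write and propagate
    in the background. Fast, but replicas may diverge until propagation completes -
    this is where eventual consistency lives."""
    local, *remote = replicas
    local.apply(key, value)
    for replica in remote:
        threading.Thread(target=replica.apply, args=(key, value), daemon=True).start()
```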
Eventual Consistency Examples
Offline processing - allowing changes to a data replica
Back in the early days of mobile phones, I used to write applications for BlackBerry devices whose customer base consisted of traveling reps. Wireless connectivity was lacking, especially in the middle of farm fields - and without it, reps had no way to pull up their calendar to review the next appointment or to get the contact information of nearby customers. In a way, a network partition was a guaranteed event, and availability was crucial - without it, the application was useless.
Before the trip, the mobile app would download a subset of the master data based on the desired geographical location (to save space and transfer time). If a loss of connectivity was detected, the mobile device would switch to pulling data from local storage instead of trying to contact the server. The rep could view, edit, and add data while working with the local copy.
While the data was available, it was not consistent. Connected users could have modified the records the rep was looking at, and there was no way of detecting the change. New appointments entered in disconnected mode could conflict with meetings created in connected mode. The longer the trip through the fields, the further the datasets diverged.
When the rep entered a connected zone, a sync event was triggered. Changes made to the records during the trip were reconciled with the master records, restoring consistency between the primary dataset and the local subset.
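The reconciliation step can be sketched roughly as a last-write-wins merge keyed by record timestamps. Real conflict resolution is usually richer than this (field-level merges, user prompts), so treat it only as an illustration of the moment the replicas converge:

```python
from dataclasses import dataclass
from typing import Dict

@dataclass
class Record:
    value: str
    updated_at: float  # Unix timestamp of the last edit

def reconcile(master: Dict[str, Record], local: Dict[str, Record]) -> Dict[str, Record]:
    """Merge offline edits into the master copy using last-write-wins."""
    merged = dict(master)
    for key, local_record in local.items():
        master_record = merged.get(key)
        if master_record is None or local_record.updated_at > master_record.updated_at:
            merged[key] = local_record  # the offline edit wins; otherwise keep the master copy
    return merged
```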
DNS change propagation - syncing local replicas
Make any change to a domain, and an alert will inform you that it may take up to 24 hours for the change to take effect. In most cases, the change is applied almost instantaneously, but sometimes cached records need to be purged and rebuilt first. Eventually, the data will be synchronized across all DNS servers.
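The delay comes from caching: resolvers keep a record until its TTL expires and only then fetch the new value. A toy sketch of that mechanism (not a real resolver) looks like this:

```python
import time
from typing import Dict, Optional, Tuple

class TtlCache:
    """Serves cached answers until they expire - the reason DNS changes take time to propagate."""

    def __init__(self) -> None:
        self._entries: Dict[str, Tuple[str, float]] = {}  # name -> (value, expires_at)

    def put(self, name: str, value: str, ttl_seconds: int) -> None:
        self._entries[name] = (value, time.time() + ttl_seconds)

    def get(self, name: str) -> Optional[str]:
        entry = self._entries.get(name)
        if entry is None:
            return None
        value, expires_at = entry
        if time.time() >= expires_at:  # stale entry: drop it and force a fresh lookup upstream
            del self._entries[name]
            return None
        return value  # possibly out of date, but returned with no upstream round trip
```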
IoT sync up - updating the primary copy
If a few IoT sensors lose their network connection, data may still be collected and stored locally (up to a point, of course, since the storage capacity of IoT devices is limited). The central system still displays aggregated data, which will eventually be updated with the missing data points once the network connection is restored.
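A sensor in this setup could be sketched roughly as follows: buffer readings in bounded local storage while disconnected, then back-fill the central system on reconnect. The `send` callback and the capacity limit are assumptions made for the example:

```python
import collections
from typing import Callable, Deque, Tuple

Reading = Tuple[float, float]  # (timestamp, measured value)

class BufferingSensor:
    """Keeps collecting while offline and back-fills the primary copy later."""

    def __init__(self, send: Callable[[Reading], None], capacity: int = 10_000) -> None:
        self._send = send  # uploads a single reading to the central system
        self._buffer: Deque[Reading] = collections.deque(maxlen=capacity)  # bounded local storage
        self.connected = True

    def record(self, reading: Reading) -> None:
        if self.connected:
            self._send(reading)
        else:
            self._buffer.append(reading)  # oldest readings are dropped once capacity is reached

    def on_reconnect(self) -> None:
        self.connected = True
        while self._buffer:  # eventually bring the aggregated view up to date
            self._send(self._buffer.popleft())
```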
Going beyond the data
The choice between consistency and availability (or latency) applies to your entire system, not just its data.
For example, what should happen during software updates?
Let's say you have a network of IoT devices. For various reasons, some of them may be disconnected from the network.
When a software update is pushed, those units will not receive it. The software update may contain changes to the data collection schema. Should you choose to enforce consistency (i.e., force a device to be connected and updated before it can operate) or availability (allowing functionality to continue in disconnected mode on outdated software, with the understanding that eventually the schema will be identical across all devices)?
Choosing availability, in this case, will trigger other choices - for example, ensuring backward compatibility.
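Backward compatibility here means the central system must accept data shaped by either schema version. A minimal sketch, with schema and field names invented for the example:

```python
from typing import Any, Dict, Optional

def parse_reading(payload: Dict[str, Any]) -> Dict[str, Optional[float]]:
    """Accept payloads from devices running either schema version.

    Hypothetical schemas: version 1 reports a single 'temp' field; version 2
    (shipped with the update) renames it to 'temp_c' and adds 'humidity'.
    """
    version = payload.get("schema_version", 1)
    if version == 1:
        return {"temp_c": payload["temp"], "humidity": None}  # old devices keep operating
    return {"temp_c": payload["temp_c"], "humidity": payload.get("humidity")}
```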
Design approach thoughts
- The same system may have a mix of approaches. For example, you may decide that the authorization module must favor consistency over availability. In contrast, you may choose to make the module responsible for providing weather information or stock quotes highly available and eventually consistent.
- Selecting a technical stack that works with the desired approach makes implementation significantly more straightforward. Choosing an ACID database engine gives up availability and latency in favor of consistency, while BASE databases sacrifice consistency in the event of a partition.
- Provide valuable reference points in the UI by displaying timestamps (for example, "Data as of" or "Updated at"). That helps inform the user that the data may be stale.
- When the user creates a new record and a network partition event occurs, display the local copy instead of going back to the server until the data can be reconciled (see the sketch after this list).
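The last two points can be sketched together: keep an "updated at" timestamp with each locally held record and surface it in the UI, and flag records created during a partition as pending reconciliation. The names and fields below are assumptions made for illustration:

```python
import time
from dataclasses import dataclass, field
from typing import Dict

@dataclass
class DisplayRecord:
    data: Dict[str, str]
    updated_at: float = field(default_factory=time.time)
    pending_sync: bool = False  # True while the server copy has not yet been reconciled

    def banner(self) -> str:
        """Text the UI can show so the user knows the data may be stale."""
        stamp = time.strftime("%Y-%m-%d %H:%M", time.localtime(self.updated_at))
        suffix = " (not yet synced)" if self.pending_sync else ""
        return f"Data as of {stamp}{suffix}"

def create_record_offline(store: Dict[str, DisplayRecord], key: str, data: Dict[str, str]) -> DisplayRecord:
    """During a partition, keep showing the local copy until it can be reconciled."""
    record = DisplayRecord(data=data, pending_sync=True)
    store[key] = record
    return record
```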
Making a choice
High availability carries the risk that stale data diverges too far from reality - that it becomes difficult or impossible to reconcile without conflicts, or that important data points go missing.
However, not providing availability may carry an even higher risk. The choice here is less technical and must be driven by business requirements and risk assessment.