Almost everything we interact with today is a “distributed system” — microservices, cloud-native applications, databases, message queues. We are constantly building systems that span multiple machines. But this brings a fundamental challenge: how do we ensure that all these independent nodes agree on shared state?
This is where consensus algorithms like Raft come in. I cover Raft itself in a separate write-up; in this one, let me help you understand why consensus is needed at all and why it is such a tough nut to crack.
Why Agreement is Hard
It Does Seem Simple
At first glance, getting multiple machines to agree seems straightforward. Can’t we just have one machine make decisions and tell the others? Or use timestamps to order operations? Unfortunately, distributed systems introduce complexities that make these naive approaches fail catastrophically.
Things That Can Go Wrong
Distributed systems face several simultaneous challenges; here are some of them.
Network Partitions: Networks can split, leaving groups of nodes unable to communicate. Unlike a simple “network down” scenario, partitions create islands of nodes that continue operating independently.
Node Failures: Machines crash, sometimes permanently, sometimes temporarily. They might restart with partial memory loss or a corrupted state.
Asynchronous Communication: Message delivery times are unpredictable. A message sent first might arrive second, or might never arrive at all.
Clock Skew: Different machines have different notions of time, making timestamp-based ordering unreliable.
Naive Majority Voting: Sounds simple, but what happens during a network partition? If each group counts a majority only among the nodes it can reach, multiple groups can each believe they hold the majority.
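To make the last point concrete, here is a minimal sketch (names and cluster size are illustrative, not from any real system) contrasting a strict majority quorum — more than half of the *whole* cluster — with the naive approach of counting only reachable nodes:

```python
# Hypothetical sketch: strict majority quorum vs. the naive
# "count the nodes I can reach" approach during a partition.

CLUSTER_SIZE = 5

def has_quorum(reachable: int, cluster_size: int = CLUSTER_SIZE) -> bool:
    """Strict majority: more than half of ALL nodes, reachable or not."""
    return reachable > cluster_size // 2

# A partition splits the 5 nodes into groups of 3 and 2.
side_a, side_b = 3, 2
assert has_quorum(side_a)
assert not has_quorum(side_b)       # at most one side can hold a strict majority

# The naive bug: each side measures "majority" against the nodes it can see.
def naive_has_quorum(reachable: int) -> bool:
    return reachable > reachable // 2   # the partition looks like the whole cluster

assert naive_has_quorum(side_a) and naive_has_quorum(side_b)  # both sides claim the majority!
```

Because two disjoint groups cannot both contain more than half of the full cluster, the strict-majority rule guarantees at most one side of any partition can act — which is exactly what quorum-based algorithms rely on.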
The Bank Account Problem
Imagine a distributed banking system with account data replicated across three servers in different data centers. A customer has $1000 and tries to withdraw $800. Simultaneously, their spouse tries to withdraw $500 from a different location.
Without proper consensus:
- Server A processes the $800 withdrawal: balance becomes $200
- Server B processes the $500 withdrawal: balance becomes $500
- Server C might see both operations in different orders
You can think of servers A, B, and C as geographically distributed database nodes, all of them capable of accepting and handling writes.
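The divergence is easy to reproduce. Here is a minimal sketch (a toy `Replica` class, not a real database) of three replicas that each accept writes independently and check only their local balance:

```python
# Hypothetical sketch: the bank scenario above, with replicas that
# accept writes independently and never coordinate.

class Replica:
    def __init__(self, balance: int):
        self.balance = balance

    def withdraw(self, amount: int) -> bool:
        if amount <= self.balance:   # local check only -- no coordination
            self.balance -= amount
            return True
        return False

a, b, c = Replica(1000), Replica(1000), Replica(1000)

a.withdraw(800)                   # customer withdraws via server A
b.withdraw(500)                   # spouse withdraws via server B
c.withdraw(500); c.withdraw(800)  # server C sees both, spouse's request first

# The replicas have diverged, and $1300 in withdrawals was approved
# against a $1000 account.
assert (a.balance, b.balance, c.balance) == (200, 500, 500)
```

Each replica made a locally valid decision, yet globally the system approved more money than existed and no two replicas agree on the balance.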
Now you can see why consensus is a legitimate real-world problem.
What Consensus Algorithms Actually Solve
The Core Guarantee: Safety and Liveness
Consensus algorithms (like Paxos or Raft) provide two fundamental guarantees:
Safety: Nothing bad ever happens. In practical terms:
- All nodes that decide on a value decide on the same value
- No conflicting decisions are made
- The system maintains consistency even during failures
Liveness: Something good eventually happens. In practical terms:
- The system continues to make progress
- Decisions are eventually reached
- The system doesn’t get permanently stuck
The Consensus Problem Formally
Given a set of nodes in a distributed system, consensus algorithms solve the problem of getting all non-faulty nodes to agree on a single value, even in the presence of failures and network issues.
The algorithm must satisfy:
- Agreement: All correct nodes decide on the same value
- Validity: If all nodes propose the same value, that value is decided
- Termination: All correct nodes eventually decide
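The three properties above can be written down as simple predicates over a finished run. This is a minimal sketch, assuming we can observe each node's proposal and final decision (the node names and values are made up for illustration):

```python
# Hypothetical sketch: the consensus properties as checks over a run,
# given each node's proposed value and decided value.

def agreement(decisions: dict) -> bool:
    """All correct nodes decided on the same value."""
    return len(set(decisions.values())) <= 1

def validity(proposals: dict, decisions: dict) -> bool:
    """If every node proposed the same value, that value was decided."""
    proposed = set(proposals.values())
    if len(proposed) == 1:
        return set(decisions.values()) <= proposed
    return True  # with mixed proposals, any proposed value is acceptable

def termination(correct_nodes: set, decisions: dict) -> bool:
    """Every correct node eventually decided."""
    return correct_nodes <= decisions.keys()

proposals = {"n1": "x", "n2": "x", "n3": "x"}
decisions = {"n1": "x", "n2": "x", "n3": "x"}
assert agreement(decisions)
assert validity(proposals, decisions)
assert termination({"n1", "n2", "n3"}, decisions)
```

A consensus algorithm is precisely a protocol whose every possible run satisfies these checks, no matter how messages are delayed or which minority of nodes fails.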
Real-World Scenarios: Where Consensus is Critical
Distributed Databases
Leader Election for Primary-Replica Systems
Databases like PostgreSQL with streaming replication need to elect a new primary after the current primary fails. Without consensus, you might end up with split-brain scenarios where multiple nodes think they’re the primary.
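Raft-style election avoids split-brain with two rules: each node grants at most one vote per election term, and a candidate becomes leader only with votes from a strict majority of the cluster. A minimal single-round sketch (heavily simplified — real Raft also tracks terms and log freshness per node):

```python
# Hypothetical sketch: one election round. One vote per voter, and
# leadership requires a strict majority -- so two leaders cannot
# coexist in the same term.

def run_election(candidates, voters):
    votes = {}  # voter -> candidate; each voter votes at most once per term
    for candidate in candidates:        # candidates request votes in turn
        for voter in voters:
            if voter not in votes:      # a voter that already voted refuses
                votes[voter] = candidate
    quorum = len(voters) // 2 + 1       # strict majority of the whole cluster
    return [c for c in candidates
            if sum(1 for v in votes.values() if v == c) >= quorum]

voters = ["n1", "n2", "n3", "n4", "n5"]
leaders = run_election(candidates=["n1", "n2"], voters=voters)
assert len(leaders) <= 1                # never two leaders in one term
```

Since every leader needs a majority and each voter votes once, two majorities would have to share a voter — a contradiction. That is the whole anti-split-brain argument in one line of arithmetic.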
Distributed Transactions
When a transaction spans multiple database shards, all participants must agree on whether to commit or abort. The two-phase commit protocol is the classic way to reach that agreement — though, unlike true consensus algorithms, it can block if the coordinator fails at the wrong moment.
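A minimal sketch of the two phases (the `Shard` class and shard names are made up for illustration): the coordinator collects a yes/no vote from every participant, and commits only on a unanimous yes.

```python
# Hypothetical sketch of two-phase commit: any single "no" vote in the
# prepare phase aborts the transaction for everyone.

class Shard:
    def __init__(self, name: str, can_commit: bool):
        self.name, self.can_commit, self.state = name, can_commit, "init"

    def prepare(self) -> bool:           # phase 1: vote yes/no
        self.state = "prepared" if self.can_commit else "aborted"
        return self.can_commit

    def finish(self, commit: bool):      # phase 2: apply the global decision
        self.state = "committed" if commit else "aborted"

def two_phase_commit(shards) -> bool:
    votes = [s.prepare() for s in shards]   # collect every vote
    decision = all(votes)                   # unanimous yes => commit
    for s in shards:                        # (a real coordinator must survive
        s.finish(decision)                  #  a crash between these phases)
    return decision

shards = [Shard("users", True), Shard("orders", False)]
assert two_phase_commit(shards) is False
assert all(s.state == "aborted" for s in shards)
```

The blocking problem lives in the gap between the two phases: a shard that voted yes is stuck in "prepared" until it hears the decision, which is why production systems pair 2PC with a replicated, consensus-backed coordinator.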
Configuration Management
Systems like Apache Kafka use consensus (via ZooKeeper or KRaft) to manage broker membership, partition assignments, and configuration changes.
Microservices and Service Discovery
Service Registration
When services start up, they need to register with a service discovery system. Multiple discovery nodes must agree on the current set of healthy services.
Configuration Distribution
Changing application configuration across a fleet of microservices requires consensus to ensure all services get the same configuration version.
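One common pattern is to tag the configuration with a single monotonically increasing version and update it via compare-and-swap, so every service converges on the same version. A minimal sketch, assuming the store itself is consistent (in practice it would be backed by a consensus system such as etcd; the `ConfigStore` class here is hypothetical):

```python
# Hypothetical sketch: versioned configuration with compare-and-swap.
# A writer must name the version it last saw; a stale writer loses
# and must re-read and retry, so versions never fork.

class ConfigStore:
    def __init__(self):
        self.version, self.config = 0, {}

    def compare_and_swap(self, expected_version: int, new_config: dict) -> bool:
        if self.version != expected_version:
            return False             # someone else updated first; caller retries
        self.version += 1
        self.config = new_config
        return True

store = ConfigStore()
assert store.compare_and_swap(0, {"timeout_ms": 500})       # first writer wins
assert not store.compare_and_swap(0, {"timeout_ms": 900})   # stale writer loses
assert store.version == 1 and store.config == {"timeout_ms": 500}
```

Services then simply apply the highest version they have seen; because versions are assigned by one consistent store, no two services can end up on conflicting configurations with the same version number.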
Container Orchestration
Scheduler Decisions
Kubernetes stores cluster state and scheduling decisions in etcd. Multiple control-plane nodes must agree on pod placements and resource allocations.
Node Membership
The cluster must agree on which nodes are healthy and available for scheduling workloads.
Footnotes
Consensus algorithms like Raft are the foundation of reliable distributed systems.
The next time you are building a system that needs to keep state in sync across services or data centers, remember that Raft and similar algorithms have already solved the hardest problems. Your task is to know when and how to use them effectively.
Also, knowing them helps you understand why and how distributed systems are both a beauty and a beast.