Why Distributed Systems Need Consensus Algorithms Like Raft

Arpit Bhayani



Almost everything we interact with today is a “distributed system”. From microservices to cloud-native applications, from databases to message queues, we are constantly building systems that span multiple machines. That raises a fundamental challenge: how do we ensure that all independent nodes agree on shared state?

This is where consensus algorithms like Raft come in. I cover Raft itself in a separate write-up; in this one, let me help you understand why consensus is needed and why it is a tough nut to crack.

Why Agreement is Hard

It Does Seem Simple

At first glance, getting multiple machines to agree seems straightforward. Can’t we just have one machine make decisions and tell the others? Or use timestamps to order operations? Unfortunately, distributed systems introduce complexities that make these naive approaches fail catastrophically.

Things That Can Go Wrong

Distributed systems face several simultaneous challenges; here are some of them.

Network Partitions: Networks can split, leaving groups of nodes unable to communicate. Unlike a simple “network down” scenario, partitions create islands of nodes that continue operating independently.

Node Failures: Machines crash, sometimes permanently, sometimes temporarily. They might restart with partial memory loss or a corrupted state.

Asynchronous Communication: Message delivery times are unpredictable. A message sent first might arrive second, or might never arrive at all.

Clock Skew: Different machines have different notions of time, making timestamp-based ordering unreliable.

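Majority Vote: Sounds simple, but what happens during network partitions? If each group counts only the nodes it can currently reach, you might have multiple groups, each thinking they’re the majority. A majority is only meaningful when it is measured against a fixed, agreed-upon cluster membership.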
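To make the majority point concrete, here is a minimal sketch of a quorum check. The function names and the 5-node cluster are my own assumptions for illustration; the idea is simply that a strict majority of a fixed cluster size can exist on at most one side of a partition, while counting only the nodes you can see lets every side "win".

```python
def has_quorum(reachable: int, cluster_size: int) -> bool:
    """True if `reachable` nodes form a strict majority of the configured cluster."""
    return reachable > cluster_size // 2


def partitions_with_quorum(partition_sizes: list[int], cluster_size: int) -> list[int]:
    """Return the sizes of the partitions that are still allowed to make decisions."""
    return [size for size in partition_sizes if has_quorum(size, cluster_size)]


# A 5-node cluster split 3 / 2: only the 3-node side may proceed.
safe = partitions_with_quorum([3, 2], cluster_size=5)
print(safe)

# Naive variant (the failure mode): each side treats the nodes it can
# currently see as "the cluster", so both sides believe they are the majority.
naive = [size for size in [3, 2] if has_quorum(size, cluster_size=size)]
print(naive)
```

Note that an even split, say 2 / 2 in a 4-node cluster, leaves no side with a quorum, which is one reason odd cluster sizes are popular.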

The Bank Account Problem

Imagine a distributed banking system with account data replicated across three servers in different data centers. A customer has $1000 and tries to withdraw $800. Simultaneously, their spouse tries to withdraw $500 from a different location.

Without proper consensus:

  • Server A processes the $800 withdrawal: balance becomes $200
  • Server B processes the $500 withdrawal: balance becomes $500
  • Server C might see both operations in different orders

Here, servers A, B, and C are geographically distributed database nodes, each capable of accepting and handling writes. Both withdrawals succeed, so $1300 leaves an account that held only $1000, and the replicas no longer agree on the balance.
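The scenario above can be simulated in a few lines. This is a toy sketch, not how any real database works: `Replica` and `apply_in_order` are hypothetical names, and the replicas simply never talk to each other.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    name: str
    balance: int = 1000
    log: list = field(default_factory=list)

    def withdraw(self, amount: int) -> bool:
        """Apply a withdrawal using only this replica's local view."""
        if amount > self.balance:
            return False
        self.balance -= amount
        self.log.append(amount)
        return True


a, b = Replica("A"), Replica("B")
ok_a = a.withdraw(800)  # A has not heard about B's write
ok_b = b.withdraw(500)  # B has not heard about A's write

# Both withdrawals were accepted, and the replicas disagree on the balance.
total_paid_out = sum(a.log) + sum(b.log)
print(ok_a, ok_b, a.balance, b.balance, total_paid_out)


def apply_in_order(withdrawals, start=1000):
    """Apply withdrawals in one agreed order, rejecting any that overdraw."""
    balance, accepted = start, []
    for amount in withdrawals:
        if amount <= balance:
            balance -= amount
            accepted.append(amount)
    return balance, accepted


# Whichever order is agreed on, only one of the two withdrawals goes through.
print(apply_in_order([800, 500]))
print(apply_in_order([500, 800]))
```

The second half is the part consensus buys you: once every replica applies the same operations in the same order, they all reject the same withdrawal and end with the same balance.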

Now you can see why consensus is a legitimate real-world problem.

What Consensus Algorithms Actually Solve

The Core Guarantee: Safety and Liveness

Consensus algorithms (like Paxos or Raft) provide two fundamental guarantees:

Safety: Nothing bad ever happens. In practical terms:

  • All nodes that decide on a value decide on the same value
  • No conflicting decisions are made
  • The system maintains consistency even during failures

Liveness: Something good eventually happens. In practical terms:

  • The system continues to make progress
  • Decisions are eventually reached
  • The system doesn’t get permanently stuck

The Consensus Problem Formally

Given a set of nodes in a distributed system, consensus algorithms solve the problem of getting all non-faulty nodes to agree on a single value, even in the presence of failures and network issues.

The algorithm must satisfy:

  1. Agreement: All correct nodes decide on the same value
  2. Validity: If all nodes propose the same value, that value is decided
  3. Termination: All correct nodes eventually decide
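One way to internalize these three properties is to write them as checks over the outcome of a single run. The sketch below is a hypothetical test helper of my own, not part of any consensus library, and it can only inspect a finished run, so "termination" here just means every correct node had decided by the time we looked.

```python
def check_consensus(proposals: dict, decisions: dict, correct: set) -> dict:
    """Check one finished run against agreement, validity, and termination.

    proposals: node -> value it proposed
    decisions: node -> value it decided (missing key = never decided)
    correct:   nodes that did not fail during the run
    """
    decided = {decisions[n] for n in correct if n in decisions}
    proposed = set(proposals.values())

    # Agreement: correct nodes never decide on two different values.
    agreement = len(decided) <= 1
    # Validity (as stated above): a unanimous proposal is the only allowed decision.
    validity = len(proposed) != 1 or decided <= proposed
    # Termination: every correct node has decided.
    termination = all(n in decisions for n in correct)
    return {"agreement": agreement, "validity": validity, "termination": termination}


# n3 crashed before deciding, which is fine: it is not a correct node.
good = check_consensus(
    proposals={"n1": "x", "n2": "x", "n3": "y"},
    decisions={"n1": "x", "n2": "x"},
    correct={"n1", "n2"},
)

# Two correct nodes decided differently, and n3 never decided.
bad = check_consensus(
    proposals={"n1": "x", "n2": "y", "n3": "y"},
    decisions={"n1": "x", "n2": "y"},
    correct={"n1", "n2", "n3"},
)
print(good)
print(bad)
```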

Real-World Scenarios: Where Consensus is Critical

Distributed Databases

Leader Election for Primary-Replica Systems

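Primary-replica setups, such as PostgreSQL with streaming replication, need a new primary to be chosen after the current primary fails. PostgreSQL does not run this election itself; failover managers such as Patroni typically delegate it to a consensus-backed store like etcd. Without consensus, you might end up with split-brain scenarios where multiple nodes think they’re the primary.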
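Here is a rough sketch of why majority voting with numbered terms avoids split-brain. It borrows the "one vote per node per term" idea from Raft but is heavily simplified: the `Voter` class, the node names, and the term number are all assumptions for illustration, and real implementations also persist votes to disk and compare log freshness before granting them.

```python
class Voter:
    """Grants at most one vote per term, which is what rules out two winners."""

    def __init__(self):
        self.voted_in = {}  # term -> candidate that received this node's vote

    def request_vote(self, term: int, candidate: str) -> bool:
        # The first candidate to ask in a term gets the vote; later ones are refused.
        granted_to = self.voted_in.setdefault(term, candidate)
        return granted_to == candidate


def run_election(candidate, term, reachable_voters, cluster_size) -> bool:
    """A candidate wins only with votes from a strict majority of the full cluster."""
    votes = sum(v.request_vote(term, candidate) for v in reachable_voters)
    return votes > cluster_size // 2


voters = [Voter() for _ in range(5)]

# "pg-1" reaches voters 0-2. "pg-2" reaches voters 2-4, but voter 2 has
# already voted in term 7, so "pg-2" collects only two votes.
won_1 = run_election("pg-1", term=7, reachable_voters=voters[:3], cluster_size=5)
won_2 = run_election("pg-2", term=7, reachable_voters=voters[2:], cluster_size=5)
print(won_1, won_2)
```

Because any two majorities of the same cluster overlap in at least one node, and that node votes only once per term, two candidates can never both win the same term.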

Distributed Transactions

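When a transaction spans multiple database shards, all participants must agree on whether to commit or abort. The two-phase commit protocol solves this closely related agreement problem (atomic commit), but it is not a fault-tolerant consensus algorithm: if the coordinator crashes after participants have prepared, they stay blocked until it recovers. Replicating the coordinator's decision through a consensus algorithm is one way to remove that single point of failure.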
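For reference, the happy path of two-phase commit looks roughly like this. It is an in-memory sketch with hypothetical `Participant` and `two_phase_commit` names; it deliberately leaves out timeouts, durable logging, and coordinator failure, which are exactly the parts that make the real protocol hard.

```python
from enum import Enum


class Vote(Enum):
    YES = "yes"
    NO = "no"


class Participant:
    def __init__(self, name: str, can_commit: bool = True):
        self.name = name
        self.can_commit = can_commit
        self.state = "init"

    def prepare(self) -> Vote:
        """Phase 1: promise to commit if asked, or vote no."""
        self.state = "prepared" if self.can_commit else "aborted"
        return Vote.YES if self.can_commit else Vote.NO

    def finish(self, commit: bool) -> None:
        """Phase 2: apply the coordinator's single decision."""
        self.state = "committed" if commit else "aborted"


def two_phase_commit(participants) -> bool:
    # Phase 1: ask everyone to prepare and collect the votes.
    votes = [p.prepare() for p in participants]
    decision = all(v is Vote.YES for v in votes)
    # Phase 2: broadcast the same decision to every participant.
    for p in participants:
        p.finish(decision)
    return decision


shards = [Participant("shard-1"), Participant("shard-2", can_commit=False)]
committed = two_phase_commit(shards)
print(committed, [p.state for p in shards])
```

A single "no" vote aborts the transaction everywhere. What the sketch cannot show is a participant stuck in the "prepared" state because the coordinator disappeared between the two phases.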

Configuration Management

Systems like Apache Kafka use consensus (via ZooKeeper or KRaft) to manage broker membership, partition assignments, and configuration changes.

Microservices and Service Discovery

Service Registration

When services start up, they need to register with a service discovery system. Multiple discovery nodes must agree on the current set of healthy services.

Configuration Distribution

Changing application configuration across a fleet of microservices needs a single agreed source of truth, typically a consensus-backed store, so that all services converge on the same configuration version.

Container Orchestration

Scheduler Decisions

Kubernetes keeps cluster state, including which node each pod is bound to, in etcd, which replicates that state using Raft. When there are multiple control-plane nodes, they must all see the same pod placements and resource allocations.
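Once state lives in one agreed-upon store, racing writers can be resolved with a conditional write. The sketch below is a hypothetical in-memory `VersionedStore`, not the etcd or Kubernetes API; it only illustrates the compare-and-set idea, with made-up key and node names.

```python
class VersionedStore:
    """A toy key-value store where every successful write bumps the key's version."""

    def __init__(self):
        self._data = {}  # key -> (version, value)

    def get(self, key):
        # Keys that were never written report version 0 and no value.
        return self._data.get(key, (0, None))

    def compare_and_set(self, key, expected_version: int, value) -> bool:
        """Write only if nobody else has written since `expected_version` was read."""
        version, _ = self.get(key)
        if version != expected_version:
            return False
        self._data[key] = (version + 1, value)
        return True


store = VersionedStore()

# Two schedulers read the same (unplaced) pod, then both try to place it.
version, _ = store.get("pod/web-1")
first = store.compare_and_set("pod/web-1", version, "node-a")
second = store.compare_and_set("pod/web-1", version, "node-b")  # stale version
print(first, second, store.get("pod/web-1"))
```

The second write loses because its view of the key is stale, so the pod ends up on exactly one node. In a real cluster, the hard part is making this store itself replicated and consistent, and that is the job consensus does.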

Node Membership

The cluster must agree on which nodes are healthy and available for scheduling workloads.

Footnotes

Consensus algorithms like Raft are the foundation of reliable distributed systems.

The next time you are building a system that needs to keep state in sync across services or data centers, remember that Raft and similar algorithms have already solved the hardest problems. Your task is to know when and how to use them effectively.

Also, knowing them helps you understand why and how distributed systems are both a beauty and a beast.

