Apache Cassandra is a distributed database designed for high availability and horizontal scalability. This write-up explores the complete write path in Cassandra, from the moment a client sends a write request to how data gets replicated across nodes in the cluster.
When a client writes data to Cassandra, the data flows through multiple stages before being safely stored. The write path involves several key components working together to ensure durability, consistency, and performance. Let’s break this down…
Client Connection and Coordinator Selection
Every write request in Cassandra starts with a client connecting to any node in the cluster. The node that receives the client request becomes the coordinator for that operation. This is an important concept: there is no “master” node in Cassandra—any node can coordinate any request.
The coordinator’s role is to:
- Accept the write request from the client
- Determine which nodes should store replicas of the data
- Forward the write to those replica nodes
- Wait for acknowledgments based on the consistency level
- Respond to the client
The coordinator doesn’t necessarily store the data itself, though it might be one of the replica nodes depending on the partition key and replication strategy.
Determining Replica Nodes
Before the coordinator can forward writes, it needs to determine which nodes should store the data. This decision is based on three key factors.
Partition Key and Token
Every piece of data in Cassandra has a partition key. This key is hashed using a partitioner (typically Murmur3Partitioner) to produce a token—a 64-bit integer that determines where the data lives in the cluster.
Partition Key → Hash Function → Token → Node(s)
For example, if you have a table storing user data:
CREATE TABLE users (
    user_id uuid PRIMARY KEY,
    name text,
    email text
);
When you insert a user with user_id = '550e8400-e29b-41d4-a716-446655440000', Cassandra hashes this UUID to produce a token. This token falls within a range owned by specific nodes in the cluster.
Token Ring and Consistent Hashing
Cassandra organizes nodes in a logical token ring. The entire token space (from -2^63 to 2^63-1 for Murmur3) is divided among the nodes in the cluster. Each node is assigned a token range and is responsible for storing data whose tokens fall within that range.
When you add or remove nodes, the token ranges are redistributed, but only a fraction of the data needs to move—this is the beauty of consistent hashing.
Replication Strategy
The replication strategy determines how many copies of the data exist and where they’re placed. There are two main strategies:
SimpleStrategy: Used for single data center deployments. It places replicas on consecutive nodes in the ring. For example, with a replication factor of 3, the data is stored on the first node determined by the token, plus the next two nodes clockwise in the ring.
NetworkTopologyStrategy: Used for multi-data center deployments. It allows you to specify how many replicas should exist in each data center. For example:
CREATE KEYSPACE my_keyspace WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'datacenter1': 3,
    'datacenter2': 2
};
This creates 3 replicas in datacenter1 and 2 in datacenter2. Within each data center, replicas are placed on different racks when possible to maximize availability.
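To tie the token and placement ideas together before moving inside a node, here is a toy, self-contained sketch of SimpleStrategy-style placement. It is illustrative only: Python's md5 stands in for Murmur3Partitioner (so real token values will differ), and the node names and token assignments are invented for the example.
import bisect
import hashlib

# Toy ring: each entry is (token owned by the node, node name). Values are made up.
RING = sorted([
    (-6_000_000_000_000_000_000, "node1"),
    (-1_500_000_000_000_000_000, "node2"),
    (2_000_000_000_000_000_000, "node3"),
    (7_000_000_000_000_000_000, "node4"),
])
TOKENS = [token for token, _ in RING]

def token_for(partition_key: str) -> int:
    # Stand-in for Murmur3Partitioner: map the key to a signed 64-bit token.
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

def replicas_for(partition_key: str, rf: int = 3) -> list[str]:
    # SimpleStrategy-style placement: the node owning the token's range,
    # plus the next rf-1 nodes walking clockwise around the ring.
    token = token_for(partition_key)
    start = bisect.bisect_left(TOKENS, token) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(rf)]

print(replicas_for("550e8400-e29b-41d4-a716-446655440000"))
# Prints three node names; which ones depends entirely on the toy hash above.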
The Write Path Within a Node
Once the coordinator determines which nodes should receive the write, it forwards the mutation to those nodes. Now let’s see what happens inside each replica node when it receives a write request.
CommitLog
The very first thing that happens when a node receives a write is that it appends the mutation to the CommitLog. This is a critical step for durability.
The CommitLog is an append-only log file on disk. It’s structured as a sequential write, which is extremely fast—modern SSDs can handle hundreds of thousands of sequential writes per second. The CommitLog entry contains:
- The keyspace and table name
- The partition key
- The clustering keys (if any)
- The column values
- Timestamp and TTL information
By writing to the CommitLog first, Cassandra ensures that even if the node crashes immediately after, the write can be recovered when the node restarts. This is the Write-Ahead Log (WAL) pattern.
CommitLog configuration considerations:
# In cassandra.yaml
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2
commitlog_segment_size_in_mb: 32
commitlog_compression:
- class_name: LZ4Compressor
The commitlog_sync parameter has two modes:
- periodic (the default): the CommitLog is fsynced every commitlog_sync_period_in_ms (10 seconds by default). Writes are acknowledged before the sync, so a crash can lose up to that window of acknowledged writes on that node. Faster, but less durable per node.
- batch: a write is not acknowledged until the CommitLog has been fsynced; writes arriving within the batch window (default 2ms) are grouped into a single sync. More durable, but adds latency to every write.
Most deployments keep the default periodic mode and lean on replication for durability; batch mode is worth the extra latency only when single-node durability is critical.
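The behaviour above is the classic write-ahead-log pattern: append the mutation to a sequential file, fsync according to the sync mode, and only then rely on in-memory state. A minimal sketch of that idea (not Cassandra's actual implementation; the JSON-per-line format and the class are invented for illustration):
import json
import os

class TinyCommitLog:
    """Append-only log illustrating the WAL pattern and the two sync modes."""

    def __init__(self, path: str, sync_mode: str = "periodic"):
        self.segment = open(path, "ab")
        self.sync_mode = sync_mode  # "batch" fsyncs before the write is acknowledged

    def append(self, mutation: dict) -> None:
        # Sequential append only; no random I/O, which is what makes it fast.
        self.segment.write(json.dumps(mutation).encode() + b"\n")
        if self.sync_mode == "batch":
            self.sync()  # don't acknowledge the write until it is on disk

    def sync(self) -> None:
        # In periodic mode a background timer would call this every
        # commitlog_sync_period_in_ms instead.
        self.segment.flush()
        os.fsync(self.segment.fileno())

log = TinyCommitLog("/tmp/commitlog-segment.log", sync_mode="batch")
log.append({"table": "users", "key": "user_123", "name": "Alice", "ts": 1705320000000000})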
MemTable
Immediately after writing to the CommitLog, the mutation is written to an in-memory structure called a MemTable. Each table in Cassandra has its own MemTable.
The MemTable is essentially a sorted map structure (similar to a Red-Black tree or Skip List) that keeps data sorted by partition key and clustering columns. This sorting is crucial for efficient reads and for the later flush to disk.
Multiple writes to the same partition are merged into the same MemTable. Conceptually, though, Cassandra never updates data in place: every write is a new timestamped cell. When you "update" a column, you are really writing a new cell with a newer timestamp; within a MemTable the newer cell supersedes the older one, and versions that end up spread across MemTables and SSTables are reconciled at read time (and during compaction).
Example of MemTable organization:
Partition Key: user_123
├─ Clustering: timestamp=2025-01-15T10:00:00, value="xxxx"
├─ Clustering: timestamp=2025-01-15T10:05:00, value="xxxxxxxx"
└─ Clustering: timestamp=2025-01-15T10:10:00, value="xxxxxx"
Partition Key: user_456
├─ Clustering: timestamp=2025-01-15T10:02:00, value="xxxxxxxxxx"
└─ Clustering: timestamp=2025-01-15T10:12:00, value="xxxxxxxxx"
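A rough model of that structure, using a plain Python dict and sorting only when we need to iterate; real MemTables use concurrent sorted structures and a much richer cell format, so treat this purely as a sketch:
class TinyMemTable:
    """Maps (partition_key, clustering_key, column) -> (value, timestamp)."""

    def __init__(self):
        self.cells = {}

    def write(self, partition_key, clustering_key, column, value, timestamp):
        key = (partition_key, clustering_key, column)
        current = self.cells.get(key)
        # Within the MemTable, the cell with the higher timestamp supersedes the older one.
        if current is None or timestamp >= current[1]:
            self.cells[key] = (value, timestamp)

    def sorted_cells(self):
        # Emit cells ordered by partition key, then clustering key, then column --
        # exactly the order an SSTable flush needs.
        return sorted(self.cells.items())

mt = TinyMemTable()
mt.write("user_123", "2025-01-15T10:00:00", "value", "xxxx", timestamp=1)
mt.write("user_123", "2025-01-15T10:00:00", "value", "yyyy", timestamp=2)  # newer version wins
print(mt.sorted_cells())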
Write Response
Once the write is in the CommitLog and MemTable, the node sends an acknowledgment back to the coordinator. This happens very quickly, typically in microseconds, because it only involves an append to the CommitLog and an in-memory update.
At this point, the write is considered successful from the replica node’s perspective, but the coordinator hasn’t responded to the client yet. That depends on the consistency level.
Flushing MemTables to SSTables
The MemTable is memory-bound, so it can’t grow indefinitely. When a MemTable reaches a certain size threshold or after a certain time period, it’s flushed to disk as an SSTable (Sorted String Table).
The flush process:
- The MemTable is marked as immutable (no new writes)
- A new MemTable is created for incoming writes
- The immutable MemTable is sorted and written to disk as an SSTable
- The corresponding CommitLog segments are marked for deletion
SSTables are immutable once written. This immutability provides several benefits:
- No need for read-write locks
- Simplified backup and recovery
- Efficient sequential disk access
- Easy to distribute across nodes
However, immutability means we accumulate multiple SSTables over time, which is why compaction is necessary.
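To make the flush steps concrete, here is a toy continuation of the MemTable sketch above (it reuses TinyMemTable): the old MemTable is treated as immutable, its cells are written out in sorted order, and only then can the covering CommitLog segments be recycled. The one-JSON-object-per-line file format is invented; real SSTables are a multi-file binary format with indexes, summaries, and bloom filters.
import json

def flush_memtable(memtable, sstable_path):
    # Steps 1-2 happen in the caller: the MemTable passed in is no longer
    # accepting writes, and a fresh MemTable has already replaced it.
    with open(sstable_path, "w") as sstable:
        # Step 3: write the cells in sorted order -- this is what makes the
        # result a *Sorted* String Table.
        for key, (value, timestamp) in memtable.sorted_cells():
            sstable.write(json.dumps({"key": key, "value": value, "ts": timestamp}) + "\n")
    # Step 4: at this point Cassandra can mark the CommitLog segments that
    # covered this MemTable as eligible for recycling.

flush_memtable(mt, "/tmp/users-1-toy-Data.json")  # mt from the MemTable sketch above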
Consistency Levels
The consistency level determines how many replica nodes must acknowledge a write before the coordinator responds to the client. This is configurable per query and represents a fundamental trade-off between consistency, availability, and latency.
ONE
The coordinator waits for acknowledgment from just one replica node. This provides the lowest latency and highest availability - the write succeeds as long as any single replica node is reachable. However, you might read stale data if you subsequently read from a node that hasn’t received the write yet.
QUORUM
The coordinator waits for acknowledgments from a majority of replica nodes. For a replication factor of 3, QUORUM means 2 nodes must acknowledge. This is the most commonly used level in production because it provides a good balance:
- QUORUM = floor(replication_factor / 2) + 1
- Guarantees that reads and writes overlap (if you read with QUORUM, you’ll see the most recent QUORUM write)
- Can tolerate the failure of a minority of replicas
LOCAL_QUORUM
Similar to QUORUM, but only counts replicas in the local data center. This is crucial for multi-datacenter deployments because you don’t want to wait for cross-datacenter network latency on every write.
ALL
The coordinator waits for all replica nodes to acknowledge. This provides the strongest consistency but has the worst availability; if even one replica is down, writes fail. Generally not recommended for production.
ANY
This is an interesting case. The write is considered successful if it can be written to at least one node, or if it can be stored as a hint for a temporarily unavailable node. More on hints below.
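All of these levels reduce to how many acknowledgments the coordinator waits for. A small helper to make the arithmetic explicit (single-datacenter view; LOCAL_QUORUM applies the same formula to the local datacenter's replicas only, and ANY is approximated here because a stored hint can also satisfy it):
def acks_required(consistency_level: str, replication_factor: int) -> int:
    """How many replica acknowledgments the coordinator waits for."""
    if consistency_level in ("ANY", "ONE"):
        return 1  # for ANY, a hint stored for a down replica also counts
    if consistency_level in ("QUORUM", "LOCAL_QUORUM"):
        return replication_factor // 2 + 1
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {consistency_level}")

for level in ("ONE", "QUORUM", "ALL"):
    print(level, acks_required(level, replication_factor=3))  # 1, 2, 3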
Here’s an Example
Consider a cluster with a replication factor of 3, and you issue a write with consistency level QUORUM:
# Using the Python driver
import uuid

from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel

cluster = Cluster(['10.0.0.1'])
session = cluster.connect('my_keyspace')

# Prepare a statement with QUORUM consistency
statement = session.prepare("""
    INSERT INTO users (user_id, name, email)
    VALUES (?, ?, ?)
""")
statement.consistency_level = ConsistencyLevel.QUORUM

# Execute the write
user_id = uuid.uuid4()
session.execute(statement, (user_id, "Alice", "alice@example.com"))
Timeline:
- Client sends write to Node A (coordinator)
- Node A determines replicas: Nodes B, C, and D
- Node A forwards write to Nodes B, C, and D
- Within 1-2ms, Nodes B and C respond (they’ve written to CommitLog and MemTable)
- Node A has 2 acknowledgments (QUORUM satisfied)
- Node A responds success to the client
- Node D responds later (but the coordinator already replied to the client)
The entire operation typically completes in 2-5ms, depending on network latency and load.
Data Replication
Now let’s dive deeper into how data actually gets replicated across nodes. The replication happens synchronously as part of the write operation; there’s no separate background replication process for normal writes.
When the coordinator receives a write, it immediately forwards the mutation to all replica nodes (as determined by the replication strategy and replication factor). These writes are sent in parallel. The coordinator waits only for as many acknowledgments as the consistency level requires, but the mutation is still sent to every replica right away.
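A simplified sketch of that coordinator behaviour using asyncio: the mutation is sent to every replica at once, the client is answered as soon as enough acknowledgments have arrived, and the remaining replica writes are simply allowed to finish. send_to_replica is a stand-in for real inter-node messaging, with simulated latency.
import asyncio
import random

async def send_to_replica(replica: str, mutation: dict) -> str:
    # Stand-in for the real inter-node write; latency is simulated.
    await asyncio.sleep(random.uniform(0.001, 0.01))
    return replica

async def coordinate_write(mutation: dict, replicas: list[str], acks_needed: int) -> None:
    # Forward to ALL replicas in parallel, regardless of the consistency level.
    pending = [asyncio.ensure_future(send_to_replica(r, mutation)) for r in replicas]
    acked = 0
    while pending and acked < acks_needed:
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        acked += len(done)
    print(f"responding to client after {acked} acknowledgments")
    # The slower replicas still receive the write; we just no longer wait on them
    # before answering the client. (Here we let them finish so the toy exits cleanly.)
    if pending:
        await asyncio.wait(pending)

asyncio.run(coordinate_write({"key": "user_123"}, ["node_b", "node_c", "node_d"], acks_needed=2))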
Here’s what actually gets sent over the network:
Mutation Message:
- Keyspace: "my_keyspace"
- Table: "users"
- Partition Key: user_id = UUID('...')
- Timestamp: 1705320000000000 (microseconds since epoch)
- Columns:
- name: "Alice"
- email: "alice@example.com"
- TTL: null
Each replica node processes this mutation independently through its own CommitLog → MemTable path.
Timestamps and Conflict Resolution
Cassandra uses last-write-wins (LWW) conflict resolution based on timestamps. Every mutation includes a timestamp (in microseconds), and when multiple versions of the same data exist, the one with the highest timestamp wins.
Timestamps can come from two sources:
- Client-provided timestamps: Using the USING TIMESTAMP clause:
INSERT INTO users (user_id, name) VALUES (?, ?)
USING TIMESTAMP 1705320000000000;
- Coordinator-generated timestamps: If the client doesn’t provide one, the coordinator assigns a timestamp based on its local clock.
If your cluster's clocks are not synchronized (via NTP), you can get unexpected results. A write issued later in real time can be silently lost if its coordinator's clock is behind: it gets timestamp 999 while the earlier write got timestamp 1000, so last-write-wins keeps the older value.
Hence, always use NTP to keep clocks synchronized across your cluster, with clock drift kept under 500ms.
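A tiny illustration of the failure mode, assuming two versions of the same cell: last-write-wins compares only timestamps, so an update stamped by a lagging clock loses even though it happened later in real time.
def reconcile(cell_a: dict, cell_b: dict) -> dict:
    # Last-write-wins: the cell with the higher timestamp is kept.
    return cell_a if cell_a["ts"] >= cell_b["ts"] else cell_b

first_update = {"value": "alice@old.example.com", "ts": 1000}   # coordinator with a fast clock
second_update = {"value": "alice@new.example.com", "ts": 999}   # issued later, lagging clock

print(reconcile(first_update, second_update))
# {'value': 'alice@old.example.com', 'ts': 1000} -- the newer update is silently lost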
Handling Network Partitions
In a distributed system, network partitions are inevitable. Cassandra handles these gracefully through its consistency model and hinted handoff mechanism.
Suppose we have RF=3 (replication factor), CL=QUORUM, and Node C goes down:
- Coordinator forwards write to Nodes A, B, and C
- Nodes A and B respond successfully
- Node C doesn’t respond (it’s down)
- Coordinator has 2 responses = QUORUM satisfied
- Write succeeds from the client’s perspective
- Coordinator stores a hint for Node C
Hinted Handoff
A hint is a special write that says, “when Node C comes back online, replay this mutation to it.” Hints are stored on the coordinator (or other available nodes) in a system table.
Hint structure:
- Target node: C
- Mutation: INSERT INTO users (user_id, name) VALUES (...)
- Expiration: 3 hours (configurable via max_hint_window_in_ms)
When Node C comes back online, other nodes detect it and start replaying hints to it. This brings Node C up to date with the writes it missed while down.
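A rough sketch of that bookkeeping (Cassandra's hint storage, failure detection, and replay machinery are far more involved; this just shows the idea of stashing a mutation with an expiry and replaying it when the target returns):
import time

MAX_HINT_WINDOW_SECONDS = 3 * 3600  # mirrors max_hint_window_in_ms (default 3 hours)
hints = []  # in Cassandra, hints are persisted locally, not kept in memory like this

def store_hint(target_node: str, mutation: dict) -> None:
    hints.append({"target": target_node, "mutation": mutation, "stored_at": time.time()})

def replay_hints(recovered_node: str, send_mutation) -> None:
    now = time.time()
    for hint in list(hints):
        if hint["target"] != recovered_node:
            continue
        hints.remove(hint)
        if now - hint["stored_at"] > MAX_HINT_WINDOW_SECONDS:
            continue  # expired: only an explicit repair can fix this replica now
        send_mutation(hint["mutation"])  # deliver the missed write to the recovered node

store_hint("node_c", {"table": "users", "key": "user_123", "name": "Alice"})
replay_hints("node_c", send_mutation=print)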
Important limitations of hints:
- Hints are best-effort, not guaranteed
- If a node is down for longer than max_hint_window_in_ms (default 3 hours), hints expire
- Hints increase the load on the coordinator
- For extended outages, run a repair instead
Read Repair
Even with a hinted handoff, replicas can get out of sync. Cassandra uses read repair to detect and fix inconsistencies during read operations.
When you read with a consistency level higher than ONE, the coordinator sends read requests to multiple replicas and compares their responses. If it finds differences, it:
- Returns the most recent data to the client (highest timestamp wins)
- Sends the most recent data to replicas with stale data
- Those replicas update themselves in the background
This happens transparently and helps maintain eventual consistency. Older versions also exposed a read_repair_chance table option for probabilistic background read repair:
ALTER TABLE users WITH read_repair_chance = 0.1;
This told Cassandra to perform read repair on 10% of reads, even when the consistency level was ONE. The option was deprecated in the 3.x line and removed in Cassandra 4.0; modern clusters rely on the blocking read repair described above plus explicit repair operations.
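Mechanically, blocking read repair boils down to "compare timestamps, return the winner, push it to whoever was stale". A sketch with made-up replica responses:
def read_repair(responses: dict) -> tuple:
    """responses: {replica_name: {"value": ..., "ts": ...}} gathered for one read."""
    # The freshest version (highest timestamp) is what the client receives.
    winner = max(responses.values(), key=lambda cell: cell["ts"])
    # Replicas that returned something older get the winning cell written back.
    stale = [name for name, cell in responses.items() if cell["ts"] < winner["ts"]]
    return winner, stale

winner, stale_replicas = read_repair({
    "node_a": {"value": "alice@new.example.com", "ts": 1705320005000000},
    "node_b": {"value": "alice@old.example.com", "ts": 1705320000000000},
})
print(winner["value"], "-> repair written to", stale_replicas)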
Write Performance Characteristics
Write Throughput
Cassandra is write-optimized. A single node can handle 10,000-50,000 writes per second, depending on:
- Hardware (SSD vs HDD, CPU cores, RAM)
- Data model (wide vs narrow partitions)
- Consistency level (ONE vs QUORUM vs ALL)
- Batch size (single writes vs batched writes)
The write path is designed for speed:
- Sequential CommitLog writes (no random I/O)
- In-memory MemTable updates (microsecond-scale latency)
- Parallel replication (no waiting for sequential replication)
Write Latency
Typical write latencies (p99):
- Consistency Level ONE: 1-3ms
- Consistency Level QUORUM: 2-5ms
- Consistency Level LOCAL_QUORUM: 2-5ms (within a datacenter)
- Consistency Level ALL: 5-20ms (depends on slowest node)
These are remarkably consistent because writes don’t involve disk reads—only sequential appends to the CommitLog.
Factors Affecting Write Performance
- Consistency Level
Higher consistency levels increase latency because the coordinator must wait for more acknowledgments.
- Batch Writes
Cassandra supports batches, but they’re not always a performance win:
BEGIN BATCH
INSERT INTO users (user_id, name) VALUES (uuid1, 'Alice');
INSERT INTO users (user_id, name) VALUES (uuid2, 'Bob');
INSERT INTO users (user_id, name) VALUES (uuid3, 'Charlie');
APPLY BATCH;
Logged batches (default) use a batch log for atomicity, which adds overhead. They’re useful for ensuring multiple writes succeed or fail together, but they’re slower than individual writes.
Unlogged batches skip the batch log and are faster, but they don’t guarantee atomicity. They’re only useful for performance when writing to the same partition:
BEGIN UNLOGGED BATCH
INSERT INTO user_events (user_id, timestamp, event) VALUES (uuid1, t1, 'login');
INSERT INTO user_events (user_id, timestamp, event) VALUES (uuid1, t2, 'purchase');
INSERT INTO user_events (user_id, timestamp, event) VALUES (uuid1, t3, 'logout');
APPLY BATCH;
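With the Python driver, the same-partition unlogged batch looks roughly like this. It assumes the user_events table from the statements above and a reachable cluster at 10.0.0.1; BatchStatement and BatchType are part of the driver's public API.
import uuid
from datetime import datetime, timedelta

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

cluster = Cluster(['10.0.0.1'])
session = cluster.connect('my_keyspace')

insert_event = session.prepare(
    "INSERT INTO user_events (user_id, timestamp, event) VALUES (?, ?, ?)"
)

user_id = uuid.uuid4()
start = datetime.utcnow()

# Unlogged batch: no batch log, so no atomicity guarantee -- it is only a win
# because every statement targets the same partition (the same user_id).
batch = BatchStatement(batch_type=BatchType.UNLOGGED)
batch.add(insert_event, (user_id, start, 'login'))
batch.add(insert_event, (user_id, start + timedelta(minutes=5), 'purchase'))
batch.add(insert_event, (user_id, start + timedelta(minutes=10), 'logout'))
session.execute(batch)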
- Partition Size
Writing to the same partition repeatedly can create “hot spots.” Cassandra performs best when writes are distributed across many partitions. If you have a counter table that updates the same partition millions of times, you’ll experience performance degradation.
- Hardware
- SSD vs HDD: SSDs provide much better write performance (CommitLog especially)
- RAM: Larger MemTables reduce flush frequency
- Network: Low-latency networks reduce replication overhead
Compaction
As SSTables accumulate, read performance degrades (you have to check more files) and disk space increases (duplicate and deleted data). Compaction solves this by merging SSTables and removing deleted or overwritten data.
Compaction Strategies
SizeTieredCompactionStrategy (STCS)
- Default strategy
- Groups SSTables of similar size and merges them
- Good for write-heavy workloads
- Can create large temporary disk spikes (requires 50% free space)
LeveledCompactionStrategy (LCS)
- Organizes SSTables into levels of increasing size
- Each level is 10x larger than the previous
- Better read performance (fewer SSTables per read)
- More I/O intensive (more frequent compactions)
- Good for read-heavy workloads
TimeWindowCompactionStrategy (TWCS)
- Designed for time-series data
- Groups SSTables by time window (e.g., daily, hourly)
- Old windows are never compacted with newer ones
- Perfect for time-series data with TTL
- Very efficient for workloads where old data expires
Impact on Write Performance
Compaction runs in the background but competes for I/O resources. Heavy compaction can temporarily impact write performance. You can tune compaction with:
# In cassandra.yaml
compaction_throughput_mb_per_sec: 16 # Limit compaction I/O
concurrent_compactors: 2 # Number of parallel compactions
Multi-Datacenter Replication
Cassandra deployments often span multiple datacenters for disaster recovery and geographical distribution. This is how replication works across datacenters.
NetworkTopologyStrategy Configuration
CREATE KEYSPACE global_app WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3,
    'us-west': 3,
    'eu-west': 2
};
This creates:
- 3 replicas in us-east datacenter
- 3 replicas in us-west datacenter
- 2 replicas in eu-west datacenter
- Total of 8 replicas globally
Cross-Datacenter Write Flow
When a write arrives at a coordinator in us-east with consistency level LOCAL_QUORUM:
- The coordinator determines all 8 replica nodes
- 3 in us-east
- 3 in us-west
- 2 in eu-west
- Coordinator immediately forwards the write to all 8 nodes in parallel
- Coordinator waits for 2 acknowledgments from us-east replicas (LOCAL_QUORUM for RF=3)
- Coordinator responds to client (~5ms total)
- Acknowledgments from us-west and eu-west arrive later (~50-200ms depending on distance)
- All replicas now have the data
Cross-datacenter replication is still synchronous (the write is sent immediately), but LOCAL_QUORUM doesn’t wait for remote datacenters. This provides consistency within a datacenter and eventual consistency across datacenters.
Consistency Level Choice in Multi-DC
LOCAL_QUORUM: Most common for production
- Fast (doesn’t wait for remote DCs)
- Consistent within the local DC
- Eventually consistent across DCs
- Can tolerate an entire DC failure
EACH_QUORUM: Stronger consistency
- Waits for QUORUM in every DC
- Much higher latency (wait for all DCs)
- Guarantees strong consistency globally
- Use when you need immediate global consistency
Failure Scenarios and Handling
Scenario 1: Single Node Failure (RF=3, CL=QUORUM)
Cluster with nodes A, B, C, D. Node C fails. RF=3, and the partition being written has replicas on A, B, and C.
Write behavior:
- Coordinator sends write to A, B, C
- A and B respond (QUORUM=2 satisfied)
- C doesn’t respond
- Write succeeds
- Hint stored for C
No impact on write availability. Writes continue successfully.
Scenario 2: Multiple Node Failures (RF=3, CL=QUORUM)
Same cluster, nodes B and C both fail.
Write behavior:
- Coordinator sends write to A, B, C
- Only A responds
- QUORUM not satisfied (need 2, have 1)
- Write fails with Unavailable exception
Write availability is lost for partitions replicated to both B and C (e.g., a replica set of A, B, C). Partitions whose replica set contains only one of the failed nodes (e.g., A, C, D) still have two live replicas and continue accepting QUORUM writes.
Scenario 3: Network Partition
Cluster split into two groups: {A, B} and {C, D}. RF=3, CL=QUORUM.
Write behavior (for a partition whose replicas are A, B, and C):
- Group {A, B}: holds 2 of the partition's 3 replicas → QUORUM reachable → Writes succeed (a hint is stored for C)
- Group {C, D}: holds only 1 of the 3 replicas → QUORUM not reachable → Writes fail
- Result: each side can accept QUORUM writes only for partitions where it holds a majority of the replicas, so write availability is lost for the rest
When the network heals, read repair and explicit repair (nodetool repair) restore consistency.
Scenario 4: Datacenter Failure (Multi-DC)
Setup: Three datacenters: US (RF=3), EU (RF=3), ASIA (RF=2). US datacenter goes offline.
Write behavior with LOCAL_QUORUM in EU:
- Coordinator in EU
- Sends writes to EU replicas
- Gets QUORUM from EU replicas
- Write succeeds
No impact on EU writes. US writes fail. ASIA applications can continue if they write to EU or ASIA with LOCAL_QUORUM.
Common Pitfalls
Using ALL Consistency Level
Problem: Sacrifices availability for consistency. Any single node failure causes writes to fail.
Solution: Use QUORUM or LOCAL_QUORUM instead. They provide strong consistency while tolerating failures.
Large Batches
Problem: Batching unrelated writes hurts performance and can overwhelm coordinators.
-- BAD: Batching writes to different partitions
BEGIN BATCH
INSERT INTO users (user_id, name) VALUES (uuid1, 'Alice');
INSERT INTO products (product_id, name) VALUES (uuid2, 'Widget');
INSERT INTO orders (order_id, total) VALUES (uuid3, 100);
APPLY BATCH;
Solution: Only batch writes to the same partition, or use unlogged batches when atomicity isn’t needed.
Using Client Timestamps Inconsistently
Problem: Mixing client-side and server-side timestamps leads to unpredictable conflict resolution.
Solution: Either always use USING TIMESTAMP or never use it. Be consistent across your application.
Ignoring NTP
Problem: Clock drift causes last-write-wins to produce unexpected results.
Solution: Keep NTP running on all nodes with drift under 500ms. Monitor clock sync regularly.
Writing to Hot Partitions
Problem: Repeatedly updating the same partition creates a bottleneck.
-- BAD: Global counter
UPDATE stats SET page_views = page_views + 1 WHERE stat_name = 'homepage';
Solution: Distribute load across partitions:
-- GOOD: Sharded counter
UPDATE stats
SET page_views = page_views + 1
WHERE stat_name = 'homepage' AND shard = 0; -- Pick shard 0-99 randomly
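On the application side, a shard is picked at write time and all shards are summed at read time. A sketch with the Python driver, assuming a hypothetical table like CREATE TABLE stats (stat_name text, shard int, page_views counter, PRIMARY KEY ((stat_name, shard))) and 100 shards:
import random

from cassandra.cluster import Cluster

NUM_SHARDS = 100

cluster = Cluster(['10.0.0.1'])
session = cluster.connect('my_keyspace')

increment = session.prepare(
    "UPDATE stats SET page_views = page_views + 1 WHERE stat_name = ? AND shard = ?"
)
read_shard = session.prepare(
    "SELECT page_views FROM stats WHERE stat_name = ? AND shard = ?"
)

def record_page_view(stat_name: str) -> None:
    # Spread increments over NUM_SHARDS partitions so no single partition is hot.
    session.execute(increment, (stat_name, random.randrange(NUM_SHARDS)))

def total_page_views(stat_name: str) -> int:
    # Reads pay the cost: they have to sum across every shard.
    total = 0
    for shard in range(NUM_SHARDS):
        row = session.execute(read_shard, (stat_name, shard)).one()
        if row is not None:
            total += row.page_views
    return total

record_page_view('homepage')
print(total_page_views('homepage'))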
Best Practices
Choose the Right Consistency Level
- Write-heavy: Use ONE or LOCAL_ONE for maximum throughput
- Balanced: Use LOCAL_QUORUM (most common)
- Read-heavy requiring strong consistency: Use QUORUM or EACH_QUORUM
Design for Distribution
Ensure your partition keys distribute data evenly:
-- BAD: All events for a user in one partition (can grow huge)
CREATE TABLE user_events (
    user_id uuid,
    timestamp timestamp,
    event_type text,
    PRIMARY KEY (user_id, timestamp)
);
-- GOOD: Partition by user and date bucket
CREATE TABLE user_events (
    user_id uuid,
    date_bucket text, -- e.g., "2025-01-15"
    timestamp timestamp,
    event_type text,
    PRIMARY KEY ((user_id, date_bucket), timestamp)
);
Use TTL for Time-Series Data
Instead of manually deleting old data, use TTL:
INSERT INTO sensor_data (sensor_id, timestamp, value)
VALUES (?, ?, ?)
USING TTL 86400; -- Expire after 24 hours
This is much more efficient than DELETE statements.
Monitor and Repair
Run regular repairs to ensure replicas stay synchronized:
# Incremental repair (the default since Cassandra 2.2)
nodetool repair
# Full repair (more thorough, but more expensive)
nodetool repair --full
Schedule repairs weekly or monthly, depending on your workload.
Footnotes
Cassandra uses a Log-Structured Merge (LSM) tree to hold the data, where writes are first persisted to an append-only CommitLog before being written to an in-memory MemTable.
MemTables are periodically flushed to disk as immutable SSTables, and background compaction merges them to optimize read performance. This classic Write-Ahead Logging (WAL) mechanism ensures durability and high write throughput via sequential disk I/O.
The coordinator node also takes care of writing to all the replicas. If the write to one of the replicas fails or times out, hinted handoff is used to repair it and maintain eventual consistency. Cassandra offers quorum-based, tunable consistency levels, allowing trade-offs between consistency and availability for both reads and writes.