Apache Cassandra is a distributed database designed for high availability and horizontal scalability. This write-up explores the complete write path in Cassandra, from the moment a client sends a write request to how data gets replicated across nodes in the cluster.
When a client writes data to Cassandra, the data flows through multiple stages before being safely stored. The write path involves several key components working together to ensure durability, consistency, and performance. Let’s break this down…
Client Connection and Coordinator Selection
Every write request in Cassandra starts with a client connecting to any node in the cluster. The node that receives the client request becomes the coordinator for that operation. This is an important concept: there is no “master” node in Cassandra—any node can coordinate any request.
The coordinator’s role is to:
- Accept the write request from the client
- Determine which nodes should store replicas of the data
- Forward the write to those replica nodes
- Wait for acknowledgments based on the consistency level
- Respond to the client
The coordinator doesn’t necessarily store the data itself, though it might be one of the replica nodes depending on the partition key and replication strategy.
Determining Replica Nodes
Before the coordinator can forward writes, it needs to determine which nodes should store the data. This decision is based on three key factors.
Partition Key and Token
Every piece of data in Cassandra has a partition key. This key is hashed using a partitioner (typically Murmur3Partitioner) to produce a token—a 64-bit integer that determines where the data lives in the cluster.
Partition Key → Hash Function → Token → Node(s)
For example, if you have a table storing user data:
CREATE TABLE users (
    user_id uuid PRIMARY KEY,
    name text,
    email text
);
When you insert a user with user_id = '550e8400-e29b-41d4-a716-446655440000', Cassandra hashes this UUID to produce a token. This token falls within a range owned by specific nodes in the cluster.
Token Ring and Consistent Hashing
Cassandra organizes nodes in a logical token ring. The entire token space (from -2^63 to 2^63-1 for Murmur3) is divided among the nodes in the cluster. Each node is assigned a token range and is responsible for storing data whose tokens fall within that range.
When you add or remove nodes, the token ranges are redistributed, but only a fraction of the data needs to move—this is the beauty of consistent hashing.
Replication Strategy
The replication strategy determines how many copies of the data exist and where they’re placed. There are two main strategies:
SimpleStrategy: Used for single data center deployments. It places replicas on consecutive nodes in the ring. For example, with a replication factor of 3, the data is stored on the first node determined by the token, plus the next two nodes clockwise in the ring.
NetworkTopologyStrategy: Used for multi-data center deployments. It allows you to specify how many replicas should exist in each data center. For example:
CREATE KEYSPACE my_keyspace WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'datacenter1': 3,
    'datacenter2': 2
};
This creates 3 replicas in datacenter1 and 2 in datacenter2. Within each data center, replicas are placed on different racks when possible to maximize availability.
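To tie the token and placement ideas together before moving inside a node, here is a toy, self-contained sketch of SimpleStrategy-style placement. It is illustrative only: Python's md5 stands in for Murmur3Partitioner (so real token values will differ), and the node names and token assignments are invented for the example.
import bisect
import hashlib

# Toy ring: each entry is (token owned by the node, node name). Values are made up.
RING = sorted([
    (-6_000_000_000_000_000_000, "node1"),
    (-1_500_000_000_000_000_000, "node2"),
    (2_000_000_000_000_000_000, "node3"),
    (7_000_000_000_000_000_000, "node4"),
])
TOKENS = [token for token, _ in RING]

def token_for(partition_key: str) -> int:
    # Stand-in for Murmur3Partitioner: map the key to a signed 64-bit token.
    digest = hashlib.md5(partition_key.encode()).digest()
    return int.from_bytes(digest[:8], "big", signed=True)

def replicas_for(partition_key: str, rf: int = 3) -> list[str]:
    # SimpleStrategy-style placement: the node owning the token's range,
    # plus the next rf-1 nodes walking clockwise around the ring.
    token = token_for(partition_key)
    start = bisect.bisect_left(TOKENS, token) % len(RING)
    return [RING[(start + i) % len(RING)][1] for i in range(rf)]

print(replicas_for("550e8400-e29b-41d4-a716-446655440000"))
# Prints three node names; which ones depends entirely on the toy hash above.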
The Write Path Within a Node
Once the coordinator determines which nodes should receive the write, it forwards the mutation to those nodes. Now let’s see what happens inside each replica node when it receives a write request.
CommitLog
The very first thing that happens when a node receives a write is that it appends the mutation to the CommitLog. This is a critical step for durability.
The CommitLog is an append-only log file on disk. It’s structured as a sequential write, which is extremely fast—modern SSDs can handle hundreds of thousands of sequential writes per second. The CommitLog entry contains:
- The keyspace and table name
- The partition key
- The clustering keys (if any)
- The column values
- Timestamp and TTL information
By writing to the CommitLog first, Cassandra ensures that even if the node crashes immediately after, the write can be recovered when the node restarts. This is the Write-Ahead Log (WAL) pattern.
CommitLog configuration considerations:
# In cassandra.yaml
commitlog_sync: batch
commitlog_sync_batch_window_in_ms: 2
commitlog_segment_size_in_mb: 32
commitlog_compression:
- class_name: LZ4Compressor
The commitlog_sync parameter has two modes:
- periodic (the default): the CommitLog is fsynced every commitlog_sync_period_in_ms (10 seconds by default). Writes are acknowledged before the sync, so a crash can lose up to that window of acknowledged writes on that node. Faster, but less durable per node.
- batch: a write is not acknowledged until the CommitLog has been fsynced; writes arriving within the batch window (default 2ms) are grouped into a single sync. More durable, but adds latency to every write.
Most deployments keep the default periodic mode and lean on replication for durability; batch mode is worth the extra latency only when single-node durability is critical.
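The behaviour above is the classic write-ahead-log pattern: append the mutation to a sequential file, fsync according to the sync mode, and only then rely on in-memory state. A minimal sketch of that idea (not Cassandra's actual implementation; the JSON-per-line format and the class are invented for illustration):
import json
import os

class TinyCommitLog:
    """Append-only log illustrating the WAL pattern and the two sync modes."""

    def __init__(self, path: str, sync_mode: str = "periodic"):
        self.segment = open(path, "ab")
        self.sync_mode = sync_mode  # "batch" fsyncs before the write is acknowledged

    def append(self, mutation: dict) -> None:
        # Sequential append only; no random I/O, which is what makes it fast.
        self.segment.write(json.dumps(mutation).encode() + b"\n")
        if self.sync_mode == "batch":
            self.sync()  # don't acknowledge the write until it is on disk

    def sync(self) -> None:
        # In periodic mode a background timer would call this every
        # commitlog_sync_period_in_ms instead.
        self.segment.flush()
        os.fsync(self.segment.fileno())

log = TinyCommitLog("/tmp/commitlog-segment.log", sync_mode="batch")
log.append({"table": "users", "key": "user_123", "name": "Alice", "ts": 1705320000000000})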
MemTable
Immediately after writing to the CommitLog, the mutation is written to an in-memory structure called a MemTable. Each table in Cassandra has its own MemTable.
The MemTable is essentially a sorted map structure (similar to a Red-Black tree or Skip List) that keeps data sorted by partition key and clustering columns. This sorting is crucial for efficient reads and for the later flush to disk.
Multiple writes to the same partition are merged into the same MemTable. Conceptually, though, Cassandra never updates data in place: every write is a new timestamped cell. When you "update" a column, you are really writing a new cell with a newer timestamp; within a MemTable the newer cell supersedes the older one, and versions that end up spread across MemTables and SSTables are reconciled at read time (and during compaction).
Example of MemTable organization:
Partition Key: user_123
├─ Clustering: timestamp=2025-01-15T10:00:00, value="xxxx"
├─ Clustering: timestamp=2025-01-15T10:05:00, value="xxxxxxxx"
└─ Clustering: timestamp=2025-01-15T10:10:00, value="xxxxxx"
Partition Key: user_456
├─ Clustering: timestamp=2025-01-15T10:02:00, value="xxxxxxxxxx"
└─ Clustering: timestamp=2025-01-15T10:12:00, value="xxxxxxxxx"
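A rough model of that structure, using a plain Python dict and sorting only when we need to iterate; real MemTables use concurrent sorted structures and a much richer cell format, so treat this purely as a sketch:
class TinyMemTable:
    """Maps (partition_key, clustering_key, column) -> (value, timestamp)."""

    def __init__(self):
        self.cells = {}

    def write(self, partition_key, clustering_key, column, value, timestamp):
        key = (partition_key, clustering_key, column)
        current = self.cells.get(key)
        # Within the MemTable, the cell with the higher timestamp supersedes the older one.
        if current is None or timestamp >= current[1]:
            self.cells[key] = (value, timestamp)

    def sorted_cells(self):
        # Emit cells ordered by partition key, then clustering key, then column --
        # exactly the order an SSTable flush needs.
        return sorted(self.cells.items())

mt = TinyMemTable()
mt.write("user_123", "2025-01-15T10:00:00", "value", "xxxx", timestamp=1)
mt.write("user_123", "2025-01-15T10:00:00", "value", "yyyy", timestamp=2)  # newer version wins
print(mt.sorted_cells())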
Write Response
Once the write is in the CommitLog and MemTable, the node sends an acknowledgment back to the coordinator. This happens very quickly, typically in microseconds, because it only involves an append to the CommitLog and an in-memory update.
At this point, the write is considered successful from the replica node’s perspective, but the coordinator hasn’t responded to the client yet. That depends on the consistency level.
Flushing MemTables to SSTables
The MemTable is memory-bound, so it can’t grow indefinitely. When a MemTable reaches a certain size threshold or after a certain time period, it’s flushed to disk as an SSTable (Sorted String Table).
The flush process:
- The MemTable is marked as immutable (no new writes)
- A new MemTable is created for incoming writes
- The immutable MemTable is sorted and written to disk as an SSTable
- The corresponding CommitLog segments are marked for deletion
SSTables are immutable once written. This immutability provides several benefits:
- No need for read-write locks
- Simplified backup and recovery
- Efficient sequential disk access
- Easy to distribute across nodes
However, immutability means we accumulate multiple SSTables over time, which is why compaction is necessary.
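To make the flush steps concrete, here is a toy continuation of the MemTable sketch above (it reuses TinyMemTable): the old MemTable is treated as immutable, its cells are written out in sorted order, and only then can the covering CommitLog segments be recycled. The one-JSON-object-per-line file format is invented; real SSTables are a multi-file binary format with indexes, summaries, and bloom filters.
import json

def flush_memtable(memtable, sstable_path):
    # Steps 1-2 happen in the caller: the MemTable passed in is no longer
    # accepting writes, and a fresh MemTable has already replaced it.
    with open(sstable_path, "w") as sstable:
        # Step 3: write the cells in sorted order -- this is what makes the
        # result a *Sorted* String Table.
        for key, (value, timestamp) in memtable.sorted_cells():
            sstable.write(json.dumps({"key": key, "value": value, "ts": timestamp}) + "\n")
    # Step 4: at this point Cassandra can mark the CommitLog segments that
    # covered this MemTable as eligible for recycling.

flush_memtable(mt, "/tmp/users-1-toy-Data.json")  # mt from the MemTable sketch above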
Consistency Levels
The consistency level determines how many replica nodes must acknowledge a write before the coordinator responds to the client. This is configurable per query and represents a fundamental trade-off between consistency, availability, and latency.
ONE
The coordinator waits for acknowledgment from just one replica node. This provides the lowest latency and highest availability - the write succeeds as long as any single replica node is reachable. However, you might read stale data if you subsequently read from a node that hasn’t received the write yet.
QUORUM
The coordinator waits for acknowledgments from a majority of replica nodes. For a replication factor of 3, QUORUM means 2 nodes must acknowledge. This is the most commonly used level in production because it provides a good balance:
- QUORUM = floor(replication_factor / 2) + 1
- Guarantees that reads and writes overlap (if you read with QUORUM, you’ll see the most recent QUORUM write)
- Can tolerate the failure of a minority of replicas
LOCAL_QUORUM
Similar to QUORUM, but only counts replicas in the local data center. This is crucial for multi-datacenter deployments because you don’t want to wait for cross-datacenter network latency on every write.
ALL
The coordinator waits for all replica nodes to acknowledge. This provides the strongest consistency but has the worst availability; if even one replica is down, writes fail. Generally not recommended for production.
ANY
This is an interesting case. The write is considered successful if it can be written to at least one node, or if it can be stored as a hint for a temporarily unavailable node. More on hints below.
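All of these levels reduce to how many acknowledgments the coordinator waits for. A small helper to make the arithmetic explicit (single-datacenter view; LOCAL_QUORUM applies the same formula to the local datacenter's replicas only, and ANY is approximated here because a stored hint can also satisfy it):
def acks_required(consistency_level: str, replication_factor: int) -> int:
    """How many replica acknowledgments the coordinator waits for."""
    if consistency_level in ("ANY", "ONE"):
        return 1  # for ANY, a hint stored for a down replica also counts
    if consistency_level in ("QUORUM", "LOCAL_QUORUM"):
        return replication_factor // 2 + 1
    if consistency_level == "ALL":
        return replication_factor
    raise ValueError(f"unknown consistency level: {consistency_level}")

for level in ("ONE", "QUORUM", "ALL"):
    print(level, acks_required(level, replication_factor=3))  # 1, 2, 3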
Here’s an Example
Consider a cluster with a replication factor of 3, and you issue a write with consistency level QUORUM:
# Using the Python driver
import uuid

from cassandra.cluster import Cluster
from cassandra import ConsistencyLevel

cluster = Cluster(['10.0.0.1'])
session = cluster.connect('my_keyspace')

# Prepare a statement with QUORUM consistency
statement = session.prepare("""
    INSERT INTO users (user_id, name, email)
    VALUES (?, ?, ?)
""")
statement.consistency_level = ConsistencyLevel.QUORUM

# Execute the write
user_id = uuid.uuid4()
session.execute(statement, (user_id, "Alice", "alice@example.com"))
Timeline:
- Client sends write to Node A (coordinator)
- Node A determines replicas: Nodes B, C, and D
- Node A forwards write to Nodes B, C, and D
- Within 1-2ms, Nodes B and C respond (they’ve written to CommitLog and MemTable)
- Node A has 2 acknowledgments (QUORUM satisfied)
- Node A responds success to the client
- Node D responds later (but the coordinator already replied to the client)
The entire operation typically completes in 2-5ms, depending on network latency and load.
Data Replication
Now let’s dive deeper into how data actually gets replicated across nodes. The replication happens synchronously as part of the write operation; there’s no separate background replication process for normal writes.
When the coordinator receives a write, it immediately forwards the mutation to all replica nodes (as determined by the replication strategy and replication factor). These writes are sent in parallel. The coordinator waits only for as many acknowledgments as the consistency level requires, but the mutation is still sent to every replica right away.
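A simplified sketch of that coordinator behaviour using asyncio: the mutation is sent to every replica at once, the client is answered as soon as enough acknowledgments have arrived, and the remaining replica writes are simply allowed to finish. send_to_replica is a stand-in for real inter-node messaging, with simulated latency.
import asyncio
import random

async def send_to_replica(replica: str, mutation: dict) -> str:
    # Stand-in for the real inter-node write; latency is simulated.
    await asyncio.sleep(random.uniform(0.001, 0.01))
    return replica

async def coordinate_write(mutation: dict, replicas: list[str], acks_needed: int) -> None:
    # Forward to ALL replicas in parallel, regardless of the consistency level.
    pending = [asyncio.ensure_future(send_to_replica(r, mutation)) for r in replicas]
    acked = 0
    while pending and acked < acks_needed:
        done, pending = await asyncio.wait(pending, return_when=asyncio.FIRST_COMPLETED)
        acked += len(done)
    print(f"responding to client after {acked} acknowledgments")
    # The slower replicas still receive the write; we just no longer wait on them
    # before answering the client. (Here we let them finish so the toy exits cleanly.)
    if pending:
        await asyncio.wait(pending)

asyncio.run(coordinate_write({"key": "user_123"}, ["node_b", "node_c", "node_d"], acks_needed=2))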
Here’s what actually gets sent over the network:
Mutation Message:
- Keyspace: "my_keyspace"
- Table: "users"
- Partition Key: user_id = UUID('...')
- Timestamp: 1705320000000000 (microseconds since epoch)
- Columns:
- name: "Alice"
- email: "alice@example.com"
- TTL: null
Each replica node processes this mutation independently through its own CommitLog → MemTable path.
Timestamps and Conflict Resolution
Cassandra uses last-write-wins (LWW) conflict resolution based on timestamps. Every mutation includes a timestamp (in microseconds), and when multiple versions of the same data exist, the one with the highest timestamp wins.
Timestamps can come from two sources:
- Client-provided timestamps: Using the USING TIMESTAMP clause:
INSERT INTO users (user_id, name) VALUES (?, ?)
USING TIMESTAMP 1705320000000000;
- Coordinator-generated timestamps: If the client doesn’t provide one, the coordinator assigns a timestamp based on its local clock.
If your cluster's clocks are not synchronized (via NTP), you can get unexpected results. A write issued later in real time can be silently lost if its coordinator's clock is behind: it gets timestamp 999 while the earlier write got timestamp 1000, so last-write-wins keeps the older value.
Hence, always use NTP to keep clocks synchronized across your cluster, with clock drift kept under 500ms.
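A tiny illustration of the failure mode, assuming two versions of the same cell: last-write-wins compares only timestamps, so an update stamped by a lagging clock loses even though it happened later in real time.
def reconcile(cell_a: dict, cell_b: dict) -> dict:
    # Last-write-wins: the cell with the higher timestamp is kept.
    return cell_a if cell_a["ts"] >= cell_b["ts"] else cell_b

first_update = {"value": "alice@old.example.com", "ts": 1000}   # coordinator with a fast clock
second_update = {"value": "alice@new.example.com", "ts": 999}   # issued later, lagging clock

print(reconcile(first_update, second_update))
# {'value': 'alice@old.example.com', 'ts': 1000} -- the newer update is silently lost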
Handling Network Partitions
In a distributed system, network partitions are inevitable. Cassandra handles these gracefully through its consistency model and hinted handoff mechanism.
Suppose we have RF=3 (replication factor), CL=QUORUM, and Node C goes down:
- Coordinator forwards write to Nodes A, B, and C
- Nodes A and B respond successfully
- Node C doesn’t respond (it’s down)
- Coordinator has 2 responses = QUORUM satisfied
- Write succeeds from the client’s perspective
- Coordinator stores a hint for Node C
Hinted Handoff
A hint is a special write that says, “when Node C comes back online, replay this mutation to it.” Hints are stored on the coordinator (or other available nodes) in a system table.
Hint structure:
- Target node: C
- Mutation: INSERT INTO users (user_id, name) VALUES (...)
- Expiration: 3 hours (configurable via max_hint_window_in_ms)
When Node C comes back online, other nodes detect it and start replaying hints to it. This brings Node C up to date with the writes it missed while down.
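A rough sketch of that bookkeeping (Cassandra's hint storage, failure detection, and replay machinery are far more involved; this just shows the idea of stashing a mutation with an expiry and replaying it when the target returns):
import time

MAX_HINT_WINDOW_SECONDS = 3 * 3600  # mirrors max_hint_window_in_ms (default 3 hours)
hints = []  # in Cassandra, hints are persisted locally, not kept in memory like this

def store_hint(target_node: str, mutation: dict) -> None:
    hints.append({"target": target_node, "mutation": mutation, "stored_at": time.time()})

def replay_hints(recovered_node: str, send_mutation) -> None:
    now = time.time()
    for hint in list(hints):
        if hint["target"] != recovered_node:
            continue
        hints.remove(hint)
        if now - hint["stored_at"] > MAX_HINT_WINDOW_SECONDS:
            continue  # expired: only an explicit repair can fix this replica now
        send_mutation(hint["mutation"])  # deliver the missed write to the recovered node

store_hint("node_c", {"table": "users", "key": "user_123", "name": "Alice"})
replay_hints("node_c", send_mutation=print)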
Important limitations of hints:
- Hints are best-effort, not guaranteed
- If a node is down for longer than max_hint_window_in_ms (default 3 hours), hints expire
- Hints increase the load on the coordinator
- For extended outages, run a repair instead
Read Repair
Even with a hinted handoff, replicas can get out of sync. Cassandra uses read repair to detect and fix inconsistencies during read operations.
When you read with a consistency level higher than ONE, the coordinator sends read requests to multiple replicas and compares their responses. If it finds differences, it:
- Returns the most recent data to the client (highest timestamp wins)
- Sends the most recent data to replicas with stale data
- Those replicas update themselves in the background
This happens transparently and helps maintain eventual consistency. Older versions also exposed a read_repair_chance table option for probabilistic background read repair:
ALTER TABLE users WITH read_repair_chance = 0.1;
This told Cassandra to perform read repair on 10% of reads, even when the consistency level was ONE. The option was deprecated in the 3.x line and removed in Cassandra 4.0; modern clusters rely on the blocking read repair described above plus explicit repair operations.
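Mechanically, blocking read repair boils down to "compare timestamps, return the winner, push it to whoever was stale". A sketch with made-up replica responses:
def read_repair(responses: dict) -> tuple:
    """responses: {replica_name: {"value": ..., "ts": ...}} gathered for one read."""
    # The freshest version (highest timestamp) is what the client receives.
    winner = max(responses.values(), key=lambda cell: cell["ts"])
    # Replicas that returned something older get the winning cell written back.
    stale = [name for name, cell in responses.items() if cell["ts"] < winner["ts"]]
    return winner, stale

winner, stale_replicas = read_repair({
    "node_a": {"value": "alice@new.example.com", "ts": 1705320005000000},
    "node_b": {"value": "alice@old.example.com", "ts": 1705320000000000},
})
print(winner["value"], "-> repair written to", stale_replicas)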
Write Performance Characteristics
Write Throughput
Cassandra is write-optimized. A single node can handle 10,000-50,000 writes per second, depending on:
- Hardware (SSD vs HDD, CPU cores, RAM)
- Data model (wide vs narrow partitions)
- Consistency level (ONE vs QUORUM vs ALL)
- Batch size (single writes vs batched writes)
The write path is designed for speed:
- Sequential CommitLog writes (no random I/O)
- In-memory MemTable updates (microsecond-scale latency)
- Parallel replication (no waiting for sequential replication)
Write Latency
Typical write latencies (p99):
- Consistency Level ONE: 1-3ms
- Consistency Level QUORUM: 2-5ms
- Consistency Level LOCAL_QUORUM: 2-5ms (within a datacenter)
- Consistency Level ALL: 5-20ms (depends on slowest node)
These are remarkably consistent because writes don’t involve disk reads—only sequential appends to the CommitLog.
Factors Affecting Write Performance
- Consistency Level
Higher consistency levels increase latency because the coordinator must wait for more acknowledgments.
- Batch Writes
Cassandra supports batches, but they’re not always a performance win:
BEGIN BATCH
INSERT INTO users (user_id, name) VALUES (uuid1, 'Alice');
INSERT INTO users (user_id, name) VALUES (uuid2, 'Bob');
INSERT INTO users (user_id, name) VALUES (uuid3, 'Charlie');
APPLY BATCH;
Logged batches (default) use a batch log for atomicity, which adds overhead. They’re useful for ensuring multiple writes succeed or fail together, but they’re slower than individual writes.
Unlogged batches skip the batch log and are faster, but they don’t guarantee atomicity. They’re only useful for performance when writing to the same partition:
BEGIN UNLOGGED BATCH
INSERT INTO user_events (user_id, timestamp, event) VALUES (uuid1, t1, 'login');
INSERT INTO user_events (user_id, timestamp, event) VALUES (uuid1, t2, 'purchase');
INSERT INTO user_events (user_id, timestamp, event) VALUES (uuid1, t3, 'logout');
APPLY BATCH;
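With the Python driver, the same-partition unlogged batch looks roughly like this. It assumes the user_events table from the statements above and a reachable cluster at 10.0.0.1; BatchStatement and BatchType are part of the driver's public API.
import uuid
from datetime import datetime, timedelta

from cassandra.cluster import Cluster
from cassandra.query import BatchStatement, BatchType

cluster = Cluster(['10.0.0.1'])
session = cluster.connect('my_keyspace')

insert_event = session.prepare(
    "INSERT INTO user_events (user_id, timestamp, event) VALUES (?, ?, ?)"
)

user_id = uuid.uuid4()
start = datetime.utcnow()

# Unlogged batch: no batch log, so no atomicity guarantee -- it is only a win
# because every statement targets the same partition (the same user_id).
batch = BatchStatement(batch_type=BatchType.UNLOGGED)
batch.add(insert_event, (user_id, start, 'login'))
batch.add(insert_event, (user_id, start + timedelta(minutes=5), 'purchase'))
batch.add(insert_event, (user_id, start + timedelta(minutes=10), 'logout'))
session.execute(batch)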
- Partition Size
Writing to the same partition repeatedly can create “hot spots.” Cassandra performs best when writes are distributed across many partitions. If you have a counter table that updates the same partition millions of times, you’ll experience performance degradation.
- Hardware
- SSD vs HDD: SSDs provide much better write performance (CommitLog especially)
- RAM: Larger MemTables reduce flush frequency
- Network: Low-latency networks reduce replication overhead
Compaction
As SSTables accumulate, read performance degrades (you have to check more files) and disk space increases (duplicate and deleted data). Compaction solves this by merging SSTables and removing deleted or overwritten data.
Compaction Strategies
SizeTieredCompactionStrategy (STCS)
- Default strategy
- Groups SSTables of similar size and merges them
- Good for write-heavy workloads
- Can create large temporary disk spikes (requires 50% free space)
LeveledCompactionStrategy (LCS)
- Organizes SSTables into levels of increasing size
- Each level is 10x larger than the previous
- Better read performance (fewer SSTables per read)
- More I/O intensive (more frequent compactions)
- Good for read-heavy workloads
TimeWindowCompactionStrategy (TWCS)
- Designed for time-series data
- Groups SSTables by time window (e.g., daily, hourly)
- Old windows are never compacted with newer ones
- Perfect for time-series data with TTL
- Very efficient for workloads where old data expires
Impact on Write Performance
Compaction runs in the background but competes for I/O resources. Heavy compaction can temporarily impact write performance. You can tune compaction with:
# In cassandra.yaml
compaction_throughput_mb_per_sec: 16 # Limit compaction I/O
concurrent_compactors: 2 # Number of parallel compactions
Multi-Datacenter Replication
Cassandra deployments often span multiple datacenters for disaster recovery and geographical distribution. This is how replication works across datacenters.
NetworkTopologyStrategy Configuration
CREATE KEYSPACE global_app WITH replication = {
    'class': 'NetworkTopologyStrategy',
    'us-east': 3,
    'us-west': 3,
    'eu-west': 2
};
This creates:
- 3 replicas in us-east datacenter
- 3 replicas in us-west datacenter
- 2 replicas in eu-west datacenter
- Total of 8 replicas globally
Cross-Datacenter Write Flow
When a write arrives at a coordinator in us-east with consistency level LOCAL_QUORUM:
- The coordinator determines all 8 replica nodes
- 3 in us-east
- 3 in us-west
- 2 in eu-west
- Coordinator immediately forwards the write to all 8 nodes in parallel
- Coordinator waits for 2 acknowledgments from us-east replicas (LOCAL_QUORUM for RF=3)
- Coordinator responds to client (~5ms total)
- Acknowledgments from us-west and eu-west arrive later (~50-200ms depending on distance)
- All replicas now have the data
Cross-datacenter replication is still synchronous (the write is sent immediately), but LOCAL_QUORUM doesn’t wait for remote datacenters. This provides consistency within a datacenter and eventual consistency across datacenters.
Consistency Level Choice in Multi-DC
LOCAL_QUORUM: Most common for production
- Fast (doesn’t wait for remote DCs)
- Consistent within the local DC
- Eventually consistent across DCs
- Can tolerate an entire DC failure
EACH_QUORUM: Stronger consistency
- Waits for QUORUM in every DC
- Much higher latency (wait for all DCs)
- Guarantees strong consistency globally
- Use when you need immediate global consistency
Failure Scenarios and Handling
Scenario 1: Single Node Failure (RF=3, CL=QUORUM)
Cluster with nodes A, B, C, D. Node C fails. RF=3, and the partition being written has replicas on A, B, and C.
Write behavior:
- Coordinator sends write to A, B, C
- A and B respond (QUORUM=2 satisfied)
- C doesn’t respond
- Write succeeds
- Hint stored for C
No impact on write availability. Writes continue successfully.
Scenario 2: Multiple Node Failures (RF=3, CL=QUORUM)
Same cluster, nodes B and C both fail.
Write behavior:
- Coordinator sends write to A, B, C
- Only A responds
- QUORUM not satisfied (need 2, have 1)
- Write fails with Unavailable exception
Write availability is lost for partitions replicated to both B and C (e.g., a replica set of A, B, C). Partitions whose replica set contains only one of the failed nodes (e.g., A, C, D) still have two live replicas and continue accepting QUORUM writes.
Scenario 3: Network Partition
Cluster split into two groups: {A, B} and {C, D}. RF=3, CL=QUORUM.
Write behavior (for a partition whose replicas are A, B, and C):
- Group {A, B}: holds 2 of the partition's 3 replicas → QUORUM reachable → Writes succeed (a hint is stored for C)
- Group {C, D}: holds only 1 of the 3 replicas → QUORUM not reachable → Writes fail
- Result: each side can accept QUORUM writes only for partitions where it holds a majority of the replicas, so write availability is lost for the rest
When the network heals, read repair and explicit repair (nodetool repair) restore consistency.
Scenario 4: Datacenter Failure (Multi-DC)
Setup: Three datacenters: US (RF=3), EU (RF=3), ASIA (RF=2). US datacenter goes offline.
Write behavior with LOCAL_QUORUM in EU:
- Coordinator in EU
- Sends writes to EU replicas
- Gets QUORUM from EU replicas
- Write succeeds
No impact on EU writes. US writes fail. ASIA applications can continue if they write to EU or ASIA with LOCAL_QUORUM.
Common Pitfalls
Using ALL Consistency Level
Problem: Sacrifices availability for consistency. Any single node failure causes writes to fail.
Solution: Use QUORUM or LOCAL_QUORUM instead. They provide strong consistency while tolerating failures.
Large Batches
Problem: Batching unrelated writes hurts performance and can overwhelm coordinators.
-- BAD: Batching writes to different partitions
BEGIN BATCH
INSERT INTO users (user_id, name) VALUES (uuid1, 'Alice');
INSERT INTO products (product_id, name) VALUES (uuid2, 'Widget');
INSERT INTO orders (order_id, total) VALUES (uuid3, 100);
APPLY BATCH;
Solution: Only batch writes to the same partition, or use unlogged batches when atomicity isn’t needed.
Using Client Timestamps Inconsistently
Problem: Mixing client-side and server-side timestamps leads to unpredictable conflict resolution.
Solution: Either always use USING TIMESTAMP or never use it. Be consistent across your application.
Ignoring NTP
Problem: Clock drift causes last-write-wins to produce unexpected results.
Solution: Keep NTP running on all nodes with drift under 500ms. Monitor clock sync regularly.
Writing to Hot Partitions
Problem: Repeatedly updating the same partition creates a bottleneck.
-- BAD: Global counter
UPDATE stats SET page_views = page_views + 1 WHERE stat_name = 'homepage';
Solution: Distribute load across partitions:
-- GOOD: Sharded counter
UPDATE stats
SET page_views = page_views + 1
WHERE stat_name = 'homepage' AND shard = 0; -- Pick shard 0-99 randomly
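On the application side, a shard is picked at write time and all shards are summed at read time. A sketch with the Python driver, assuming a hypothetical table like CREATE TABLE stats (stat_name text, shard int, page_views counter, PRIMARY KEY ((stat_name, shard))) and 100 shards:
import random

from cassandra.cluster import Cluster

NUM_SHARDS = 100

cluster = Cluster(['10.0.0.1'])
session = cluster.connect('my_keyspace')

increment = session.prepare(
    "UPDATE stats SET page_views = page_views + 1 WHERE stat_name = ? AND shard = ?"
)
read_shard = session.prepare(
    "SELECT page_views FROM stats WHERE stat_name = ? AND shard = ?"
)

def record_page_view(stat_name: str) -> None:
    # Spread increments over NUM_SHARDS partitions so no single partition is hot.
    session.execute(increment, (stat_name, random.randrange(NUM_SHARDS)))

def total_page_views(stat_name: str) -> int:
    # Reads pay the cost: they have to sum across every shard.
    total = 0
    for shard in range(NUM_SHARDS):
        row = session.execute(read_shard, (stat_name, shard)).one()
        if row is not None:
            total += row.page_views
    return total

record_page_view('homepage')
print(total_page_views('homepage'))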
Best Practices
Choose the Right Consistency Level
- Write-heavy: Use ONE or LOCAL_ONE for maximum throughput
- Balanced: Use LOCAL_QUORUM (most common)
- Read-heavy requiring strong consistency: Use QUORUM or EACH_QUORUM
Design for Distribution
Ensure your partition keys distribute data evenly:
-- BAD: All events for a user in one partition (can grow huge)
CREATE TABLE user_events (
    user_id uuid,
    timestamp timestamp,
    event_type text,
    PRIMARY KEY (user_id, timestamp)
);
-- GOOD: Partition by user and date bucket
CREATE TABLE user_events (
    user_id uuid,
    date_bucket text, -- e.g., "2025-01-15"
    timestamp timestamp,
    event_type text,
    PRIMARY KEY ((user_id, date_bucket), timestamp)
);
Use TTL for Time-Series Data
Instead of manually deleting old data, use TTL:
INSERT INTO sensor_data (sensor_id, timestamp, value)
VALUES (?, ?, ?)
USING TTL 86400; -- Expire after 24 hours
This is much more efficient than DELETE statements.
Monitor and Repair
Run regular repairs to ensure replicas stay synchronized:
# Incremental repair (the default since Cassandra 2.2)
nodetool repair
# Full repair (more thorough, but more expensive)
nodetool repair --full
Schedule repairs weekly or monthly, depending on your workload.
Footnotes
Cassandra uses a Log-Structured Merge (LSM) tree to hold the data, where writes are first persisted to an append-only CommitLog before being written to an in-memory MemTable.
MemTables are periodically flushed to disk as immutable SSTables, and background compaction merges them to optimize read performance. This classic Write-Ahead Logging (WAL) mechanism ensures durability and high write throughput via sequential disk I/O.
The coordinator node also takes care of writing to all the replicas. If the write to one of the replicas fails or times out, hinted handoff is used to repair it and maintain eventual consistency. Cassandra offers quorum-based, tunable consistency levels, allowing trade-offs between consistency and availability for both reads and writes.