Published on 14th Feb 2021

7 min read

Registrations for the July 2021 System Design Cohort are now open Enroll Today →

Persistent Data Structures preserve previous versions of themselves allowing us to revisit and audit any historical version. While we already did an exhaustive introduction in the previous essay, in this essay, we take a detailed look into how we can implement the * Fully Persistent* variant of the most common data structure -

The essay is loosely based on the paper titled Fully Persistent Arrays for Efficient Incremental Updates and Voluminous Reads by Tyng-Ruey Chuang and we implement it using Backer's trick elaborated in the paper A Persistent Union-Find Data Structure.

An array is an abstract data structure consisting of `n`

slots to hold a maximum of `n`

elements such that each element is identified by at least one index. A typical array allows the following functions - `create(n)`

, `update(index, value)`

and `get(index)`

.

The simplest form of an array, that we all are familiar with, is the Linear Array that is designed to hold elements in consecutive memory locations, leveraging spatial locality for faster and more efficient retrievals and scans. Before we jump into the implementation details of Fully Persistent Arrays, let's reiterate what exactly are Fully Persistent Data Structures.

Persistent Data Structures preserve previous versions of themselves allowing us to revisit and audit any historical version. Fully Persistent Data Structures allows access and modification to all the historical versions as well. It does not restrict any modifications whatsoever. This means we can typically revisit any historical version of the data structure, modify it like we are forking out a new branch.

Fully Persistent Arrays are arrays that support Full Persistence which means it supports usual array operations while also allowing us to go back in time and make updates to any of the previous versions. We define the following operations on Fully Persistent Arrays -

`create(n)`

- returns an array of size`n`

having all the slots uninitialized`update(array, index, value)`

- returns a new array identical to`array`

except for the element at the position`index`

. The parent array`array`

remains unaffected and is still accessible.`get(array, index)`

- returns the element present at the index`index`

in array`array`

A naive way of implementing these arrays is to do a Copy-on-Write and keep track of historical versions. This approach very inefficient as it requires `m`

times the memory required to hold `n`

elements, where `m`

is the total number of versions of the array.

A better way of implementing these arrays is by using the Backer's Trick which enables the required functionality with just one array and a tree of modifications.

A more efficient way of implementing Fully Persistent Arrays is by using a single instance of an in-memory Array and in conjunction use a tree of modifications. Instead of storing all the versions separately, Backer's trick allows us to compute any version of the array by replaying all the changes asked for.

The tree of modifications is an `n`

-ary tree that holds all the versions of the array by storing only the modifications made to the elements. Each version is derived from a parent version and the root points to the in-memory *cache* array which holds the initial version of it.

Each node of the tree holds three fields - `index`

, `value`

, and a pointer to the `parent`

, making this tree pointing upwards towards to root. Thus each node holds the changed `value`

, where did the change happen `index`

and on which version the change happened `parent`

.

Say we changed the element at the index `1`

of the array `9, 6, 3, 5, 1`

to `7`

we get array `9, 7, 3, 5, 1`

. The tree of modifications has 2 nodes one root node `a0`

pointing to the initial array, and another node `a1`

denoting the updated version.

The node `a1`

has 3 fields, `index`

set to `1`

, `value`

set to `7`

and `parent`

pointing to `a0`

. The node implies that it was derived from `a0`

by changing the value of the element at the index `1`

to `7`

. If we try to branch off `a0`

with another change say index `4`

set to value `9`

we would have 3 nodes in the tree. Thus we see how an update translates into just creating a new node and adding it at the right place in the tree.

Now we see with this design how we implement the three functions of an array `create`

, `update`

, and `get`

.

`create`

The `create`

function allocates a linear array of size `n`

to hold `n`

elements. This is a usual array allocation. While doing this we also create the root node of our tree of modifications. The root node, as established earlier, points to the *cache* array.

```
# The function creates a new persistent array of size `n`
def create(n: int) -> Array:
# 1. allocate in-memory cache
# 2. initialize the tree of modifications
# 3. make the root of the tree point to the cache
pass
```

The overall complexity of this operation is `O(n)`

space and `O(n)`

time.

`update`

The `update`

operation takes in the `index`

that needs to be updated, the `value`

, and the version of the array on which update is to be made.

```
# The function updates the element at the index `index` with value
# `value` on array `array` and returns the newly updated array
# keeping the old one accessible.
def update(array: Array, index: int, value: object) -> Array:
# 1. create a node in the tree and store index, the value in it
# 2. point this new node to the parent array
pass
```

To do this efficiently, we create a new node in the tree whose parent is set to the array version on which update is performed, index and value are set what was passed during invocation. Thus we see that the `update`

operation takes a constant `O(1)`

space and `O(1)`

time to create and represent a new version of the array.

With the update operation being made efficient we have to trade-off `get`

operation.

`get`

The `get`

operation takes in the `index`

that needs to be fetched and the version of the array from which the element is to be fetched. The `get`

operation seeks no extra space but takes time proportional to the distance between the array version and the root. In the worst case, this distance will be as long as the total number of versions of the array.

```
# The function fetches the element from index `index`
# from the array `array` and returns it.
def get(array: Array, index: int) -> object:
# 1. Start from the requested array and traverse to the root node.
# 2. Allocate a new register to store the requested value
# 3. During traversal, if the node.index == `index` update the
# register with the value.
# 4. return the value of the register
pass
```

The overall complexity of this operation is `O(1)`

space and `O(n)`

time.

We established that the update operation takes constant time and reads are expensive. If our system is write-heavy, then this is pretty handy but if the system has more reads then operation taking `O(n)`

time hampers the overall performance of the system. So as to optimize this use case we take a look at the operation called *Rerooting.*

The initial array (the first version of the array) has no significance to be the root forever. We can reroot the entire tree such that any child node could become the root and the value it points to - *cache* - represents the true copy of the array. Rerooting is a sequence of rotations to make the desired array version the root.

The algorithm for rerooting is a classic Backtracking algorithm that requires updates in all the nodes coming in the path from the old node to the new node.

The rerooting operation takes time proportional to the distance between the old and new root ~ `O(n)`

. Since the successive reads are happening on the same version the `get`

operation becomes `O(1)`

as well. Thus depending on the kind of usage of the system we can add rerooting step in either `get`

or `update`

operation.

If you like what you read subscribe you can always subscribe to my newsletter and get the post delivered straight to your inbox. I write essays on various engineering topics and share it through my weekly newsletter 👇

Mistaken Beliefs of Distributed Systems

In this essay, we learn about a set of false assumptions that programmers new to distributed applica...

17th JunThe most popular implementation of the LFU Cache Eviction Scheme, using a min-heap, implements all t...

23rd AugPersistent Data Structures - An Introduction

Persistent Data Structures allow us to hold multiple versions of a data structure at any given insta...

7th FebPhi φ Accrual Failure Detection

Phi φ Accrual Failure Detection algorithm, unlike conventional algorithms, is an adaptive failure de...

12th Jul